BDA Unit-2
UNIT-II
Syllabus: Introducing Technologies for Handling Big Data: Distributed and Parallel
Computing for Big Data, Introducing Hadoop, and Cloud Computing in Big Data.
Understanding the Hadoop Ecosystem: Hadoop Ecosystem, Hadoop Distributed File System,
MapReduce, Hadoop YARN, Hive, Pig, Sqoop, ZooKeeper, Flume, Oozie.
In this section, you will learn about the difference between parallel computing and distributed
computing. Before discussing the differences, it helps to understand each model on its own.
Parallel computing provides numerous advantages. It helps to increase CPU utilization and
improve performance because several processors work simultaneously. Moreover, the failure of
one CPU has no impact on the functionality of the other CPUs. However, if one processor needs
instructions or results from another, the communication between processors can introduce latency.
Advantages of Parallel Computing
1. It saves time and money because many resources working together cut down on both.
2. It can tackle larger problems that are difficult or impractical to solve with serial computing.
3. You can do many things at once using many computing resources.
4. Parallel computing is much better than serial computing for modeling, simulating, and
comprehending complicated real-world events.
Distributed computing also offers various benefits. It enables scalability and makes it
simpler to share resources, and it aids in the efficiency of computation processes.
Disadvantages of Distributed Computing
1. Data security and sharing are the main issues in distributed systems because of their
open-system characteristics.
2. Because of the distribution across multiple servers, troubleshooting and diagnostics are more
challenging.
3. Distributed computer systems also suffer from comparatively limited software support.
Here are the key differences between parallel computing and distributed computing:
1. Parallel computing is a sort of computation in which various tasks or processes are run at the same
time. In contrast, distributed computing is that type of computing in which the components are
located on various networked systems that interact and coordinate their actions by passing
messages to one another.
2. In parallel computing, processors communicate with one another via a shared bus. On the other
hand, computer systems in distributed computing communicate with one another via a network.
3. Parallel computing takes place on a single computer. In contrast, distributed computing takes
place on several computers.
4. Parallel computing aids in improving system performance. On the other hand, distributed
computing allows for scalability, resource sharing, and the efficient completion of computation
tasks.
5. The computer in parallel computing can have shared or distributed memory. In contrast, every
system in distributed computing has its own memory.
6. Multiple processors execute multiple tasks simultaneously in parallel computing. In contrast,
many computer systems execute tasks simultaneously in distributed computing (a small sketch of
the parallel case follows the comparison table below).
Communication: In parallel computing, the processors communicate with one another via a bus.
In distributed computing, the computer systems connect with one another via a network.
Functionality: In parallel computing, several processors execute various tasks simultaneously.
In distributed computing, several computer systems execute tasks simultaneously.
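To make the last difference concrete, here is a small Java sketch (an illustration only, not part of any Hadoop API) in which several cores of a single machine execute independent tasks simultaneously; the squaring task is an arbitrary assumption.

import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSquares {
    public static void main(String[] args) throws Exception {
        // One worker thread per available core: parallel computing on a single machine.
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());

        // The tasks run simultaneously and share the machine's memory.
        List<Callable<Long>> tasks = List.of(
                () -> square(10), () -> square(20), () -> square(30));

        for (Future<Long> result : pool.invokeAll(tasks)) {
            System.out.println(result.get());
        }
        pool.shutdown();
    }

    private static long square(long n) {
        return n * n;
    }
}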
Conclusion
There are two types of computations: parallel computing and distributed computing. Parallel
computing allows several processors to accomplish their tasks at the same time. In contrast,
distributed computing splits a single task among numerous systems to achieve a common goal.
Cloud Computing in Big Data
Examples
1. Retail Industry: The Power of Personalization
Think about a retail environment where product
recommendations seem uncannily accurate and marketing
campaigns speak to your soul. This is made possible by cloud-based
big data analytics. Retailers use these tools to process
immense volumes of customer information, such as
purchase history, browsing habits and social media
sentiment. They then apply this knowledge to:
Customize marketing campaigns: Higher conversion rates
and increased customer satisfaction are achieved through
targeted email blasts and social media ads that cater to
individual preferences.
Optimize product recommendations: Recommender
systems driven by big data analytics propose products
customers are likely to find interesting thereby increasing sales
and reducing cart abandonment rates.
Enhance inventory management: Retailers can optimize
their inventory levels by scrutinizing sales trends alongside
customer demand patterns which eliminates stockouts while
minimizing clearance sales.
2. Healthcare: From Diagnosis to Personalized Care
The healthcare industry has rapidly adopted cloud-based big data
analytics for better patient care and operational efficiency. Here’s
how:
Improved diagnosis: Healthcare providers can now diagnose
patients faster and more accurately by analyzing medical
records together with imaging scans and wearable-device
sensor data.
Individual treatment plans: Big data analytics makes it
possible to create individualized treatment plans through
identification of factors affecting response to certain drugs or
therapies.
Predictive prevention care: Through cloud based analytics it
is possible to identify people at high risk of particular illnesses
before they actually occur thus leading to better outcomes for
patients and lower healthcare expenses.
3. Financial Services: Risk Management & Fraud
Detection
Effectively managing risks and making informed decisions are
crucial in the ever-changing banking industry. Here’s how
financial companies can use big data analytics in the cloud:
Identify fraudulent activity: By using advanced algorithms
to make sense of real-time transaction patterns, banks are able
to detect and prevent fraudulent transactions from taking
place, thereby protecting both themselves and customers.
Evaluate credit risk: By checking borrowers’ financial
histories against other types of relevant data points, lenders
can make better choices concerning approvals on loans and
interest rates hence reducing credit risk.
Develop cutting-edge financial products: Banks can use
big data analytics to craft unique financial products for
different market segments as they continue studying their
clients’ desires and preferences.
These are only a few instances of the current industry
transformations brought about by cloud-based big data analytics.
It is inevitable that as technology advances and data quantities
expand, more inventive applications will surface, enabling
businesses to obtain more profound insights, make fact-based
decisions, and accomplish remarkable outcomes.
The Future Of Cloud Computing And Big Data Analytics
The future of big data analysis is directly related to that of cloud
computing. The significance of cloud platforms will only increase
as enterprises grapple with information overload and seek deeper
insights. The following are some tendencies to watch out for:
Hybrid and Multi-Cloud Environments: Companies will increasingly adopt hybrid and
multi-cloud approaches tailored to their unique needs, taking advantage of the specific
capabilities offered by different providers.
Serverless Computing: Businesses will increasingly adopt serverless computing
because it frees teams from managing the underlying infrastructure so they can
concentrate on analytics.
Integration Of AI & ML: Cloud platforms will seamlessly
integrate artificial intelligence (AI) alongside machine
learning (ML) functionalities thus enabling advanced
analytics as well as automated decision making.
Emphasis on Data Governance and Privacy: To keep pace
with shifting rules on data security and privacy, businesses will
need more advanced means of governing their information,
which cloud providers can supply.
Conclusion
Cloud computing has become the bedrock of big data analytics; it
is inexpensive, flexible, secure, and capable of accommodating
large quantities of information that companies can use to make
sense of what’s going on around them. As cloud technology
and big data analytics continue to evolve, we can expect
even more powerful tools and services to emerge,
enabling organizations to unlock the true potential of their
data and make data-driven decisions that fuel innovation
and success.
With growing data velocity, the data size easily outgrows the
storage limit of a single machine. A solution is to store the data
across a network of machines. Such filesystems are
called distributed filesystems. Since data is stored across a
network, all the complications of a network come into play.
This is where Hadoop comes in. It provides one of the most
reliable filesystems. HDFS (Hadoop Distributed File System) is a
unique design that provides storage for extremely large files with
a streaming data access pattern, and it runs on commodity
hardware. Let’s elaborate on these terms:
Extremely large files: Here we are talking about data in the
range of petabytes (1 PB = 1000 TB).
Streaming data access pattern: HDFS is designed on the
principle of write-once, read-many-times. Once data is
written, large portions of the dataset can be processed any
number of times.
Commodity hardware: Hardware that is inexpensive and
easily available in the market. This is one of the features that
especially distinguishes HDFS from other file systems.
Nodes: Master and slave nodes typically form the HDFS cluster.
1. NameNode (master node):
Manages all the slave nodes and assigns work to them.
It executes filesystem namespace operations like opening,
closing, and renaming files and directories.
It should be deployed on reliable, high-end hardware,
not on commodity hardware.
2. DataNode (slave node):
These are the actual worker nodes, which do the actual work like reading,
writing, and processing.
They also perform creation, deletion, and replication upon
instruction from the master.
They can be deployed on commodity hardware.
HDFS daemons: Daemons are the processes running in the
background.
NameNode:
o Runs on the master node.
o Stores metadata (data about data) like file paths, the
number of blocks, block IDs, etc.
o Requires a large amount of RAM.
o Stores metadata in RAM for fast retrieval, i.e., to reduce
seek time, though a persistent copy is kept on
disk.
DataNodes:
o Run on the slave nodes.
o Require large storage capacity, as the actual data is stored here.
Data storage in HDFS: Now let’s see how the data is stored in a
distributed manner.
Suppose a 100 TB file is inserted. The master node (namenode) first
divides the file into blocks; for illustration assume blocks of 10 TB each
(the actual default block size is 128 MB in Hadoop 2.x and above). These
blocks are then stored across different datanodes (slave nodes).
The datanodes replicate the blocks among themselves, and the
information about which blocks they contain is sent to the master.
The default replication factor is 3, meaning three copies of each block
exist in the cluster (the original included). In hdfs-site.xml we can increase or
decrease the replication factor, i.e., we can edit this configuration
there (see the sketch below).
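A minimal illustration of that setting in hdfs-site.xml (the value 3 shown is simply the default; adjust per cluster):

<configuration>
  <property>
    <name>dfs.replication</name>   <!-- number of copies kept for each block -->
    <value>3</value>
  </property>
</configuration>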
Note: The master node has a record of everything; it knows the
location and details of each and every datanode and the
blocks they contain, i.e., nothing is done without the knowledge of
the master node.
Why divide the file into blocks?
Answer: Let’s assume that we don’t divide the file. It is very difficult
to store a 100 TB file on a single machine. Even if we could store it,
each read and write operation on that whole file would take a
very long seek time. But if we have multiple blocks of size 128 MB,
then it becomes easier to perform various read and write
operations on them compared to doing it on the whole file at once. So
we divide the file to get faster data access, i.e., to reduce seek
time.
Why replicate the blocks in data nodes while storing?
Answer: Let’s assume we don’t replicate, and only one copy of a
block is present on datanode D1. Now if datanode D1 crashes,
we lose that block, which makes the overall data
incomplete and faulty. So we replicate the blocks to
achieve fault tolerance.
Terms related to HDFS:
Heartbeat: The signal that a datanode continuously sends
to the namenode. If the namenode does not receive a heartbeat from a
datanode, it considers that datanode dead.
Balancing: If a datanode crashes, the blocks present on it are
gone too, and those blocks become under-replicated
compared to the remaining blocks. The master
node (namenode) then signals the datanodes containing
replicas of the lost blocks to replicate them, so that the overall
distribution of blocks stays balanced.
Replication: It is carried out by the datanodes.
Note: No two replicas of the same block are stored on the same
datanode.
Features:
Distributed data storage.
Blocks reduce seek time.
The data is highly available as the same block is present at
multiple datanodes.
Even if multiple datanodes are down we can still do our work,
thus making it highly reliable.
High fault tolerance.
Limitations: Though HDFS provides many features, there are
some areas where it doesn’t work well.
Low-latency data access: Applications that require low-latency
access to data, i.e., in the range of milliseconds, will not
work well with HDFS, because HDFS is designed for
high throughput of data even at the cost of
latency.
Small file problem: Having lots of small files results in lots
of seeks and lots of hops from one datanode to another
to retrieve each small file, which is a
very inefficient data access pattern.
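To make the write-once, read-many access pattern concrete, here is a minimal sketch using the Hadoop FileSystem Java API. The NameNode URI and the file path are assumptions made purely for illustration.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; in practice this comes from core-site.xml (fs.defaultFS).
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/example.txt");   // hypothetical path

        // Write once: the file is split into blocks and replicated across DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read many times: the client streams the blocks back from the DataNodes.
        try (FSDataInputStream in = fs.open(file)) {
            byte[] buf = new byte[32];
            int n = in.read(buf);
            System.out.println(new String(buf, 0, n, StandardCharsets.UTF_8));
        }
        fs.close();
    }
}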
HADOOP ECOSYSTEM
Overview: Apache Hadoop is an open-source framework intended to make interaction
with big data easier. For those who are not acquainted with this technology, one
question arises: what is big data? Big data is a term given to data sets that can’t
be processed efficiently with traditional methodologies such as an
RDBMS. Hadoop has made its place in the industries and companies that need to work on
large data sets which are sensitive and need efficient handling. Hadoop is a framework that
enables processing of large data sets which reside across clusters of machines. Being a
framework, Hadoop is made up of several modules that are supported by a large ecosystem
of technologies.
Introduction: The Hadoop ecosystem is a platform or suite that provides various services
to solve big data problems. It includes Apache projects as well as various commercial tools
and solutions. There are four major elements of Hadoop, i.e., HDFS, MapReduce, YARN,
and the Hadoop Common utilities. Most of the other tools or solutions are used to supplement or
support these major elements. All these tools work collectively to provide services such as
ingestion, analysis, storage, and maintenance of data.
Following are the components that collectively form a Hadoop ecosystem:
HDFS: Hadoop Distributed File System
YARN: Yet Another Resource Negotiator
MapReduce: Programming based Data Processing
Spark: In-Memory data processing
PIG, HIVE: Query based processing of data services
HBase: NoSQL Database
Mahout, Spark MLLib: Machine Learning algorithm libraries
Solr, Lucene: Searching and Indexing
Zookeeper: Managing cluster
Oozie: Job Scheduling
Note: Apart from the above-mentioned components, there are many other components too
that are part of the Hadoop ecosystem.
All these toolkits or components revolve around one thing, i.e., data. That is the beauty of
Hadoop: it revolves around data, which makes processing and analyzing that data easier.
HDFS:
HDFS is the primary or major component of Hadoop ecosystem and is responsible for
storing large data sets of structured or unstructured data across various nodes and
thereby maintaining the metadata in the form of log files.
HDFS consists of two core components i.e.
1. Name node
2. Data Node
The NameNode is the prime node; it contains metadata (data about data) and requires
comparatively fewer resources than the DataNodes that store the actual data. These data
nodes are commodity hardware in the distributed environment, which undoubtedly makes
Hadoop cost-effective.
HDFS maintains all the coordination between the clusters and hardware, thus working
at the heart of the system.
YARN:
Yet Another Resource Negotiator, as the name implies, YARN is the one who helps to
manage the resources across the clusters. In short, it performs scheduling and resource
allocation for the Hadoop System.
It consists of three major components, i.e.,
1. Resource Manager
2. Node Manager
3. Application Manager
The Resource Manager has the privilege of allocating resources for the applications in the
system, whereas Node Managers work on the allocation of resources such as CPU,
memory, and bandwidth per machine and later acknowledge the Resource Manager. The
Application Manager works as an interface between the Resource Manager and Node
Manager and performs negotiations as required by the two.
MapReduce:
By making use of distributed and parallel algorithms, MapReduce makes it possible
to push the processing logic out to the data and helps to write applications that transform big
data sets into manageable ones.
MapReduce uses two functions, i.e., Map() and Reduce(), whose tasks are:
1. Map() performs sorting and filtering of the data and thereby organizes it into
groups. Map generates key-value pairs as its result, which are later
processed by the Reduce() method.
2. Reduce(), as the name suggests, performs summarization by aggregating the mapped
data. In simple terms, Reduce() takes the output generated by Map() as input and
combines those tuples into a smaller set of tuples (see the word-count sketch below).
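As an illustration of the Map() and Reduce() roles described above, here is the classic word-count example sketched with the Hadoop MapReduce Java API; the input and output paths are taken from the command line and are assumptions of the example.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map(): emits a (word, 1) key-value pair for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce(): aggregates the mapped pairs by summing the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}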
PIG:
Pig was originally developed by Yahoo. It works with the Pig Latin language, a
query-based language similar to SQL.
It is a platform for structuring the data flow and for processing and analyzing huge data sets.
Pig does the work of executing commands, and in the background all the activities of
MapReduce are taken care of. After the processing, Pig stores the result in HDFS.
The Pig Latin language is specially designed for this framework, and it runs on the Pig
Runtime, just the way Java runs on the JVM.
Pig helps to achieve ease of programming and optimization and hence is a major
segment of the Hadoop ecosystem.
HIVE:
With the help of an SQL-like methodology and interface, HIVE performs reading and writing
of large data sets. Its query language is called HQL (Hive Query
Language).
It is highly scalable, as it allows both real-time and batch processing. Also,
all the SQL data types are supported by Hive, making query processing easier.
Like other query-processing frameworks, HIVE comes with two
components: JDBC drivers and the HIVE command line.
JDBC, along with ODBC drivers, establishes data storage permissions and
connections, whereas the HIVE command line helps in the processing of queries.
Mahout:
Mahout brings machine-learning capability to a system or application. Machine learning, as
the name suggests, helps a system to improve itself based on patterns,
user/environment interaction, or algorithms.
It provides various libraries and functionalities such as collaborative filtering, clustering,
and classification, which are core concepts of machine learning. It allows
invoking algorithms as per our need with the help of its own libraries.
Apache Spark:
It is a platform that handles all the compute-intensive tasks like batch processing,
interactive or iterative real-time processing, graph processing, visualization, etc.
It uses in-memory resources and is therefore faster than the earlier disk-based approach in
terms of optimization.
Spark is best suited for real-time data, whereas Hadoop MapReduce is best suited for structured
data or batch processing; hence both are used in most companies, often side by side.
Apache HBase:
It is a NoSQL database that supports all kinds of data and is thus capable of handling
any kind of data inside a Hadoop database. It provides the capabilities of Google’s BigTable
and is thus able to work on big data sets effectively.
When we need to search for or retrieve a few small records in a
huge database, the request must be processed within a short span of time. At such
times, HBase comes in handy, as it gives us a fault-tolerant way of storing and quickly reading sparse data.
Other Components: Apart from all of these, there are some other components too that
carry out a huge task in order to make Hadoop capable of processing large datasets. They
are as follows:
Solr, Lucene: These are two services that perform the task of searching and
indexing with the help of Java libraries. Lucene in particular is based on Java and
also provides a spell-check mechanism. Solr is built on top of Lucene.
Zookeeper: There was a huge issue with the management of coordination and
synchronization among the resources and components of Hadoop, which often resulted in
inconsistency. Zookeeper overcame these problems by performing
synchronization, inter-component communication, grouping, and maintenance.
Oozie: Oozie simply performs the task of a scheduler, scheduling jobs and binding
them together as a single unit. There are two kinds of jobs, i.e., Oozie workflow and Oozie
coordinator jobs. Oozie workflow jobs are those that need to be executed in a sequentially
ordered manner, whereas Oozie coordinator jobs are those that are triggered when some
data or external stimulus is given to them.
YARN stands for “Yet Another Resource Negotiator”. It was introduced in Hadoop 2.0 to
remove the bottleneck of the JobTracker, which was present in Hadoop 1.0. YARN was
described as a “redesigned resource manager” at the time of its launch, but it has since
evolved into what is known as a large-scale distributed operating system for big data
processing.
The YARN architecture separates the resource management layer from the processing layer.
With YARN, the responsibilities of the Hadoop 1.0 JobTracker are split between the resource
manager and the application manager.
YARN also allows different data processing engines like graph processing, interactive
processing, stream processing as well as batch processing to run and process data stored in
HDFS (Hadoop Distributed File System) thus making the system much more efficient.
Through its various components, it can dynamically allocate various resources and schedule
the application processing. For large volume data processing, it is quite necessary to manage
the available resources properly so that every application can leverage them.
YARN Features: YARN gained popularity because of the following features-
Scalability: The scheduler in Resource manager of YARN architecture allows Hadoop to
extend and manage thousands of nodes and clusters.
Compatibility: YARN supports the existing map-reduce applications without disruptions
thus making it compatible with Hadoop 1.0 as well.
Cluster Utilization: YARN supports dynamic utilization of the cluster in Hadoop,
which enables optimized cluster utilization.
Multi-tenancy: It allows multiple engine access thus giving organizations a benefit of
multi-tenancy.
Advantages :
Flexibility: YARN offers flexibility to run various types of distributed processing
systems such as Apache Spark, Apache Flink, Apache Storm, and others. It allows
multiple processing engines to run simultaneously on a single Hadoop cluster.
Resource Management: YARN provides an efficient way of managing resources in the
Hadoop cluster. It allows administrators to allocate and monitor the resources required by
each application in a cluster, such as CPU, memory, and disk space.
Scalability: YARN is designed to be highly scalable and can handle thousands of nodes
in a cluster. It can scale up or down based on the requirements of the applications running
on the cluster.
Improved Performance: YARN offers better performance by providing a centralized
resource management system. It ensures that the resources are optimally utilized, and
applications are efficiently scheduled on the available resources.
Security: YARN provides robust security features such as Kerberos authentication,
Secure Shell (SSH) access, and secure data transmission. It ensures that the data stored
and processed on the Hadoop cluster is secure.
Apache Hive
Prerequisites – Introduction to Hadoop, Computing Platforms and Technologies
Apache Hive is a data warehouse and ETL tool (ETL stands for "extract, transform, and
load": a process that combines data from multiple sources into a single repository, such as
a data warehouse, data store, or data lake) which provides an SQL-like interface
between the user and the Hadoop Distributed File System (HDFS) that underpins Hadoop.
It is built on top of Hadoop. It is a software project that provides data query and analysis. It
facilitates reading, writing, and handling large datasets that are stored in distributed storage and
queried with a Structured Query Language (SQL)-like syntax. It is not built for Online Transactional
Processing (OLTP) workloads. It is frequently used for data warehousing tasks like data
encapsulation, ad-hoc queries, and analysis of huge datasets. It is designed to enhance
scalability, extensibility, performance, fault tolerance, and loose coupling with its input
formats.
Hive was initially developed by Facebook and is now used and developed by companies such as
Amazon and Netflix. It delivers standard SQL functionality for analytics. Without Hive,
SQL-style queries would have to be written against the MapReduce Java API to run
over distributed data. Hive provides portability, as most data warehousing applications
work with SQL-based query languages.
Apache Hive is a data warehouse software project that is built on top of the Hadoop
ecosystem. It provides an SQL-like interface to query and analyze large datasets stored in
Hadoop’s distributed file system (HDFS) or other compatible storage systems.
Hive uses a language called HiveQL, which is similar to SQL, to allow users to express
data queries, transformations, and analyses in a familiar syntax. HiveQL statements are
compiled into MapReduce jobs, which are then executed on the Hadoop cluster to process
the data.
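For example, a HiveQL query can be submitted from Java through the HiveServer2 JDBC driver, as in the sketch below; the connection URL, credentials, and the sales table are assumptions made for illustration (the hive-jdbc driver must be on the classpath).

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQlExample {
    public static void main(String[] args) throws Exception {
        // Assumed HiveServer2 endpoint and database.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection con = DriverManager.getConnection(url, "hive", "");
             Statement stmt = con.createStatement()) {
            // HiveQL looks like SQL; behind the scenes it is compiled into MapReduce jobs.
            ResultSet rs = stmt.executeQuery(
                    "SELECT region, COUNT(*) AS orders FROM sales GROUP BY region");
            while (rs.next()) {
                System.out.println(rs.getString("region") + "\t" + rs.getLong("orders"));
            }
        }
    }
}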
Hive includes many features that make it a useful tool for big data analysis, including
support for partitioning, indexing, and user-defined functions (UDFs). It also provides a
number of optimization techniques to improve query performance, such as predicate
pushdown, column pruning, and query parallelization.
Hive can be used for a variety of data processing tasks, such as data warehousing, ETL
(extract, transform, load) pipelines, and ad-hoc data analysis. It is widely used in the big
data industry, especially in companies that have adopted the Hadoop ecosystem as their
primary data processing platform.
Components of Hive:
1. HCatalog –
It is a Hive component and serves as a table and storage management layer for Hadoop. It
enables users of different data processing tools, such as Pig and MapReduce, to
read and write data on the grid easily.
2. WebHCat –
It provides a service that users can call to run Hadoop MapReduce (or
YARN), Pig, and Hive jobs, or to perform Hive metadata operations, through an HTTP interface.
Modes of Hive:
1. Local Mode –
It is used when Hadoop is installed in pseudo mode with only one data node,
when the data is small enough to be restricted to a single local machine, and when
processing will be faster on such smaller datasets residing on the local machine.
2. MapReduce Mode –
It is used when Hadoop is built with multiple data nodes and the data is divided across
the various nodes. It functions on huge datasets, queries are executed in parallel, and it
achieves enhanced performance when processing large datasets.
Characteristics of Hive:
1. Databases and tables are built before loading the data.
2. Hive as data warehouse is built to manage and query only structured data which is
residing under tables.
3. When handling structured data, MapReduce lacks optimization and usability
features such as UDFs, whereas the Hive framework provides both.
4. Programming in Hadoop deals directly with files. Hive can therefore partition the data
using directory structures to improve performance on certain queries.
5. Hive is compatible with various file formats such as TEXTFILE,
SEQUENCEFILE, ORC, RCFILE, etc.
6. Hive uses the Derby database for single-user metadata storage and MySQL for
multi-user or shared metadata.
Features of Hive:
1. It provides indexes, including bitmap indexes, to accelerate queries (compaction and
bitmap index types are available as of version 0.10).
2. Metadata storage in an RDBMS reduces the time needed to perform semantic checks during
query execution.
3. Built-in user-defined functions (UDFs) manipulate strings, dates, and other
data-mining values. Hive also supports extending the UDF set to deal with use cases not
covered by the predefined functions.
4. DEFLATE, BWT, Snappy, etc. are algorithms with which Hive can operate on compressed
data stored in the Hadoop ecosystem.
5. It stores schemas in a database and processes the data in the Hadoop Distributed
File System (HDFS).
6. It is built for Online Analytical Processing (OLAP).
7. It delivers a querying language frequently known as Hive
Query Language (HQL or HiveQL).
Advantages:
Scalability: Apache Hive is designed to handle large volumes of data, making it a scalable
solution for big data processing.
Familiar SQL-like interface: Hive uses a SQL-like language called HiveQL, which makes
it easy for SQL users to learn and use.
Integration with Hadoop ecosystem: Hive integrates well with the Hadoop ecosystem,
enabling users to process data using other Hadoop tools like Pig, MapReduce, and Spark.
Supports partitioning and bucketing: Hive supports partitioning and bucketing, which
can improve query performance by limiting the amount of data scanned.
User-defined functions: Hive allows users to define their own functions, which can be
used in HiveQL queries.
Disadvantages:
Limited real-time processing: Hive is designed for batch processing, which means it may
not be the best tool for real-time data processing.
Slow performance: Hive can be slower than traditional relational databases because it is
built on top of Hadoop, which is optimized for batch processing rather than interactive
querying.
Steep learning curve: While Hive uses a SQL-like language, it still requires users to have
knowledge of Hadoop and distributed computing, which can make it difficult for beginners
to use.
Limited flexibility: Hive is not as flexible as other data warehousing tools because it is
designed to work specifically with Hadoop, which can limit its usability in other
environments.
Apache Pig
Pig represents big data as data flows. Pig is a high-level platform or tool used to
process large datasets. It provides a high level of abstraction over
MapReduce. It provides a high-level scripting language, known as Pig Latin, which is used to
develop data analysis code. To process data stored in HDFS, the
programmers write scripts using the Pig Latin language. Internally, the Pig Engine (a
component of Apache Pig) converts all these scripts into map and reduce tasks, but
these are not visible to the programmers, which provides a high level of abstraction. Pig
Latin and the Pig Engine are the two main components of the Apache Pig tool. The results of
Pig are always stored in HDFS.
Note: The Pig Engine has two types of execution environment, i.e., a local execution
environment in a single JVM (used when the dataset is small) and a distributed execution
environment on a Hadoop cluster.
Need for Pig: One limitation of MapReduce is that the development cycle is very long.
Writing the mapper and reducer, compiling and packaging the code, submitting the job, and
retrieving the output is a time-consuming task. Apache Pig reduces the development time
using its multi-query approach. Pig is also beneficial for programmers who do not come
from a Java background: 200 lines of Java code can often be written in only about 10 lines of
Pig Latin. Programmers who have SQL knowledge need less effort to learn Pig Latin, and its
query-based approach reduces the length of the code (see the sketch below).
Pig Latin is an SQL-like language.
It provides many built-in operators.
It provides nested data types (tuples, bags, maps).
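As an illustration of this shorter development cycle, the sketch below embeds a few lines of Pig Latin in Java through Pig's PigServer API, running in local mode. The input file name and its field layout are assumptions made for the example.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigLatinExample {
    public static void main(String[] args) throws Exception {
        // Local execution environment; ExecType.MAPREDUCE would run on a Hadoop cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // A few lines of Pig Latin replace a much longer hand-written MapReduce job.
        pig.registerQuery("logs = LOAD 'access_log.txt' AS (user:chararray, bytes:long);");
        pig.registerQuery("grouped = GROUP logs BY user;");
        pig.registerQuery("totals = FOREACH grouped GENERATE group, SUM(logs.bytes);");

        // Store the result; on a cluster the output directory would live in HDFS.
        pig.store("totals", "user_totals");
        pig.shutdown();
    }
}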
Evolution of Pig: Apache Pig was developed by Yahoo’s researchers in 2006. At
that time, the main idea behind Pig was to execute MapReduce jobs on extremely
large datasets. In 2007 it moved to the Apache Software Foundation (ASF), which made
it an open-source project. The first version (0.1) of Pig came out in 2008, and the most
recent release line dates from 2017.
Features of Apache Pig:
For performing several operations, Apache Pig provides a rich set of operators for
filtering, joining, sorting, aggregation, etc.
It is easy to learn, read, and write; especially for SQL programmers, Apache Pig is a boon.
Apache Pig is extensible, so you can write your own processing steps and user-defined
functions (UDFs) in Python, Java, or other programming languages.
Join operation is easy in Apache Pig.
Fewer lines of code.
Apache Pig allows splits in the pipeline.
By integrating with other components of the Apache Hadoop ecosystem, such as Apache
Hive, Apache Spark, and Apache ZooKeeper, Apache Pig enables users to take advantage
of these components’ capabilities while transforming data.
The data structure is multivalued, nested, and richer.
Pig can handle the analysis of both structured and unstructured data.
SQOOP ARCHITECTURE
The operations that take place in Sqoop are user-friendly. Sqoop uses a
command-line interface to process user commands; alternatively, it can
use Java APIs to interact with the user. When it receives a command from
the user, it is handled by Sqoop, and then further processing of the command takes
place. Sqoop is only able to import and export data based on user
commands; it is not able to perform aggregation of data.
Sqoop works in the following manner: it first parses the arguments
provided by the user on the command-line interface and then sends those arguments to a
further stage where they are prepared for a map-only job. Once the map stage receives the
arguments, it launches multiple mappers, with the number defined by
the user as a command-line argument. For the import
command, each mapper task is assigned a respective part of the data to be imported, on the
basis of a key defined by the user on the command line. To increase the efficiency of the
process, Sqoop uses parallel processing, in which the data is distributed equally
among all mappers. After this, each mapper creates an individual connection with the
database using the Java database connectivity (JDBC) model and then fetches the individual
part of the data assigned to it by Sqoop. Once the data is fetched, it is written to HDFS,
HBase, or Hive on the basis of the arguments provided on the command line. Thus the Sqoop
import process is completed.
The export process in Sqoop is performed in a similar way. The Sqoop export tool
performs the operation by transferring a set of files from the Hadoop distributed file system
back to the relational database management system. The files which are given as input
during the import process are called records. When the user submits a job, it is
mapped into map tasks that bring the files of data from the Hadoop data store, and these data
files are exported to a structured data destination in the form of a relational database
management system such as MySQL, SQL Server, or Oracle.
Let us now understand the two main operations in detail:
Sqoop Import:
The sqoop import command implements this operation. With the help of the import
command, we can import a table from a relational database management system into the
Hadoop cluster. Records are stored in text files in HDFS, and each row
is imported as a separate record. We can also create, load, and
partition tables in Hive while importing data. Sqoop also supports incremental import, which
means that if we have already imported a database and want to add some more rows, with the
help of this feature we can add only the new rows to the existing data rather than re-import the
complete database (a sample invocation follows below).
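A typical invocation looks like the following sketch; the connection string, credentials file, table name, and target directory are assumptions made for illustration.

sqoop import \
  --connect jdbc:mysql://dbserver:3306/sales \
  --username analyst \
  --password-file /user/analyst/.dbpass \
  --table orders \
  --target-dir /data/orders \
  --num-mappers 4

The corresponding export run uses sqoop export with --export-dir pointing at the HDFS directory whose records should be written back to the RDBMS table.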
Sqoop Export:
The sqoop export command implements this operation, which works as the reverse of the
import process. With the help of the export
command, we can transfer data from the Hadoop distributed file system to a relational
database management system. The data to be exported is processed into records
before the operation is completed. The export of data is done in two steps: the first is to
examine the database for metadata, and the second step involves the migration of the data.
Here you can get the idea of how the import and export operation is performed in Hadoop
with the help of Sqoop.
Advantages of Sqoop :
With the help of Sqoop, we can perform transfer operations of data with a variety of
structured data stores like Oracle, Teradata, etc.
Sqoop helps us to perform ETL operations in a very fast and cost-effective manner.
With the help of Sqoop, we can perform parallel processing of data, which speeds up
the overall process.
Sqoop uses the MapReduce mechanism for its operations which also supports fault
tolerance.
Disadvantages of Sqoop :
A failure that occurs during the operation requires a special solution to
handle the problem.
Sqoop uses a JDBC connection to communicate with the relational database
management system, which can be an inefficient approach.
The performance of the Sqoop export operation depends upon the hardware configuration
of the relational database management system.
Apache Zookeeper
Architecture of Zookeeper
The ZooKeeper architecture consists of a hierarchy of nodes called znodes, organized in a
tree-like structure. Each znode can store data and has a set of permissions that control access
to the znode. The znodes are organized in a hierarchical namespace, similar to a file system.
At the root of the hierarchy is the root znode, and all other znodes are children of the root
znode. The hierarchy is similar to a file system hierarchy, where each znode can have
children and grandchildren, and so on.
Zookeeper is used to manage and coordinate the nodes in a Hadoop cluster, including the
NameNode, DataNode, and ResourceManager. In a Hadoop cluster, Zookeeper helps to:
Maintain configuration information: Zookeeper stores the configuration information for
the Hadoop cluster, including the location of the NameNode, DataNode, and
ResourceManager.
Manage the state of the cluster: Zookeeper tracks the state of the nodes in the Hadoop
cluster and can be used to detect when a node has failed or become unavailable.
Coordinate distributed processes: Zookeeper can be used to coordinate distributed
processes, such as job scheduling and resource allocation, across the nodes in a Hadoop
cluster.
Zookeeper helps to ensure the availability and reliability of a Hadoop cluster by providing a
central coordination service for the nodes in the cluster.
ZooKeeper operates as a distributed file system and exposes a simple set of APIs that enable
clients to read and write data to the file system. It stores its data in a tree-like structure called
a znode, which can be thought of as a file or a directory in a traditional file system.
ZooKeeper uses a consensus algorithm to ensure that all of its servers have a consistent view
of the data stored in the Znodes. This means that if a client writes data to a znode, that data
will be replicated to all of the other servers in the ZooKeeper ensemble.
One important feature of ZooKeeper is its ability to support the notion of a “watch.” A watch
allows a client to register for notifications when the data stored in a znode changes. This can
be useful for monitoring changes to the data stored in ZooKeeper and reacting to those
changes in a distributed system.
In Hadoop, ZooKeeper is used for a variety of purposes, including:
Storing configuration information: ZooKeeper is used to store configuration information
that is shared by multiple Hadoop components. For example, it might be used to store the
locations of NameNodes in a Hadoop cluster or the addresses of JobTracker nodes.
Providing distributed synchronization: ZooKeeper is used to coordinate the activities of
various Hadoop components and ensure that they are working together in a consistent
manner. For example, it might be used to ensure that only one NameNode is active at a
time in a Hadoop cluster.
Maintaining naming: ZooKeeper is used to maintain a centralized naming service for
Hadoop components. This can be useful for identifying and locating resources in a
distributed system.
ZooKeeper is an essential component of Hadoop and plays a crucial role in coordinating the
activity of its various subcomponents.
ZooKeeper provides a simple and reliable interface for reading and writing data. The data is
stored in a hierarchical namespace, similar to a file system, with nodes called znodes. Each
znode can store data and have children znodes. ZooKeeper clients can read and write data to
these znodes by using the getData() and setData() methods, respectively. Here is an example
of reading and writing data using the ZooKeeper Java API:
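The minimal sketch below assumes a ZooKeeper ensemble reachable at localhost:2181 and uses a hypothetical znode path /app_config; both are assumptions made for illustration.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkReadWrite {
    public static void main(String[] args) throws Exception {
        // Connect to the ensemble with a 3-second session timeout and no default watcher.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });

        String path = "/app_config";
        byte[] data = "v1".getBytes();

        // Create the znode if it does not exist yet, otherwise overwrite its data.
        if (zk.exists(path, false) == null) {
            zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        } else {
            zk.setData(path, data, -1);   // version -1 means "match any version"
        }

        // Read the data back; passing true instead of false would also register a watch.
        String readData = new String(zk.getData(path, false, null));
        System.out.println(readData);
        zk.close();
    }
}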
Session
Requests in a session are executed in FIFO order.
Once the session is established, a session ID is assigned to the client.
The client sends heartbeats to keep the session valid.
The session timeout is usually represented in milliseconds.
Watches
Watches are a mechanism for clients to get notifications about changes in
ZooKeeper.
A client can set a watch while reading a particular znode.
Znode changes are modifications of the data associated with a znode or changes in the
znode’s children.
Watches are triggered only once.
If the session expires, its watches are removed as well.
What is Flume?
Apache Flume is a tool/service/data-ingestion mechanism for collecting, aggregating, and
transporting large amounts of streaming data, such as log files and events, from various
sources to a centralized data store.
Assume an e-commerce web application wants to analyze customer behavior from a
particular region. To do so, it would need to move the available log data into Hadoop for
analysis. Here, Apache Flume comes to the rescue.
Flume is used to move the log data generated by application servers into HDFS at a higher
speed (see the configuration sketch below).
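As a hedged illustration only, such a data flow is usually described in a Flume agent configuration file (a Java-style properties file); the agent name, log path, and HDFS path below are assumptions.

# Sketch of a Flume agent: tail a log file (source) -> memory channel -> HDFS (sink)
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = sink1

agent1.sources.src1.type     = exec
agent1.sources.src1.command  = tail -F /var/log/app/access.log
agent1.sources.src1.channels = ch1

agent1.channels.ch1.type     = memory
agent1.channels.ch1.capacity = 10000

agent1.sinks.sink1.type      = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://namenode:9000/flume/logs
agent1.sinks.sink1.channel   = ch1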
Advantages of Flume
Using Apache Flume we can store the data in to any of the centralized stores (HBase,
HDFS).
When the rate of incoming data exceeds the rate at which data can be written to the
destination, Flume acts as a mediator between data producers and the centralized
stores and provides a steady flow of data between them.
Flume provides the feature of contextual routing.
The transactions in Flume are channel-based where two transactions (one sender and
one receiver) are maintained for each message. It guarantees reliable message
delivery.
Flume is reliable, fault tolerant, scalable, manageable, and customizable.
Features of Flume
Flume ingests log data from multiple web servers into a centralized store (HDFS,
HBase) efficiently.
Using Flume, we can get the data from multiple servers immediately into Hadoop.
Along with the log files, Flume is also used to import huge volumes of event data
produced by social networking sites like Facebook and Twitter, and e-commerce
websites like Amazon and Flipkart.
Flume supports a large set of source and destination types.
Flume supports multi-hop flows, fan-in fan-out flows, contextual routing, etc.
Flume can be scaled horizontally.
Apache Oozie
Apache Oozie is a scheduler system used to run and manage Hadoop jobs. It supports three
types of jobs:
Oozie Workflow Jobs − These are represented as Directed Acyclic Graphs (DAGs)
to specify a sequence of actions to be executed.
Oozie Coordinator Jobs − These consist of workflow jobs triggered by time and data
availability.
Oozie Bundle − This can be referred to as a package of multiple coordinator and
workflow jobs.
A sample workflow combines control nodes (Start, Decision, Fork, Join, and End) with action
nodes (Hive, Shell, Pig).
A workflow will always start with a start tag and end with an end tag.
Apache Oozie is used by Hadoop system administrators to run complex log analysis
on HDFS. Hadoop Developers use Oozie for performing ETL operations on data in a
sequential order and saving the output in a specified format (Avro, ORC, etc.) in HDFS.
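A skeletal workflow.xml illustrating the start tag, a single action, and the end tag might look like the following sketch; the workflow name, the Pig action, and the schema version are assumptions made for illustration.

<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="run-pig"/>

    <action name="run-pig">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>cleanup.pig</script>
        </pig>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Pig action failed</message>
    </kill>
    <end name="end"/>
</workflow-app>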
Oozie Editors
Before we dive into Oozie, let’s have a quick look at the available editors for Oozie.
Most of the time you won’t need an editor and will write the workflows using any popular
text editor (like Notepad++, Sublime, or Atom), as we will be doing in this tutorial.
But as a beginner it makes sense to create a workflow with the drag-and-drop method
in an editor and then see how the workflow gets generated, and to map the GUI to the
actual workflow.xml created by the editor. This is the only section where we discuss
Oozie editors; we won’t use them in the rest of this tutorial.
The most popular among Oozie editors is Hue.
This editor is very handy to use and is available with almost all Hadoop vendors’ solutions.
http://gethue.com/new-apache-oozie-workflow-coordinator-bundle-editors/
Oozie Eclipse plugin (OEP) is an Eclipse plugin for editing Apache Oozie workflows
graphically. It is a graphical editor for editing Apache Oozie workflows inside Eclipse.