BDA - Unit 4
• The term ‘Big Data’ refers to collections of large datasets characterized by huge volume, high velocity, and a variety of data that keeps increasing day by day. It is difficult to process Big Data using traditional data management systems.
• Therefore, the Apache Software Foundation introduced a framework called Hadoop to solve Big Data management and processing challenges.
Features of Big Data Platform
Here are the most important features of any good Big Data analytics platform:
a) A Big Data platform should be able to accommodate new platforms and tools based on business requirements, because business needs can change due to new technologies or changes in business processes.
b) It should support linear scale-out
c) It should have capability for rapid deployment
d) It should support variety of data format
e) Platform should provide data analysis and reporting tools
f) It should provide real-time data analysis software
g) It should have tools for searching through large data sets.
Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate. Challenges include:
• Analysis,
• Capture,
• Data Curation,
• Search,
• Sharing,
• Storage,
• Transfer,
• Visualization,
• Querying,
• Updating
d) Hortonworks
Hortonworks uses 100% open-source software without any proprietary components. Hortonworks was the first to integrate support for Apache HCatalog. Hortonworks is a Big Data company based in California that develops and supports applications for Apache Hadoop.
The Hortonworks Hadoop distribution is 100% open source and enterprise-ready, with the following features:
• Centralized management and configuration of clusters
• Security and data governance are built-in features of the system
• Centralized security administration across the system
e) MapR
• MapR is another Big Data platform which uses the UNIX file system for handling data.
• It does not use HDFS, and the system is easy to learn for anyone familiar with the UNIX system.
• This solution integrates Hadoop, Spark, and Apache Drill with a real-time data processing feature.
g) Microsoft HDInsight
• Microsoft HDInsight is also based on the Hadoop distribution and is a commercial Big Data platform from Microsoft.
• Microsoft is a software giant that develops the Windows operating system for desktop and server users.
• HDInsight is a major Hadoop distribution offering that runs on Windows and Azure environments.
• It offers customized, optimized, open-source Hadoop-based analytics clusters that use Spark, Hive, MapReduce, HBase, Storm, Kafka, and R Server, running on the Hadoop system in the Windows/Azure environment.
ZooKeeper (coordination service) – Provisions a high-performance coordination service for distributed running of applications and tasks
Avro (data serialization and transfer utility) – Provisions data serialization during data transfer between application and processing layers
Oozie – Provides a way to package and bundle multiple coordinator and workflow jobs and manage the lifecycle of those jobs
Sqoop (SQL-to-Hadoop, a data-transfer software) – Provisions data transfer between data stores such as relational DBs and Hadoop
Flume (large data transfer utility) – Provisions reliable data transfer and provides for recovery in case of failure; transfers large amounts of data in applications, e.g., social-media messages
Ambari (a web-based tool) – Provisions, monitors, manages, and allows viewing of the functioning of the cluster and of the MapReduce, Hive and Pig APIs
Chukwa (a data collection system) – Provisions and manages a data collection system for large and distributed systems
HBase (a structured data store using a database) – Provisions a scalable and structured database for large tables
Cassandra (a database) – Provisions a scalable and fault-tolerant database for multiple masters
Hive
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop to
summarize Big Data, and makes querying and analyzing easy.
Hive was initially developed by Facebook; later the Apache Software Foundation took it up and developed it further as open source under the name Apache Hive. It is used by different companies. For example, Amazon uses it in Amazon Elastic MapReduce.
Apache Hive is a data warehouse system for Hadoop that runs SQL-like queries called HQL (Hive Query Language), which are internally converted into MapReduce jobs.
Hive supports Data Definition Language (DDL), Data Manipulation Language (DML) and user-defined functions.
DDL: CREATE TABLE, CREATE INDEX, CREATE VIEW.
DML: SELECT, WHERE, GROUP BY, JOIN, ORDER BY.
Pluggable Functions:
UDF: User Defined Function
UDAF: User Defined Aggregate Function
UDTF: User Defined Table-generating Function
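For illustration, here is a minimal HiveQL sketch of a DDL statement and a DML query; the employee table and its columns are assumed only for this example:
hive> CREATE TABLE employee (id INT, name STRING, salary FLOAT, dept STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
hive> SELECT dept, count(*) FROM employee GROUP BY dept;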
Hive is not
A relational database
A design for OnLine Transaction Processing (OLTP)
A language for real-time queries and row-level updates
Features of Hive
Hive's metastore is used to persist the schema, i.e. the table definition (table name, columns, types), the location of table files, the row format of table files, and the storage format of files.
The Hive Query Language (HiveQL or HQL) is similar to SQL for querying and is converted into MapReduce jobs on the backend.
Hive's default metastore database is Derby.
It stores the schema in a database and the processed data in HDFS.
It is designed for OLAP.
It is familiar, fast, scalable and extensible.
Architecture of Hive
The following component diagram depicts the architecture of Hive:
This component diagram contains different units. The following table describes each unit:
User Interface – Hive is a data warehouse infrastructure software that can create interaction between the user and HDFS. The user interfaces that Hive supports are Hive Web UI, Hive command line, and Hive HDInsight (on Windows Server).
Meta Store – Hive chooses respective database servers to store the schema or metadata of tables, databases, columns in a table, their data types, and the HDFS mapping.
HiveQL Process Engine – HiveQL is similar to SQL for querying the schema information in the Metastore. It is one of the replacements of the traditional approach for MapReduce programs. Instead of writing a MapReduce program in Java, we can write a query for the MapReduce job and process it.
Execution Engine – The conjunction part of the HiveQL Process Engine and MapReduce is the Hive Execution Engine. The execution engine processes the query and generates results the same as MapReduce results. It uses the flavor of MapReduce.
HDFS or HBASE – The Hadoop Distributed File System (HDFS) or HBase are the data storage techniques used to store data in the file system.
Working of Hive
The following diagram depicts the workflow between Hive and Hadoop.
The following table defines how Hive interacts with Hadoop framework:
Step No.  Operation
1 Execute Query
The Hive interface such as Command Line or Web UI sends the query to the Driver (any database driver such as JDBC, ODBC, etc.) to execute.
2 Get Plan
The driver takes the help of the query compiler, which parses the query to check the syntax and the query plan or the requirement of the query.
3 Get Metadata
The compiler sends metadata request to Metastore (any database).
4 Send Metadata
Metastore sends metadata as a response to the compiler.
5 Send Plan
The compiler checks the requirement and resends the plan to the driver. Up to here, the
parsing and compiling of a query is complete.
6 Execute Plan
The driver sends the execute plan to the execution engine.
7 Execute Job
Internally, the process of executing the job is a MapReduce job. The execution engine sends the job to the JobTracker, which resides in the Name node, and it assigns this job to the TaskTracker, which resides in the Data node. Here, the query executes the MapReduce job.
7.1 Metadata Ops
The execution engine can execute metadata operations with Metastore.
8 Fetch Result
The execution engine receives the results from Data nodes.
9 Send Results
The execution engine sends those resultant values to the driver.
10 Send Results
The driver sends the results to Hive Interfaces.
File Formats in Hive
File Format specifies how records are encoded in files
Record Format implies how a stream of bytes for a given record are encoded
The default file format is TEXTFILE – each record is a line in the file
Hive uses different control characters as delimiters in text files
^A (octal 001), ^B (octal 002), ^C (octal 003), \n
The term field is used when overriding the default delimiter
FIELDS TERMINATED BY '\001'
Supports text files – CSV, TSV
TextFile can contain JSON or XML documents.
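As a sketch, the delimiter and file format can be declared when creating a table; the logs table and its columns below are assumed purely for illustration:
hive> CREATE TABLE logs (ts STRING, level STRING, message STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
STORED AS TEXTFILE;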
Hive Commands
Hive supports Data Definition Language (DDL), Data Manipulation Language (DML) and user-defined functions.
The SELECT statement is used to retrieve data from a table. The WHERE clause works like a condition: it filters the data using the condition and gives you a finite result. The built-in operators and functions generate an expression which fulfils the condition.
Syntax of the SELECT query:
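A general form is given below; square brackets mark optional clauses:
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[ORDER BY col_list]
[LIMIT number];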
The ORDER BY clause is used to retrieve the details based on one column and sort the result set in ascending or descending order.
Syntax of the ORDER BY clause:
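A general form, with optional clauses in brackets:
SELECT select_expr, ...
FROM table_reference
[WHERE where_condition]
ORDER BY col_list [ASC | DESC];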
Example - Generate a query to retrieve the details of employees who earn a salary of more than Rs 30000.
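Assuming an employee table with a salary column (the table and column names are illustrative), the query can be written as:
hive> SELECT * FROM employee WHERE salary > 30000;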
The GROUP BY clause is used to group all the records in a result set using a particular collection
column. It is used to query a group of records.
Syntax of GROUP BY clause is as follows:
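A general form (the aggregate function is whatever aggregation the query needs, e.g. count or sum):
SELECT col_list, aggregate_function(col)
FROM table_reference
[WHERE where_condition]
GROUP BY col_list;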
JOIN is a clause that is used for combining specific fields from two tables by using values common to each one. It is
used to combine records from two or more tables in the database.
Syntax of JOIN clause is as follows:
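A simplified general form is shown below; the LEFT, RIGHT, and FULL OUTER variants described later follow the same pattern:
SELECT select_expr, ...
FROM table_reference
JOIN table_reference
ON (join_condition);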
Example
Consider the following two tables named CUSTOMERS and ORDERS:
CUSTOMERS ORDERS
JOIN
The JOIN clause is used to combine and retrieve records from multiple tables. A plain JOIN behaves like an INNER JOIN in SQL. A JOIN condition is to be raised using the primary keys and foreign keys of the tables.
Eg. - The following query executes JOIN on the CUSTOMERS and ORDERS tables and retrieves the records:
hive> SELECT c.ID, c.NAME, c.AGE, o.AMOUNT
FROM CUSTOMERS c JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:
LEFT OUTER JOIN
The HiveQL LEFT OUTER JOIN returns all the rows from the left table, even if there are no matches in the right table.
Eg.- The following query demonstrates LEFT OUTER JOIN between the CUSTOMERS and ORDERS tables:
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
LEFT OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:
RIGHT OUTER JOIN
The HiveQL RIGHT OUTER JOIN returns all the rows from the right table, even if there are no matches in the
left table.
If the ON clause matches 0 (zero) records in the left table, the JOIN still returns a row in the result, but with
NULL in each column from the left table.
A RIGHT JOIN returns all the values from the right table, plus the matched values from the left table, or NULL
in case of no matching join predicate.
Eg.- The following query demonstrates RIGHT OUTER JOIN between the CUSTOMERS and ORDERS tables:
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
RIGHT OUTER JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:
FULL OUTER JOIN
The HiveQL FULL OUTER JOIN combines the records of both the left and the right tables that fulfil the JOIN condition; where there is no match, the missing side is filled with NULL.
Eg.- The following query demonstrates FULL OUTER JOIN between the CUSTOMERS and ORDERS tables:
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
FULL OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:
Apache Pig
Apache Pig is a high-level data flow platform for executing MapReduce programs of Hadoop.
The language used for Pig is Pig Latin. Pig Latin is a data flow language used by Apache Pig to analyze data in Hadoop. It is a textual language that abstracts the programming from the Java MapReduce idiom into a higher-level notation.
The Pig scripts get internally converted to Map Reduce jobs and get executed on data stored in HDFS.
Apart from that, Pig can also execute its job in Apache Tez or Apache Spark.
Pig can handle any type of data, i.e., structured, semi-structured or unstructured, and stores the corresponding results in the Hadoop Distributed File System (HDFS).
Every task that can be achieved using Pig can also be achieved using Java in MapReduce.
Apache Pig scripts can be executed in three ways, namely, interactive mode, batch mode and embedded mode.
Interactive Mode (Grunt shell) − Apache Pig can be executed in interactive mode using the Grunt shell. In this shell, Pig Latin statements can be entered and the output can be produced (using the Dump operator).
Batch Mode (Script) − Apache Pig can be executed in Batch mode by writing the Pig Latin script in a single file
with .pig extension.
Embedded Mode (UDF) − Apache Pig provides the provision of defining our own functions (User Defined
Functions) in programming languages such as Java, and using them in script.
Pig Data Types - A list of Apache Pig data types with descriptions and examples is given below.
Given below in the table are some frequently used Pig commands:
Command  Function
load     Reads data from the file system
Pig Vs Hive
Apache Spark
Apache Spark is a general-purpose, lightning-fast cluster computing system designed for fast computation. It provides high-level APIs in Java, Scala, Python, and R. Apache Spark is a tool for running Spark applications.
Spark is up to 100 times faster than Hadoop MapReduce when data is processed in memory and about 10 times faster when accessing data from disk. It builds on the Hadoop MapReduce model and extends it to efficiently support more types of computations, including interactive queries and stream processing. It can be integrated with Hadoop and can process existing Hadoop HDFS data.
Spark is written in Scala but provides rich APIs in Scala, Java, Python, and R.
Spark Architecture
Apache Spark uses a master-slave architecture.
Just like in the real world, the master gets the job done by using its slaves. This means there is one master process and multiple slave processes controlled by that dedicated master process. In the Spark environment, master nodes are referred to as drivers and slaves are referred to as executors.
Drivers - The master manages, maintains, and monitors the slaves, while the slaves are the actual workers that perform the processing tasks. The user tells the master what work is to be done, and the master takes care of the rest, completing the task using its slaves.
The driver is the master process in the Spark environment. It contains all the metadata about the slaves or, in Spark terms, the executors.
Drivers are responsible for: Analyzing, Distributing, Scheduling, and Monitoring the executors.
Executors
Executors are the slave processes in the Spark environment. They perform the data processing which is assigned to
them by their master.
i. Spark Core - It is the kernel of Spark, which provides an execution platform for all the Spark applications. It is a generalized platform to support a wide array of applications.
ii. Spark SQL - It enables users to run SQL/HQL queries on top of Spark. Using Apache Spark SQL, we can process structured as well as semi-structured data. It also provides an engine for Hive to run unmodified queries up to 100 times faster on existing deployments.
iii. Spark Streaming - Apache Spark Streaming enables powerful interactive and analytical applications across live streaming data. The live streams are converted into micro-batches which are executed on top of Spark Core.
iv. Spark MLlib - It is the scalable machine learning library which delivers both efficiency and high-quality algorithms. Apache Spark MLlib is one of the hottest choices for data scientists due to its capability of in-memory data processing, which improves the performance of iterative algorithms drastically.
v. Spark GraphX - Apache Spark GraphX is the graph computation engine built on top of Spark that enables processing of graph data at scale.
vi. SparkR - It is an R package that provides a lightweight frontend to use Apache Spark from R. It allows data scientists to analyze large datasets and interactively run jobs on them from the R shell. The main idea behind SparkR was to explore different techniques to integrate the usability of R with the scalability of Spark.
There are two main ways to execute Spark applications.
1. Using Spark shell
2. Using the Spark submit method
#1) Spark shell
The Spark shell is an interactive way to execute Spark applications on the terminal. Spark-shell prints information about the user's last action as the user enters it into the shell.
#2) Spark-submit
In a real-world scenario, you cannot execute programs interactively. More often than not, you have to process data periodically or in real time using jar files, which contain the code to process the data. Spark-submit is the method used in Spark to run application jars in a Spark environment. Just like running a Java jar file on a terminal, you can execute Spark applications using spark-submit. Spark gets the data file, splits it into small chunks of data, and distributes those chunks of data across the cluster.
Zookeeper is implemented in Java and is an Apache Software Foundation project and is released under the
Apache License 2.0.
Why Apache Zookeeper is needed?
Coordination services: The integration/communication of services in a distributed environment.
Coordination services are complex to get right. They are especially prone to errors such as race conditions
and deadlock.
Race condition – Two or more systems trying to perform the same task at the same time, where the outcome depends on timing.
Deadlock – Two or more operations waiting for each other to complete.
To make coordination between distributed environments easy, developers came up with ZooKeeper, which relieves distributed applications of the responsibility of implementing coordination services from scratch.
Architecture of Zookeeper
The ZooKeeper architecture consists of a hierarchy of nodes called znodes, organized in a tree-like structure.
Each znode can store data and has a set of permissions that control access to the znode. The znodes are
organized in a hierarchical namespace, similar to a file system. At the root of the hierarchy is the root znode
and all other znodes are children of the root znode. Each znode can have children and grandchildren, and so on.
Important Components in Zookeeper
Other Components
Client – One of the nodes in our distributed application cluster. It accesses information from the server. Every client sends a message to the server to let the server know that the client is alive.
Server – Provides all the services to the client and gives acknowledgment to the client.
Ensemble – A group of ZooKeeper servers. The minimum number of nodes required to form an ensemble is 3.
In Zookeeper, data is stored in a hierarchical namespace, similar to a file system. Each node in the namespace is
called a Znode, and it can store data and have children. Znodes are similar to files and directories in a file system.
Zookeeper provides a simple API for creating, reading, writing, and deleting Znodes. It also provides mechanisms for
detecting changes to the data stored in Znodes, such as watches and triggers. Znodes maintain a stat structure that
includes: Version number, ACL, Timestamp, Data Length.
Benefits of ZooKeeper
Simple distributed coordination process
Synchronization − Mutual exclusion and co-operation between server processes. This process helps in Apache
HBase for configuration management.
Ordered Messages
Serialization − Encode the data according to specific rules and ensure your application runs consistently. This approach can be used in MapReduce to coordinate queues for executing running threads.
Reliability
Atomicity − A data transfer either succeeds or fails completely; no transaction is partial.
Applications of ZooKeeper
1. Yahoo!
The ZooKeeper framework was originally built at “Yahoo!”. A well-designed distributed application needs to
meet requirements such as data transparency, better performance, robustness, centralized configuration, and
coordination. So, they designed the ZooKeeper framework to meet these requirements.
2. Apache Hadoop
Apache Hadoop is the driving force behind the growth of the Big Data industry. Hadoop relies on ZooKeeper for
configuration management and coordination. ZooKeeper provides the facilities for cross-node
synchronization and ensures the tasks across Hadoop projects are serialized and synchronized.
Multiple ZooKeeper servers support large Hadoop clusters. Each client machine communicates with one of
the ZooKeeper servers to retrieve and update its synchronization information. Some of the real-time examples
are −
Human Genome Project − The Human Genome Project contains terabytes of data. Hadoop
MapReduce framework can be used to analyze the dataset and find interesting facts for human development.
Healthcare − Hospitals can store, retrieve, and analyze huge sets of patient medical records, which are
normally in terabytes.
3. Apache HBase
Apache HBase is an open source, distributed, NoSQL database used for real-time read/write access to large datasets, and it runs on top of HDFS. HBase follows a master-slave architecture where the HBase Master governs all the slaves. The slaves are referred to as Region servers.
HBase distributed application installation depends on a running ZooKeeper cluster. Apache HBase uses
ZooKeeper to track the status of distributed data throughout the master and region servers with the help
of centralized configuration management and distributed mutex mechanisms. Some use-cases of HBase −
Telecom − The telecom industry stores billions of mobile call records (around 30 TB/month), and accessing these call records in real time becomes a huge task. HBase can be used to process all the records in real time, easily and efficiently.
Social network − Similar to telecom industry, sites like Twitter, LinkedIn, and Facebook receive huge
volumes of data through the posts created by users. HBase can be used to find recent trends and other
interesting facts.
4. Apache Solr
Apache Solr is a fast, open source search platform written in Java. It is a blazing fast, fault-tolerant distributed search engine. Built on top of Lucene, it is a high-performance, full-featured text search engine.
Solr extensively uses every feature of ZooKeeper, such as configuration management, leader election, node management, and locking and synchronization of data. Some of the use-cases of Apache Solr include e-commerce,
job search, etc.
Solr uses ZooKeeper for both indexing the data in multiple nodes and searching from multiple nodes.
ZooKeeper contributes the following features −
Add / remove nodes as and when needed
Replication of data between nodes and subsequently minimizing data loss
Sharing of data between multiple nodes and subsequently searching from multiple nodes for faster search
results
5. Apache Mesos
Apache Mesos is a cluster manager that offers efficient resource isolation and sharing across distributed applications. Mesos uses ZooKeeper for its fault-tolerant replicated master.
6. Apache Accumulo
Apache Accumulo, another sorted distributed key/value store, is built on top of Apache ZooKeeper (and Apache Hadoop).
7. Neo4j
Neo4j is a distributed graph database that uses ZooKeeper for write-master selection and read-slave coordination.
8. Cloudera
Cloudera Search integrates search functionality with Hadoop and uses ZooKeeper for centralized configuration management.