
MC7011 BIG DATA ANALYTICS

UNIT V – FRAMEWORKS

Applications on Big Data using Pig and Hive, Data processing Operators in
Pig, Hive Services, Hive QL, Querying Data in Hive, Fundamentals of HBase
and ZooKeeper, IBM InfoSphere BigInsights and Streams, Visualizations,
Visual Data Analysis techniques, Interaction techniques, Systems and
Applications

1. Define PIG
Pig is a high-level data flow platform for creating MapReduce programs on
Hadoop.
It is provided by Apache.
It can be thought of as a compiler: just as a compiler takes a high-level
language like Java and converts it into assembly, Pig takes Pig Latin scripts
and converts them into MapReduce jobs.
The language for Pig is Pig Latin.
Every task that can be achieved using Pig can also be achieved using Java
MapReduce directly.
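
A minimal Pig Latin sketch of this kind of data flow (the file path and field
names are hypothetical):

-- load a hypothetical tab-separated log file from HDFS
logs = LOAD '/data/access_log' AS (user:chararray, url:chararray, time:long);
-- keep only hits on the home page
home = FILTER logs BY url == '/index.html';
-- count hits per user
grouped = GROUP home BY user;
counts = FOREACH grouped GENERATE group AS user, COUNT(home) AS hits;
STORE counts INTO '/data/home_hits';

Pig compiles this script into one or more MapReduce jobs; the author never
writes map or reduce functions directly.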

2. Mention the two modes for PIG execution


The Pig execution environment has two modes:
Local mode: All scripts are run on a single machine. Hadoop MapReduce
and HDFS are not required.
Hadoop: Also called MapReduce mode, all scripts are run on a given
Hadoop cluster.
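
The mode is selected with the -x flag when launching Pig:

pig -x local          (local mode: single JVM, local filesystem)
pig -x mapreduce      (MapReduce mode, the default: runs on the Hadoop cluster)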

3. List the 3 ways by which PIG programs can be run


Pig Latin Script: Simply a file containing Pig Latin commands, identified by
the .pig suffix (for example, file.pig or myscript.pig).
Grunt shell: Grunt is a command interpreter. You can type Pig Latin on the
grunt command line and Grunt will execute the command on your behalf.
Embedded: Pig programs can be executed as part of a Java program.
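
Each of the three ways, sketched (the script and file names are hypothetical):

pig myscript.pig          (run a Pig Latin script file)
pig                       (no arguments: opens the Grunt shell)

// embedded in Java, using Pig's PigServer class (org.apache.pig)
PigServer pig = new PigServer(ExecType.LOCAL);
pig.registerQuery("logs = LOAD '/data/access_log' AS (user:chararray);");
pig.store("logs", "/data/out");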

4. Mention the Features of PIG


It is a large-scale data processing system
Scripts are written in Pig Latin, a dataflow language
Developed by Yahoo and open source
Pig runs on Hadoop. It makes use of both the Hadoop Distributed File
System, HDFS, and Hadoop’s processing system, MapReduce.
5. Differentiate PIG and Map reduce
PIG                                        MAP REDUCE
PIG is a data flow language; the key       MapReduce is a programming model,
focus of Pig is managing the flow of       or framework, for processing large
data from input source to output           data sets in a distributed manner,
store.                                     using a large number of computers,
                                           i.e. nodes.
Pig is written specifically for
managing the data flow of MapReduce-
type jobs; PIG commands are submitted
as MapReduce jobs internally.

It is more concise: the 200 lines of       A 200-line Java program is written
Java code are reduced to about 10          for MapReduce.
lines in PIG.

It is a bit slower compared to             No translation required.
MapReduce, since PIG commands are
translated into MapReduce prior to
execution.

6. Mention the various operations supported by PIG


Loading and storing of data
Streaming data
Filtering data
Grouping and joining data
Sorting data
Combining and splitting data
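
A sketch touching several of these operators (relation, field, and file names
are hypothetical):

users  = LOAD '/data/users'  AS (id:int, name:chararray);        -- loading
orders = LOAD '/data/orders' AS (uid:int, amount:double);
big    = FILTER orders BY amount > 100.0;                        -- filtering
joined = JOIN users BY id, big BY uid;                           -- joining
sorted = ORDER joined BY amount DESC;                            -- sorting
SPLIT orders INTO small IF amount < 10.0, rest OTHERWISE;        -- splitting
STORE sorted INTO '/data/top_orders';                            -- storing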

7. Define GRUNT
Grunt is Pig’s interactive shell. It enables users to enter Pig Latin interactively and
provides a shell for users to interact with HDFS. It is a command interpreter.
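
A short Grunt session (paths hypothetical):

grunt> fs -ls /data
grunt> logs = LOAD '/data/access_log' AS (user:chararray);
grunt> DUMP logs;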

8. Mention the various data types supported by PIG

Pig’s data types can be divided into two categories: scalar types, which contain a
single value, and complex types, which contain other types.

Scalar Types
Pig’s scalar types are simple types that appear in most programming languages.
int
An integer, stored as a four-byte signed integer.
long
A long integer, stored as an eight-byte signed integer.
float
A floating-point number, stored in four bytes.
double
A double-precision floating-point number, stored in eight bytes.
chararray
A string or character array, expressed as a string literal with single quotes.
bytearray
A blob or array of bytes.

Complex Types

Pig has several complex data types such as maps, tuples, and bags. All of these
types can contain data of any type, including other complex types. So it is possible
to have a map where the value field is a bag, which contains a tuple where one of
the fields is a map.
Map
A map in Pig is a chararray to data element mapping, where that element can be
any Pig type, including a complex type. The chararray is called a key and is used
as an index to find the element, referred to as the value.
Tuple
A tuple is a fixed-length, ordered collection of Pig data elements. Tuples are
divided into fields, with each field containing one data element. These elements
can be of any type—they do not all need to be the same type. A tuple is analogous
to a row in SQL, with the fields being SQL columns.
Bag
A bag is an unordered collection of tuples. Because it has no order, it is not
possible to reference tuples in a bag by position. Like tuples, a bag can, but is not
required to, have a schema associated with it. In the case of a bag, the schema
describes all tuples within the bag.
Nulls
Pig includes the concept of a data element being null. Data of any type can be
null. It is important to understand that in Pig the concept of null is the same as in
SQL, which is completely different from the concept of null in C, Java, Python, etc.
In Pig a null data element means the value is unknown.
Casts
A cast converts a value of one type to another type.

Type        Description                   Example
int         Signed 32-bit integer         2
long        Signed 64-bit integer         15L or 15l
float       32-bit floating point         2.5f or 2.5F
double      64-bit floating point         1.5 or 1.5e2 or 1.5E2
chararray   Character array (string)      'hello'
bytearray   BLOB (byte array)
tuple       Ordered set of fields         (12,43)
bag         Collection of tuples          {(12,43),(54,28)}
map         Set of key#value pairs        [open#apache]
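
A sketch of how the complex types appear in a LOAD schema (file and field
names are hypothetical):

-- each record carries a tuple, a bag, and a map alongside a scalar
data = LOAD '/data/records' AS (
    name:chararray,
    location:tuple(x:int, y:int),
    scores:bag{t:tuple(score:int)},
    props:map[]
);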

9. Define HIVE

• Hive is a data warehouse system for Hadoop. It runs SQL-like queries
called HQL (Hive Query Language) which get internally converted to
MapReduce jobs.
• Hive was developed by Facebook.
• Hive supports Data Definition Language (DDL), Data Manipulation
Language (DML) and user-defined functions.
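
A minimal HiveQL sketch (table and column names are hypothetical):

hive> CREATE TABLE employees (id INT, name STRING, salary DOUBLE);
hive> SELECT name, salary FROM employees WHERE salary > 50000;

Behind the scenes, the SELECT is compiled into a MapReduce job that scans
the table’s files in HDFS.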

10. List the various HIVE services


• cli - The command-line interface to Hive (the shell). This is the default
service; it can be run using the hive command.
• hiveserver2 - Runs Hive as a server exposing a Thrift service, enabling
access from a range of clients written in different languages.
• beeline - A command-line interface to Hive that works in embedded mode,
or connects to a HiveServer2 process over JDBC.
• hwi - The Hive Web Interface
• jar - The Hive equivalent of hadoop jar, a convenient way to run Java
applications

11. Mention the various clients connected to Hive server


• Thrift Client
The Hive server is exposed as a Thrift service, so it’s possible to interact
with it using any programming language that supports Thrift.
• JDBC driver
• ODBC driver
An ODBC driver allows applications that support the ODBC protocol (such
as business intelligence software) to connect to Hive.
• The Metastore
The metastore is the central repository of Hive metadata

12. List the advantages of Hive


• Fits the low-level interface requirements of Hadoop perfectly
• Hive supports external tables and ODBC/JDBC
• Has an intelligent query optimizer
• Hive supports table-level partitioning to speed up query times
• The metadata store is a big plus in the architecture; it makes lookups easy

13. List the Data Units of Hive


Hive data is organized into:
• Databases: Namespaces that separate tables and other data units to avoid
naming conflicts.
• Tables: Homogeneous units of data which have the same schema.
• Partitions: Each table can have one or more partition keys which
determine how the data is stored. Partitions, apart from being storage
units, also allow the user to efficiently identify the rows that satisfy
certain criteria.
• Partition columns are virtual columns; they are not part of the data itself
but are derived on load.
• Buckets (or Clusters): Data in each partition may in turn be divided into
buckets based on the value of a hash function of some column of the table.
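
A sketch showing partitions and buckets in a table definition (table and
column names are hypothetical):

hive> CREATE TABLE page_views (user_id INT, url STRING)
    > PARTITIONED BY (view_date STRING)
    > CLUSTERED BY (user_id) INTO 32 BUCKETS;

Each distinct view_date gets its own directory in HDFS, and within a partition
rows are hashed on user_id into 32 bucket files.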

14. Mention the various Hive data Types


Hive supports two categories of data types:
1. Primitive data types
2. Collection data types

Primitive Data Types
TINYINT, SMALLINT, INT, BIGINT (integer types of increasing width),
BOOLEAN, FLOAT, DOUBLE, STRING, TIMESTAMP and BINARY.

Collection Data Types
ARRAY - an ordered collection of elements of the same type
MAP - a collection of key-value pairs
STRUCT - a record with named fields of possibly different types
UNIONTYPE - a value that may hold exactly one of several declared types
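
A sketch of a table definition using the collection types (table and column
names are hypothetical):

hive> CREATE TABLE employees (
    >   name STRING,
    >   subordinates ARRAY<STRING>,
    >   deductions MAP<STRING, FLOAT>,
    >   address STRUCT<street:STRING, city:STRING, zip:INT>
    > );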

15. Differentiate PIG vs Hive


Pig                                         Hive
Procedural data flow language               Declarative SQLish language
For programming                             For creating reports
Mainly used by researchers and              Mainly used by data analysts
programmers
Operates on the client side of a            Operates on the server side of a
cluster.                                    cluster.
Does not have a dedicated metadata          Makes use of an exact variation of a
database.                                   dedicated SQL DDL language by
                                            defining tables beforehand.
Pig is SQL-like but varies to a             Directly leverages SQL and is easy
great extent.                               to learn for database experts.
Pig supports the Avro file format.          Hive does not support it.
Developed by Yahoo                          Developed by Facebook
Language used is Pig Latin                  Language used is HiveQL

16. Define Hive QL


HiveQL is the Hive query language. It supports SQL features like CREATE
tables, DROP tables, SELECT ... FROM ... WHERE clauses, joins (inner, left
outer, right outer and full outer joins), Cartesian products, GROUP BY,
SORT BY, aggregations, UNION and many useful functions on primitive as well
as complex data types.

hive> CREATE DATABASE IF NOT EXISTS financials;

hive> SHOW DATABASES;

hive> CREATE DATABASE human_resources;

hive> SHOW DATABASES;

DESCRIBE database
shows the directory location for the database.

hive> DESCRIBE DATABASE financials;


USE database

The USE command sets a database as your working database, analogous to
changing working directories in a filesystem.

hive> USE financials;

DROP database

hive> DROP DATABASE IF EXISTS financials;


The IF EXISTS is optional and suppresses warnings if financials doesn’t exist.

Alter Database
You can set key-value pairs in the DBPROPERTIES associated with a database
using the ALTER DATABASE command. No other metadata about the database
can be changed, including its name and directory location:

hive> ALTER DATABASE financials SET DBPROPERTIES ('edited-by' = 'active steps');
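
Querying data, sketched with a join and GROUP BY (table and column names
are hypothetical):

hive> SELECT d.name, COUNT(*)
    > FROM employees e JOIN departments d ON (e.dept_id = d.id)
    > GROUP BY d.name;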

17. Define HBase


HBase is an open-source, sorted map datastore built on top of Hadoop. It is
column-oriented and horizontally scalable.

18. Explain the need for HBase


• An RDBMS gets exponentially slower as the data becomes large
• It expects data to be highly structured, i.e. to fit in a well-defined
schema
• Any change in schema might require downtime
• For sparse (thin) datasets, there is too much overhead in maintaining
NULL values

19. List the Features of Hbase


• Horizontally scalable: new machines can be added to the cluster as data
grows, and new columns can be added to a column family at any time.
• Automatic failover: automatic failover is a resource that allows a system
administrator to automatically switch data handling to a standby system in
the event of system compromise.
• Integration with the MapReduce framework: all the commands and Java code
internally implement MapReduce to do the task, and HBase is built over the
Hadoop Distributed File System.
• It is a sparse, distributed, persistent, multidimensional sorted map,
which is indexed by row key, column key, and timestamp.
• Often referred to as a key-value store or column-family-oriented database,
or as storing versioned maps of maps.
• Fundamentally, it is a platform for storing and retrieving data with
random access.
• It does not care about datatypes (you can store an integer in one row and
a string in another for the same column).
• It does not enforce relationships within your data.
• It is designed to run on a cluster of computers built using commodity
hardware.
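
The random-access style, sketched in the HBase shell (table name, column
family, and values are hypothetical):

hbase> create 'users', 'info'
hbase> put 'users', 'row1', 'info:name', 'alice'
hbase> put 'users', 'row1', 'info:age', '34'
hbase> get 'users', 'row1'
hbase> scan 'users'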

20. Differentiate RDBMS vs Hbase

RDBMS                                    HBASE
Has a fixed schema, defined by the       Schema-less: no concept of a fixed-
database/schema                          column schema; defines only column
                                         families
Built for small tables                   Built for wide tables
Table in an RDBMS                        Column family in HBase
Record in an RDBMS                       Record in HBase
Data layout is row-oriented              Data layout is column-oriented
SQL is the query language used           get/put/scan commands are used
Maximum data size is TBs                 Hundreds of PBs
1000s of queries/second can be           Millions of queries per second
read and written
RDBMS is transactional.                  No transactions in HBase.
It has normalized data.                  It has de-normalized data.
It is good for structured data.          It is good for semi-structured as
                                         well as structured data.

21. List the Applications of Hbase

• It is used whenever there is a need for write-heavy applications.
• HBase is used whenever we need to provide fast random access to available
data.
• Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase
internally.

22. List the Components of Hbase


HBase has three major components:
• the client library,
• a master server,
• region servers.
• Region servers can be added or removed as per requirement.

23. Define Zoo Keeper


Apache ZooKeeper is a service used by a cluster (group of nodes) to coordinate
among themselves and maintain shared data with robust synchronization
techniques.

24. List the Benefits of Zoo Keeper


• Simple distributed coordination process
• Synchronization − mutual exclusion and co-operation between server
processes. This process helps in Apache HBase for configuration
management.
• Ordered messages
• Serialization − encode the data according to specific rules and ensure
your application runs consistently. This approach can be used in MapReduce
to coordinate queues for executing running threads.
• Reliability
• Atomicity − a data transfer either succeeds or fails completely; no
transaction is partial
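
A sketch of shared configuration data in the ZooKeeper command-line client
(the znode path and values are hypothetical):

zkCli.sh -server localhost:2181
create /app/config "v1"       (create a znode holding shared data)
get /app/config               (any node in the cluster can read it)
set /app/config "v2"          (ordered, atomic update visible to all clients)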

25. Define Data Visualization


Data visualization is a general term that describes any effort to help people
understand the significance of data by placing it in a visual context.
Patterns, trends, and correlations that might go undetected in text-based data
can be exposed and recognized more easily with data visualization software.

Most business intelligence software vendors embed data visualization tools
into their products, either developing the visualization technology themselves
or sourcing it from companies that specialize in visualization.

26. List the various Interaction Techniques used in information visualization


A multiple-view system uses two or more distinct views to support the
investigation of a single conceptual entity.
Fish-eye lenses magnify the center of the field of view, with a continuous
fall-off in magnification toward the edges. Degree-of-interest values
determine the level of detail to be displayed for each item and are assigned
through user interaction.
Dynamic queries continuously update the data that is filtered from the database
and visualized.
The details-on-demand technique allows interactively selecting parts of the
data to be visualized in more detail while providing an overview of the whole
informational concept.
Filtering is one of the basic interaction techniques often used in information
visualization; it limits the amount of displayed information through filter
criteria.
The idea of linking and brushing is to combine different visualization methods to
overcome the shortcomings of single techniques. Interactive changes made in one
visualization are automatically reflected in the other visualizations.

Zooming is one of the basic interaction techniques of information
visualization. Since the maximum amount of information can be limited by the
resolution and color depth of a display, zooming is a crucial technique to
overcome this limitation. Zooming techniques include:
Geometric Zoom
Fisheye Zoom
Flip Zooming
Semantic Zoom
Three of these are described below.
Geometric zooming allows the user to specify the scale of magnification,
increasing or decreasing the magnification of an image by that scale. This
allows the user to focus on a specific area; information outside of this area
is generally discarded. A great example is mapping software like MapQuest or
Yahoo Maps.
The fisheye zoom is similar to the geometric zoom with the exception that the
outside information is not lost from view; this information is merely distorted.
Semantic zooming approaches the process from a different angle. Semantic
zooming changes the shape or context in which the information is being presented.
An example of this type of technique is the use of a digital clock within an
application.

In a normal view, the clock may show the hour of the day and the date. If the
user zooms in, the clock may alter its appearance by adding the minutes and
seconds. If the user zooms out, information is discarded, with only the date
remaining. The actual information did not change, only the presentation
method.
Magic Lens filters are a user interface tool that combines an arbitrarily
shaped region with an operator that changes the view of objects viewed through
that region.

Brushing is the process of interactively selecting data items from a visual
representation. The original intention of brushing is to highlight brushed
data items in different views of a visualization.

Prepared by : Mrs.M.Nirmala / AP / MCA