0% found this document useful (0 votes)

12 views

unit5-part1-notes

The document provides an overview of Big Data applications using tools like Pig, Hive, and HBase, detailing their functionalities and benefits. Pig is a high-level platform for processing large datasets with its scripting language, Pig Latin, while Hive serves as a data warehouse infrastructure for structured data in Hadoop. HBase is a non-relational database management system that supports various applications across different industries, highlighting the importance of these tools in managing and analyzing big data.

Uploaded by

voila.alberto

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

12 views

unit5-part1-notes

Uploaded by

voila.alberto

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 21

lOMoARcPSD|34736634

Unit5 part1 - Notes

Big Data (Dr. A.P.J. Abdul Kalam Technical University)

Scan to open on Studocu

Studocu is not sponsored or endorsed by any college or university

Downloaded by Aditi Gusain ([email protected])
lOMoARcPSD|34736634

Application of Big Data using :

1. Pig :
Pig is a high-level platform or tool which is used to process large datasets.
It provides a high level of abstraction for processing over MapReduce.
It provides a high-level scripting language, known as Pig Latin which is used
to develop the data analysis codes.
Applications :

1. For exploring large datasets Pig Scripting is used.

2. Provides supports across large data sets for Ad-hoc queries.
3. In the prototyping of large data-sets processing algorithms.
4. Required to process the time-sensitive data loads.
5. For collecting large amounts of datasets in form of search logs and web
crawls.
6. Used where the analytical insights are needed using the sampling.

2. Hive :
Hive is a data warehouse infrastructure tool to process structured data in
Hadoop.
It resides on top of Hadoop to summarize Big Data and makes querying and
analyzing easy.
It is used by different companies. For example, Amazon uses it in Amazon
Elastic MapReduce.
Benefits :

1. Ease of use
2. Accelerated initial insertion of data
3. Superior scalability, flexibility, and cost-efficiency
4. Streamlined security
5. Low overhead
6. Exceptional working capacity

3. HBase :
HBase is a column-oriented non-relational database management system that
runs on top of the Hadoop Distributed File System (HDFS).
HBase provides a fault-tolerant way of storing sparse data sets, which are
common in many big data use cases
HBase does support writing applications in Apache Avro, REST and Thrift.

Downloaded by Aditi Gusain ([email protected])

lOMoARcPSD|34736634

Application :

1. Medical
2. Sports
3. Web
4. Oil and petroleum
5. e-commerce

PIG
Introduction to PIG :
Pig is a high-level platform or tool which is used to process large datasets.
It provides a high level of abstraction for processing over MapReduce.
It provides a high-level scripting language, known as Pig Latin which is used
to develop the data analysis codes.
Pig Latin and Pig Engine are the two main components of the Apache Pig tool.
The result of Pig is always stored in the HDFS.
One limitation of MapReduce is that the development cycle is very long.
Writing the reducer and mapper, compiling packaging the code, submitting the
job and retrieving the output is a time-consuming task.
Apache Pig reduces the time of development using the multi-query approach.
Pig is beneficial for programmers who are not from Java backgrounds.
200 lines of Java code can be written in only 10 lines using the Pig Latin
language.
Programmers who have SQL knowledge needed less effort to learn Pig Latin.

Execution Modes of Pig :

Apache Pig scripts can be executed in three ways :
Interactive Mode (Grunt shell) :
You can run Apache Pig in interactive mode using the Grunt shell.
In this shell, you can enter the Pig Latin statements and get the output (using
the Dump operator).
Batch Mode (Script) :

Downloaded by Aditi Gusain ([email protected])

lOMoARcPSD|34736634

You can run Apache Pig in Batch mode by writing the Pig Latin script in a
single file with the .pig extension.
Embedded Mode (UDF) :
Apache Pig provides the provision of defining our own functions
(User Defined Functions) in programming languages such as Java and using
them in our script.
Comparison of Pig with Databases :

PIG SQL

SQL is a declarative
Pig Latin is a procedural language
language

In Apache Pig, the schema is optional. We can store

Schema is mandatory in
data without designing a schema (values are stored as
SQL.
$01, $02 etc.)

The data model used in SQL

The data model in Apache Pig is nested relational.
is flat relational.

Apache Pig provides limited opportunity for Query There is more opportunity for
optimization. query optimization in SQL.

Grunt :
Grunt shell is a shell command.
The Grunt shell of the Apace pig is mainly used to write pig Latin scripts.
Pig script can be executed with grunt shell which is a native shell provided by
Apache pig to execute pig queries.
We can invoke shell commands using sh and fs.
Syntax of sh command :
grunt> sh
ls

Syntax of fs command :
grunt>fs -
ls

Pig Latin :
The Pig Latin is a data flow language used by Apache Pig to analyze the data
in Hadoop.

Downloaded by Aditi Gusain ([email protected])

lOMoARcPSD|34736634

It is a textual language that abstracts the programming from the Java

MapReduce idiom into a notation.
The Pig Latin statements are used to process the data.
It is an operator that accepts a relation as an input and generates another
relation as an output.
· It can span multiple lines.
· Each statement must end with a semi-colon.
· It may include expression and schemas.
· By default, these statements are processed using multi-query execution

User-Defined Functions :
Apache Pig provides extensive support for User Defined Functions(UDF’s).
Using these UDF’s, we can define our own functions and use them.
The UDF support is provided in six programming languages:
· Java
· Jython
· Python
· JavaScript
· Ruby
· Groovy
For writing UDF’s, complete support is provided in Java and limited support is
provided in all the remaining languages.
Using Java, you can write UDF’s involving all parts of the processing like data
load/store, column transformation, and aggregation.
Since Apache Pig has been written in Java, the UDF’s written using Java
language work efficiently compared to other languages.
Types of UDF’s in Java :
Filter Functions :

 The filter functions are used as conditions in filter statements.

 These functions accept a Pig value as input and return a Boolean value.
Eval Functions :

 The Eval functions are used in FOREACH-GENERATE statements.

 These functions accept a Pig value as input and return a Pig result.

Downloaded by Aditi Gusain ([email protected])

lOMoARcPSD|34736634

Algebraic Functions :

 The Algebraic functions act on inner bags in a FOREACHGENERATE

statement.
 These functions are used to perform full MapReduce operations on an
inner bag.

Data Processing Operators :

The Apache Pig Operators is a high-level procedural language for querying
large data sets using Hadoop and the Map-Reduce Platform.
A Pig Latin statement is an operator that takes a relation as input and
produces another relation as output.
These operators are the main tools for Pig Latin provides to operate on the
data.
They allow you to transform it by sorting, grouping, joining, projecting, and
filtering.
The Apache Pig operators can be classified as :
Relational Operators :
Relational operators are the main tools Pig Latin provides to operate on the
data.
Some of the Relational Operators are :
LOAD: The LOAD operator is used to loading data from the file system or
HDFS storage into a Pig relation.
FOREACH: This operator generates data transformations based on columns
of data. It is used to add or remove fields from a relation.
FILTER: This operator selects tuples from a relation based on a condition.
JOIN: JOIN operator is used to performing an inner, equijoin join of two or
more relations based on common field values
ORDER BY: Order By is used to sort a relation based on one or more fields in
either ascending or descending order using ASC and DESC keywords.
GROUP: The GROUP operator groups together the tuples with the same
group key (key field).

Downloaded by Aditi Gusain ([email protected])

lOMoARcPSD|34736634

COGROUP: COGROUP is the same as the GROUP operator. For readability,

programmers usually use GROUP when only one relation is involved and
COGROUP when multiple relations are reinvolved.
Diagnostic Operator :
The load statement will simply load the data into the specified relation in
Apache Pig.
To verify the execution of the Load statement, you have to use the Diagnostic
Operators.
Some Diagnostic Operators are :
DUMP: The DUMP operator is used to run Pig Latin statements and display
the results on the screen.
DESCRIBE: Use the DESCRIBE operator to review the schema of a particular
relation. The DESCRIBE operator is best used for debugging a script.
ILLUSTRATE: ILLUSTRATE operator is used to review how data is
transformed through a sequence of Pig Latin statements. ILLUSTRATE
command is your best friend when it comes to debugging a script.
EXPLAIN: The EXPLAIN operator is used to display the logical, physical, and
MapReduce execution plans of a relation.

Hive
Apache Hive Architecture :

Downloaded by Aditi Gusain ([email protected])

lOMoARcPSD|34736634

The above figure shows the architecture of Apache Hive and its major
components.
The major components of Apache Hive are :
1. Hive Client
2. Hive Services
3. Processing and Resource Management
4. Distributed Storage
HIVE CLIENT :
Hive supports applications written in any language like Python, Java, C++,
Ruby, etc using JDBC, ODBC, and Thrift drivers, for performing queries on the
Hive.
Hence, one can easily write a hive client application in any language of its own
choice.
Hive clients are categorized into three types :
1. Thrift Clients : The Hive server is based on Apache Thrift so that it can
serve the request from a thrift client.

Downloaded by Aditi Gusain ([email protected])

lOMoARcPSD|34736634

2. JDBC client : Hive allows for the Java applications to connect to it using
the JDBC driver. JDBC driver uses Thrift to communicate with the Hive
Server.
3. ODBC client : Hive ODBC driver allows applications based on the
ODBC protocol to connect to Hive. Similar to the JDBC driver, the ODBC
driver uses Thrift to communicate with the Hive Server.
HIVE SERVICE :
To perform all queries, Hive provides various services like the Hive server2,
Beeline, etc.
The various services offered by Hive are :
1. Beeline
2. Hive Server 2
3. Hive Driver
4. Hive Compiler
5. Optimizer
6. Execution Engine
7. Metastore
8. HCatalog
9. WebHCat
PROCESSING AND RESOURCE MANAGEMENT :
Hive internally uses a MapReduce framework as a defacto engine for
executing the queries.
MapReduce is a software framework for writing those applications that
process a massive amount of data in parallel on the large clusters of
commodity hardware.
MapReduce job works by splitting data into chunks, which are processed by
map-reduce tasks.

DISTRIBUTED STORAGE :
Hive is built on top of Hadoop, so it uses the underlying Hadoop Distributed
File System for the distributed storage.

Downloaded by Aditi Gusain ([email protected])

lOMoARcPSD|34736634

Hive Shell :
Hive shell is a primary way to interact with hive.
It is a default service in the hive.
It is also called CLI (command line interference).
Hive shell is similar to MySQL Shell.
Hive users can run HQL queries in the hive shell.
In hive shell up and down arrow keys are used to scroll previous commands.
HiveQL is case-insensitive (except for string comparisons).
The tab key will autocomplete (provides suggestions while you type into the
field) Hive keywords and functions.
Hive Shell can run in two modes :
Non-Interactive mode :
Non-interactive mode means run shell scripts in administer zone.
Hive Shell can run in the non-interactive mode, with the -f option.
Example:
$hive -f script.q, Where script. q is a file.
Interactive mode :
The hive can work in interactive mode by directly typing the command “hive” in
the terminal.
Example:
$hive
Hive> show databases;

Hive Services :
The following are the services provided by Hive :
· Hive CLI: The Hive CLI (Command Line Interface) is a shell where we
can execute Hive queries and commands.

Downloaded by Aditi Gusain ([email protected])

lOMoARcPSD|34736634

· Hive Web User Interface: The Hive Web UI is just an alternative of Hive
CLI. It provides a web-based GUI for executing Hive queries and commands.
· Hive metastore: It is a central repository that stores all the structure
information of various tables and partitions in the warehouse. It also includes
metadata of column and its type information, the serializers and deserializers
which is used to read and write data and the corresponding HDFS files where
the data is stored.
· Hive Server: It is referred to as Apache Thrift Server. It accepts the
request from different clients and provides it to Hive Driver.
· Hive Driver: It receives queries from different sources like web UI, CLI,
Thrift, and JDBC/ODBC driver. It transfers the queries to the compiler.
· Hive Compiler: The purpose of the compiler is to parse the query and
perform semantic analysis on the different query blocks and expressions. It
converts HiveQL statements into MapReduce jobs.
· Hive Execution Engine: Optimizer generates the logical plan in the form
of DAG of map-reduce tasks and HDFS tasks. In the end, the execution
engine executes the incoming tasks in the order of their dependencies.

MetaStore :

Hive metastore (HMS) is a service that stores Apache Hive and other
metadata in a backend RDBMS, such as MySQL or PostgreSQL.
Impala, Spark, Hive, and other services share the metastore.
The connections to and from HMS include HiveServer, Ranger, and the
NameNode, which represents HDFS.
Beeline, Hue, JDBC, and Impala shell clients make requests through thrift or
JDBC to HiveServer.
The HiveServer instance reads/writes data to HMS.
By default, redundant HMS operate in active/active mode.
The physical data resides in a backend RDBMS, one for HMS.
All connections are routed to a single RDBMS service at any given time.
HMS talks to the NameNode over thrift and functions as a client to HDFS.
HMS connects directly to Ranger and the NameNode (HDFS), and so does
HiveServer.

Downloaded by Aditi Gusain ([email protected])

lOMoARcPSD|34736634

One or more HMS instances on the backend can talk to other services, such
as Ranger.
Comparison with Traditional Database :

RDBMS HIVE

It is used to maintain the database. It is used to maintain a data warehouse.

It uses SQL (Structured Query

It uses HQL (Hive Query Language).
Language).

Schema is fixed in RDBMS Schema varies in it.

Normalized and de-normalized both type of data

Normalized data is stored.
is stored.

Tables in rdms are sparse. The table in hive is dense.

It doesn’t support partitioning. It supports automation partition.

No partition method is used The sharding method is used for partition

HiveQL :
Even though based on SQL, HiveQL does not strictly follow the full SQL-92
standard.
HiveQL offers extensions not in SQL, including multitable inserts and create
table as select.
HiveQL lacked support for transactions and materialized views and only
limited subquery support.
Support for insert, update, and delete with full ACID functionality was made
available with release 0.14.
Internally, a compiler translates HiveQL statements into a directed acyclic
graph of MapReduce Tez, or Spark jobs, which are submitted to Hadoop for
execution.
Example :
DROP TABLE IF EXISTS docs;
CREATE TABLE docs

Downloaded by Aditi Gusain ([email protected])

lOMoARcPSD|34736634

(line STRING);

Checks if table docs exist and drop it if it does. Creates a new table
called docs with a single column of type STRING called line.
LOAD DATA INPATH 'input_file' OVERWRITE INTO TABLE docs;

Loads the specified file or directory (In this case “input_file”) into the table.
OVERWRITE specifies that the target table to which the data is being loaded
is to be re-written; Otherwise, the data would be appended.
CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
(SELECT explode(split(line, '\s')) AS word FROM docs) temp
GROUP BY word
ORDER BY word;

The query CREATE TABLE word_counts AS SELECT word, count(1) AS

count creates a table called word_counts with two columns: word and count.
This query draws its input from the inner
query (SELECT explode(split(line, '\s')) AS word FROM docs) temp".
This query serves to split the input words into different rows of a temporary
table aliased as temp.
The GROUP BY WORD groups the results based on their keys.
This results in the count column holding the number of occurrences for each
word of the word column.
The ORDER BY WORDS sorts the words alphabetically.
Tables :
Here are the types of tables in Apache Hive:
Managed Tables :

In a managed table, both the table data and the table schema are managed
by Hive.
The data will be located in a folder named after the table within the Hive data
warehouse, which is essentially just a file location in HDFS.
By managed or controlled we mean that if you drop (delete) a managed table,
then Hive will delete both the Schema (the description of the table) and the
data files associated with the table.
Default location is /user/hive/warehouse.
The syntax for Managed Tables :

Downloaded by Aditi Gusain ([email protected])

lOMoARcPSD|34736634

CREATE TABLE IF NOT EXISTS stocks

(exchange STRING,
symbol STRING,
price_open FLOAT,
price_high FLOAT,
price_low FLOAT,
price_adj_close FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ;

External Tables :
An external table is one where only the table schema is controlled by Hive.
In most cases, the user will set up the folder location within HDFS and copy
the data file(s) there.
This location is included as part of the table definition statement.
When an external table is deleted, Hive will only delete the schema associated
with the table.
The data files are not affected.
Syntax for External Tables :
CREATE EXTERNAL TABLE IF NOT EXISTS stocks
(exchange STRING,
symbol STRING,
price_open FLOAT,
price_high FLOAT,
price_low FLOAT,
price_adj_close FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/stocks';

Querying Data :
A query is a request for data or information from a database table or a
combination of tables.
This data may be generated as results returned by Structured Query
Language (SQL) or as pictorials, graphs or complex results, e.g., trend
analyses from data-mining tools.
One of several different query languages may be used to perform a range of
simple to complex database queries.
SQL, the most well-known and widely-used query language, is familiar to most
database administrators (DBAs)

Downloaded by Aditi Gusain ([email protected])

lOMoARcPSD|34736634

User-Defined Functions :
In Hive, the users can define their own functions to meet certain client
requirements.
These are known as UDFs in Hive.
User-Defined Functions written in Java for specific modules.
Some of UDFs are specifically designed for the reusability of code in
application frameworks.
The developer will develop these functions in Java and integrate those UDFs
with the Hive.
During the Query execution, the developer can directly use the code, and
UDFs will return outputs according to the user-defined tasks.
It will provide high performance in terms of coding and execution.
The general type of UDF will accept a single input value and produce a single
output value.
We can use two different interfaces for writing Apache Hive User-Defined
Functions :
1. Simple API
2. Complex API
Sorting And Aggregating :

Sorting data in Hive can be achieved by use of a standard ORDER BY clause,

but there is a catch.
ORDER BY produces a result that is totally sorted, as expected, but to do so it
sets the number of reducers to one, making it very inefficient for large
datasets.
When a globally sorted result is not required and in many cases it isn’t, then
you can use Hive’s nonstandard extension, SORT BY instead.
SORT BY produces a sorted file per reducer.
If you want to control which reducer a particular row goes to, typically so you
can perform some subsequent aggregation.
This is what Hive’s DISTRIBUTE BY clause does.
Example :

Downloaded by Aditi Gusain ([email protected])

lOMoARcPSD|34736634

· To sort the weather dataset by year and temperature, in such a way to

ensure that all the rows for a given year end up in the same reducer partition :
Hive> FROM records2
> SELECT year, temperature
> DISTRIBUTE BY year
> SORT BY year ASC,
temperature DESC;

· Output :
1949 111
1949 78
1950 22
1950 0
1950 -11

MapReduce Scripts in Hive / Hive Scripts :

Similar to any other scripting language, Hive scripts are used to execute a set
of Hive commands collectively.
Hive scripting helps us to reduce the time and effort invested in writing and
executing the individual commands manually.
Hive scripting is supported in Hive 0.10.0 or higher versions of Hive.
Joins and SubQueries :
JOINS :
Join queries can perform on two tables present in Hive.
Joins are of 4 types, these are :
· Inner join: The Records common to both tables will be retrieved by this
Inner Join.
· Left outer Join: Returns all the rows from the left table even though there
are no matches in the right table.
· Right Outer Join: Returns all the rows from the Right table even though
there are no matches in the left table.

Downloaded by Aditi Gusain ([email protected])

lOMoARcPSD|34736634

· Full Outer Join: It combines records of both the tables based on the JOIN
Condition given in the query. It returns all the records from both tables and fills
in NULL Values for the columns missing values matched on either side.

SUBQUERIES :
A Query present within a Query is known as a subquery.
The main query will depend on the values returned by the subqueries.
Subqueries can be classified into two types :
· Subqueries in FROM clause
· Subqueries in WHERE clause
When to use :
· To get a particular value combined from two column values from different
tables.
· Dependency of one table values on other tables.
· Comparative checking of one column values from other tables.
Syntax :

Subquery in FROM clause

SELECT <column names 1, 2...n>From (SubQuery) <TableName_Main >
Subquery in WHERE clause
SELECT <column names 1, 2...n> From<TableName_Main>WHERE col1 IN (SubQuery);

HBASE
HBase Concepts :
HBase is a distributed column-oriented database built on top of the Hadoop
file system.
It is an open-source project and is horizontally scalable.
HBase is a data model that is similar to Google’s big table designed to provide
quick random access to huge amounts of structured data.
It leverages the fault tolerance provided by the Hadoop File System (HDFS).
It is a part of the Hadoop ecosystem that provides random real-time read/write
access to data in the Hadoop File System.

Downloaded by Aditi Gusain ([email protected])

lOMoARcPSD|34736634

One can store the data in HDFS either directly or through HBase.
Data consumer reads/accesses the data in HDFS randomly using HBase.
HBase sits on top of the Hadoop File System and provides read and write
access.

HBase Vs RDBMS :
RDBMS HBase

It requires SQL (structured query language) NO SQL

It has a fixed schema No fixed schema

It is row-oriented It is column-oriented

It is not scalable It is scalable

It is static in nature Dynamic in nature

Slower retrieval of data Faster retrieval of data

It follows the ACID (Atomicity, Consistency, It follows CAP (Consistency, Availability,

Isolation and Durability) property. Partition-tolerance) theorem.

It can handle structured, unstructured as

It can handle structured data
well as semi-structured data

It cannot handle sparse data It can handle sparse data

Schema Design :
HBase table can scale to billions of rows and any number of columns based
on your requirements.
This table allows you to store terabytes of data in it.
The HBase table supports the high read and writes throughput at low latency.
A single value in each row is indexed; this value is known as the row key.
The HBase schema design is very different compared to the relational
database schema design.
Some of the general concepts that should be followed while designing schema
in Hbase:
· Row key: Each table in the HBase table is indexed on the row key.
There are no secondary indices available on the HBase table.
· Automaticity: Avoid designing a table that requires atomicity across all
rows. All operations on HBase rows are atomic at row level.

Downloaded by Aditi Gusain ([email protected])

lOMoARcPSD|34736634

· Even distribution: Read and write should be uniformly distributed

across all nodes available in the cluster. Design row key in such a way that,
related entities should be stored in adjacent rows to increase read efficacy.

Zookeeper :
ZooKeeper is a distributed coordination service that also helps to manage a
large set of hosts.
Managing and coordinating a service especially in a distributed environment is
a complicated process, so ZooKeeper solves this problem due to its simple
architecture as well as API.
ZooKeeper allows developers to focus on core application logic.
For instance, to track the status of distributed data, Apache HBase uses
ZooKeeper.
They can also support a large Hadoop cluster easily.
To retrieve information, each client machine communicates with one of the
servers.
It keeps an eye on the synchronization as well as coordination across the
cluster
There is some best Apache ZooKeeper feature :
· Simplicity: With the help of a shared hierarchical namespace, it
coordinates.
· Reliability: The system keeps performing, even if more than one node
fails.
· Speed: In the cases where ‘Reads’ are more common, it runs with the
ratio of 10:1.
· Scalability: By deploying more machines, the performance can be
enhanced.

IBM Big Data Strategy :

Downloaded by Aditi Gusain ([email protected])

lOMoARcPSD|34736634

IBM, a US-based computer hardware and software manufacturer, had

implemented a Big Data strategy.
Where the company offered solutions to store, manage, and analyze the huge
amounts of data generated daily and equipped large and small companies to
make informed business decisions.
The company believed that its Big Data and analytics products and services
would help its clients become more competitive and drive growth.
Issues :
· Understand the concept of Big Data and its importance to large,
medium, and small companies in the current industry scenario.
· Understand the need for implementing a Big Data strategy and the
various issues and challenges associated with this.
· Analyze the Big Data strategy of IBM.
· Explore ways in which IBM’s Big Data strategy could be improved
further.

Introduction to InfoSphere :
InfoSphere Information Server provides a single platform for data integration
and governance.
The components in the suite combine to create a unified foundation for
enterprise information architectures, capable of scaling to meet any
information volume requirements.
You can use the suite to deliver business results faster while maintaining data
quality and integrity throughout your information landscape.
InfoSphere Information Server helps your business and IT personnel
collaborate to understand the meaning, structure, and content of information
across a wide variety of sources.
By using InfoSphere Information Server, your business can access and use
information in new ways to drive innovation, increase operational efficiency,
and lower risk.
BigInsights :

BigInsights is a software platform for discovering, analyzing, and visualizing

data from disparate sources.

Downloaded by Aditi Gusain ([email protected])

lOMoARcPSD|34736634

The flexible platform is built on an Apache Hadoop open-source framework

that runs in parallel on commonly available, low-cost hardware.

Big Sheets :

BigSheets is a browser-based analytic tool included in the InfoSphere

BigInsights Console that you use to break large amounts of unstructured data
into consumable, situation-specific business contexts.
These deep insights help you to filter and manipulate data from sheets even
further.

Intro to Big SQL :

IBM Big SQL is a high performance massively parallel processing (MPP) SQL
engine for Hadoop that makes querying enterprise data from across the
organization an easy and secure experience.
A Big SQL query can quickly access a variety of data sources including
HDFS, RDBMS, NoSQL databases, object stores, and WebHDFS by using a
single database connection or single query for best-in-class analytic
capabilities.
Big SQL provides tools to help you manage your system and your databases,
and you can use popular analytic tools to visualize your data.
Big SQL's robust engine executes complex queries for relational data and
Hadoop data.
Big SQL provides an advanced SQL compiler and a cost-based optimizer for
efficient query execution.
Combining these with massive parallel processing (MPP) engine helps
distribute query execution across nodes in a cluster.

Downloaded by Aditi Gusain ([email protected])

Exploring Hadoop Ecosystem (Volume 2): Stream Processing
From Everand
Exploring Hadoop Ecosystem (Volume 2): Stream Processing
Wei Liu
No ratings yet
Notes
No ratings yet
Notes
19 pages
Notes Unit 5 Bigdata
No ratings yet
Notes Unit 5 Bigdata
19 pages
big-data-unit-5-big-data-notes-of-unit-5
No ratings yet
big-data-unit-5-big-data-notes-of-unit-5
16 pages
Big Data Unit-5
No ratings yet
Big Data Unit-5
9 pages
bda-unit-4-060115-big-data-analytics-unit-4
No ratings yet
bda-unit-4-060115-big-data-analytics-unit-4
19 pages
Bda Unit 4 060115 Big Data Analytics Unit 4
No ratings yet
Bda Unit 4 060115 Big Data Analytics Unit 4
19 pages
Notes Unit 5 Bigdata
No ratings yet
Notes Unit 5 Bigdata
21 pages
UNIT-5
No ratings yet
UNIT-5
24 pages
Unit IV - Big Data Programming
No ratings yet
Unit IV - Big Data Programming
17 pages
KCS 061 - Big Data - Unit V
No ratings yet
KCS 061 - Big Data - Unit V
17 pages
Unit v Notes
No ratings yet
Unit v Notes
17 pages
Unit V-Apache Pig
No ratings yet
Unit V-Apache Pig
10 pages
UNIT 5 Complete Notes
No ratings yet
UNIT 5 Complete Notes
21 pages
Notes - 5 Unit Big Data
No ratings yet
Notes - 5 Unit Big Data
22 pages
Apache PIG.pptx
No ratings yet
Apache PIG.pptx
41 pages
Unit 4
No ratings yet
Unit 4
29 pages
Unit 5 Lecture No-2(PIG)
No ratings yet
Unit 5 Lecture No-2(PIG)
101 pages
BDA - Unit-4 Part 1
No ratings yet
BDA - Unit-4 Part 1
47 pages
Big_Data_Unit-5
No ratings yet
Big_Data_Unit-5
81 pages
Unit IV EBDP 22
No ratings yet
Unit IV EBDP 22
97 pages
BDA_UNIT_IV_NOTES (1)
No ratings yet
BDA_UNIT_IV_NOTES (1)
32 pages
BDA-Unit 5-notes
No ratings yet
BDA-Unit 5-notes
36 pages
BIGDATUNIT5
No ratings yet
BIGDATUNIT5
32 pages
Hadoop Week 5
No ratings yet
Hadoop Week 5
78 pages
Unit-5 (1) BD
No ratings yet
Unit-5 (1) BD
18 pages
BDA Module 4 - Part 1 (Pig) 2023
No ratings yet
BDA Module 4 - Part 1 (Pig) 2023
34 pages
PIG A Big Data Processor
No ratings yet
PIG A Big Data Processor
49 pages
Unit-4_PIG_
No ratings yet
Unit-4_PIG_
9 pages
BigData2
No ratings yet
BigData2
3 pages
pig
No ratings yet
pig
23 pages
Unit 5(Pig,Hive,Hbase)
No ratings yet
Unit 5(Pig,Hive,Hbase)
18 pages
5 PIG and HIVE
No ratings yet
5 PIG and HIVE
81 pages
Hadoop Pig
No ratings yet
Hadoop Pig
111 pages
bda unit 4
No ratings yet
bda unit 4
16 pages
Big Data Notes Pig
No ratings yet
Big Data Notes Pig
38 pages
BDA Unit - IV
No ratings yet
BDA Unit - IV
81 pages
Unit 5
No ratings yet
Unit 5
76 pages
notes of aktu btech 3 yr big data
No ratings yet
notes of aktu btech 3 yr big data
15 pages
BD 5
No ratings yet
BD 5
28 pages
Introduction To Pig: SESSION 2016-2017
No ratings yet
Introduction To Pig: SESSION 2016-2017
44 pages
Unit 4 Bba
No ratings yet
Unit 4 Bba
10 pages
Apache PIG by Sravanthi
No ratings yet
Apache PIG by Sravanthi
31 pages
BDA unit5
No ratings yet
BDA unit5
36 pages
unit 5 short
No ratings yet
unit 5 short
14 pages
BDA-V
No ratings yet
BDA-V
10 pages
PIG: A Big Data Processor: Tushar B. Kute
No ratings yet
PIG: A Big Data Processor: Tushar B. Kute
50 pages
BDA Unit-4-PPT
No ratings yet
BDA Unit-4-PPT
98 pages
Pig_2
No ratings yet
Pig_2
63 pages
Pig Viva Ques
No ratings yet
Pig Viva Ques
6 pages
Apache Pig in noSql Databases
No ratings yet
Apache Pig in noSql Databases
5 pages
Unit 5
No ratings yet
Unit 5
39 pages
Pig and Pig Latin
No ratings yet
Pig and Pig Latin
16 pages
Pig Hive
No ratings yet
Pig Hive
59 pages
unit-4-apachepig-210825041412
No ratings yet
unit-4-apachepig-210825041412
16 pages
BDA-NOTES-JNTUK-R20-UNIT-4
No ratings yet
BDA-NOTES-JNTUK-R20-UNIT-4
14 pages
Pig Hive
No ratings yet
Pig Hive
58 pages
Session 3.3
No ratings yet
Session 3.3
30 pages
6 part2
No ratings yet
6 part2
45 pages
Learning Hadoop 2
From Everand
Learning Hadoop 2
Garry Turkington
4/5 (1)
5.15.3 Spring Example in MyEclipse
No ratings yet
5.15.3 Spring Example in MyEclipse
3 pages
5.15.2 Spring Example in Eclipse
No ratings yet
5.15.2 Spring Example in Eclipse
3 pages
5.15.1 Getting Started Building Web Applications With Spring Boot and Kotlin
No ratings yet
5.15.1 Getting Started Building Web Applications With Spring Boot and Kotlin
14 pages
Chapter8 Abstraction
No ratings yet
Chapter8 Abstraction
8 pages
String
No ratings yet
String
19 pages
CS602 Quiz 1 Merged Midterm
No ratings yet
CS602 Quiz 1 Merged Midterm
485 pages
Advanced Tester Guide
No ratings yet
Advanced Tester Guide
8 pages
Adsense Criteo Direct Implementation Guide v1.0 PDF
No ratings yet
Adsense Criteo Direct Implementation Guide v1.0 PDF
10 pages
PWP - Chapter 1
No ratings yet
PWP - Chapter 1
53 pages
As
No ratings yet
As
4 pages
FORM 3 Output Devices Test
No ratings yet
FORM 3 Output Devices Test
2 pages
ATO PTO Basics in Oracle Applications
100% (1)
ATO PTO Basics in Oracle Applications
154 pages
C Output Prediction Questions
No ratings yet
C Output Prediction Questions
35 pages
THW - Hunting For Execution - Deck
No ratings yet
THW - Hunting For Execution - Deck
29 pages
Organization Profile
No ratings yet
Organization Profile
10 pages
Linear Programming: Interpretation of Solution & Sensitivity Analysis
No ratings yet
Linear Programming: Interpretation of Solution & Sensitivity Analysis
40 pages
Module 1 CSCE
No ratings yet
Module 1 CSCE
61 pages
10 Ways To Communicate More Effectively With Customers and Co-Workers
No ratings yet
10 Ways To Communicate More Effectively With Customers and Co-Workers
4 pages
Notes On Function and Modules
No ratings yet
Notes On Function and Modules
24 pages
Full Download How To Become The Worlds No 1 Hacker 2nd Edition Gregory D. Evans PDF DOCX
100% (1)
Full Download How To Become The Worlds No 1 Hacker 2nd Edition Gregory D. Evans PDF DOCX
84 pages
Thor VM3: User Guide
No ratings yet
Thor VM3: User Guide
344 pages
Diccionario Historico Cronologico, Geografico y Universal de La Santa Biblia T2 - Joseph Armesto y Goyanes 1790
No ratings yet
Diccionario Historico Cronologico, Geografico y Universal de La Santa Biblia T2 - Joseph Armesto y Goyanes 1790
397 pages
SentiBotics Trial Documentation
No ratings yet
SentiBotics Trial Documentation
10 pages
F-IoT Unit-4
No ratings yet
F-IoT Unit-4
101 pages
0.boundless - Introduction To The OpenGeo Suite
No ratings yet
0.boundless - Introduction To The OpenGeo Suite
6 pages
8 - Multimedia Authoring Tools
100% (9)
8 - Multimedia Authoring Tools
24 pages
Dell U2415 User's Guide: Model: U2415 Regulatory Model: U2415b
No ratings yet
Dell U2415 User's Guide: Model: U2415 Regulatory Model: U2415b
65 pages
S_4HANA 2022 FPS1 Fully-Activated Appliance standa... - SAP Community
No ratings yet
S_4HANA 2022 FPS1 Fully-Activated Appliance standa... - SAP Community
34 pages
Trace - 2022-10-08 19 - 35 - 31 572
No ratings yet
Trace - 2022-10-08 19 - 35 - 31 572
167 pages
Slope Stability Radar For Monitoring Mine Walls
No ratings yet
Slope Stability Radar For Monitoring Mine Walls
4 pages
Windowsserver
No ratings yet
Windowsserver
7 pages
Web-Security Considerations
No ratings yet
Web-Security Considerations
37 pages
Detection by Machine Learning
No ratings yet
Detection by Machine Learning
14 pages
Variable Block
No ratings yet
Variable Block
3 pages
The Kids Learn App Proposal Report
100% (1)
The Kids Learn App Proposal Report
14 pages

unit5-part1-notes

Uploaded by

unit5-part1-notes

Uploaded by

lOMoARcPSD|34736634

Unit5 part1 - Notes

Big Data (Dr. A.P.J. Abdul Kalam Technical University)

Scan to open on Studocu

Studocu is not sponsored or endorsed by any college or university

Application of Big Data using :

1. For exploring large datasets Pig Scripting is used.

Downloaded by Aditi Gusain ([email protected])

Execution Modes of Pig :

Downloaded by Aditi Gusain ([email protected])

In Apache Pig, the schema is optional. We can store

The data model used in SQL

Downloaded by Aditi Gusain ([email protected])

It is a textual language that abstracts the programming from the Java

 The filter functions are used as conditions in filter statements.

 The Eval functions are used in FOREACH-GENERATE statements.

Downloaded by Aditi Gusain ([email protected])

 The Algebraic functions act on inner bags in a FOREACHGENERATE

Data Processing Operators :

Downloaded by Aditi Gusain ([email protected])

COGROUP: COGROUP is the same as the GROUP operator. For readability,

Downloaded by Aditi Gusain ([email protected])

Downloaded by Aditi Gusain ([email protected])

Downloaded by Aditi Gusain ([email protected])

Downloaded by Aditi Gusain ([email protected])

Downloaded by Aditi Gusain ([email protected])

It is used to maintain the database. It is used to maintain a data warehouse.

It uses SQL (Structured Query

Schema is fixed in RDBMS Schema varies in it.

Normalized and de-normalized both type of data

Tables in rdms are sparse. The table in hive is dense.

It doesn’t support partitioning. It supports automation partition.

No partition method is used The sharding method is used for partition

Downloaded by Aditi Gusain ([email protected])

The query CREATE TABLE word_counts AS SELECT word, count(1) AS

Downloaded by Aditi Gusain ([email protected])

CREATE TABLE IF NOT EXISTS stocks

Downloaded by Aditi Gusain ([email protected])

Sorting data in Hive can be achieved by use of a standard ORDER BY clause,

Downloaded by Aditi Gusain ([email protected])

· To sort the weather dataset by year and temperature, in such a way to

MapReduce Scripts in Hive / Hive Scripts :

Downloaded by Aditi Gusain ([email protected])

Subquery in FROM clause

Downloaded by Aditi Gusain ([email protected])

It requires SQL (structured query language) NO SQL

It has a fixed schema No fixed schema

It is not scalable It is scalable

It is static in nature Dynamic in nature

Slower retrieval of data Faster retrieval of data

It follows the ACID (Atomicity, Consistency, It follows CAP (Consistency, Availability,

It can handle structured, unstructured as

It cannot handle sparse data It can handle sparse data

Downloaded by Aditi Gusain ([email protected])

· Even distribution: Read and write should be uniformly distributed

IBM Big Data Strategy :

Downloaded by Aditi Gusain ([email protected])

IBM, a US-based computer hardware and software manufacturer, had

BigInsights is a software platform for discovering, analyzing, and visualizing

Downloaded by Aditi Gusain ([email protected])

The flexible platform is built on an Apache Hadoop open-source framework

BigSheets is a browser-based analytic tool included in the InfoSphere

Intro to Big SQL :

Downloaded by Aditi Gusain ([email protected])

You might also like