Big Data Unit-5
Big Data
APPLICATIONS ON BIG DATA USING PIG,
HIVE AND HBASE
• Pig:
• ETL (Extract, Transform, Load): Pig is often used for data preparation tasks
such as cleaning, transforming, and aggregating large datasets before they are
loaded into a data warehouse or processed further.
• Data Processing Pipelines: Pig enables the creation of complex data
processing pipelines using its scripting language, Pig Latin. These pipelines can
handle large volumes of data efficiently.
• Data Analysis: Pig can be used for exploratory data analysis tasks, allowing
analysts to quickly prototype and test data processing workflows.
APPLICATIONS ON BIG DATA USING PIG,
HIVE AND HBASE
• Hive:
• Data Warehousing: Hive provides a SQL-like interface (HiveQL) for querying
and analyzing data stored in Hadoop's distributed file system (HDFS). It is
commonly used for creating data warehouses and data lakes.
• Ad Hoc Queries: Analysts and data scientists can use Hive to run ad hoc
queries on large datasets stored in HDFS, without needing to know complex
MapReduce programming.
• Batch Processing: Hive supports batch processing of data, making it suitable
for tasks like log analysis, data mining, and reporting.
APPLICATIONS ON BIG DATA USING PIG,
HIVE AND HBASE
• HBase:
• Real-time Data Storage: HBase is a distributed, scalable, and column-oriented
database built on top of Hadoop. It is optimized for storing and retrieving
large volumes of data in real-time.
• NoSQL Database: HBase is commonly used as a NoSQL database for
applications that require low-latency data access and flexible schema design.
• Time-Series Data Storage: HBase is well-suited for storing time-series data
such as sensor readings, clickstream data, and social media interactions,
where data needs to be stored and retrieved based on timestamps.
APPLICATIONS ON BIG DATA USING PIG,
HIVE AND HBASE
• Applications that leverage all three technologies together might look
something like this (a short HiveQL sketch follows the list):
• Data Ingestion: Use Pig to preprocess raw data, clean it, and transform it into
a structured format.
• Data Storage: Store the processed data in HDFS.
• Data Querying and Analysis: Use Hive to create tables and run SQL-like
queries on the data stored in HDFS.
• Real-Time Access: Store frequently accessed or real-time data in HBase for
fast retrieval.
• Analytics: Perform complex analytics and machine learning tasks on the data
using tools like Spark or MapReduce, with data sourced from HDFS or HBase.
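As a sketch of the "Data Storage" and "Data Querying and Analysis" steps above, a Hive external table can be declared over the HDFS directory where a Pig job wrote its cleaned output; the path, delimiter, and column layout below are illustrative assumptions, not part of the original material.
-- Hive external table over data that a Pig job wrote to HDFS (path and schema are assumed)
CREATE EXTERNAL TABLE employees (
  id INT,
  name STRING,
  age INT,
  department_id INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/cleaned/employees';
-- The stored data can then be queried with HiveQL
SELECT department_id, COUNT(*) AS employee_count
FROM employees
GROUP BY department_id;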
PIG
• Pig is a high-level platform that allows developers to create complex data
transformations using a high-level language called Pig Latin, which is then converted into
a series of MapReduce jobs to be executed on Hadoop.
• Pig represents Big Data as data flows.
• Pig is used to process large datasets.
• First, to process data stored in HDFS, programmers write scripts using the Pig Latin
language (a short example follows this list).
• Internally, the Pig Engine (a component of Apache Pig) converts these scripts into
MapReduce jobs.
• These jobs are not visible to programmers, which provides a high level of abstraction.
• The results of a Pig job are stored in HDFS.
• Programmers can use Pig to write data transformations without knowing Java. Pig uses
both structured and unstructured data as input to perform analytics.
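A minimal Pig Latin sketch of such a script; the input path, delimiter, and field names are assumptions for illustration only.
-- Load raw data from HDFS into a relation
emp = LOAD '/data/raw/employees.csv' USING PigStorage(',')
      AS (id:int, name:chararray, age:int, department_id:int);
-- Keep only employees older than 25
adults = FILTER emp BY age > 25;
-- Count employees per department
grouped = GROUP adults BY department_id;
counts = FOREACH grouped GENERATE group AS department_id, COUNT(adults) AS employee_count;
-- Results are written back to HDFS
STORE counts INTO '/data/output/employee_counts';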
FEATURES OF PIG
• Pig Latin: Pig Latin is a dataflow scripting language used to express data transformations.
It provides a rich set of operators and functions for manipulating structured and semi-
structured data.
• Optimization: Pig automatically optimizes Pig Latin scripts to improve performance by
optimizing the execution plan, minimizing data movement, and parallelizing
computations whenever possible.
• Extensibility: Pig is designed to be extensible, allowing developers to create custom user-
defined functions (UDFs) in Java, Python, or other languages to perform specialized
processing tasks.
• Integration with Hadoop: Pig seamlessly integrates with the Hadoop ecosystem,
allowing it to read and write data from and to Hadoop's distributed file system (HDFS)
and process data stored in HDFS using MapReduce.
• Ease of Use: Pig's scripting language is designed to be intuitive and easy to learn for
developers familiar with SQL, scripting languages, or data processing concepts.
FEATURES OF PIG
• Apache Pig provides a rich set of operators for operations such as filtering,
joining, sorting, and aggregation.
• Join operations are easy to express in Apache Pig.
• Pig scripts require fewer lines of code than equivalent MapReduce programs.
• Apache Pig allows splits in the data pipeline (see the sketch after this list).
• By integrating with other components of the Apache Hadoop
ecosystem, such as Apache Hive, Apache Spark, and Apache
ZooKeeper, Apache Pig enables users to take advantage of these
components’ capabilities while transforming data.
• Pig's data model is richer than flat relations: it supports multivalued and nested structures such as tuples, bags, and maps.
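A brief Pig Latin sketch of the JOIN and SPLIT operators mentioned above; the relation and field names are assumed for illustration.
-- Join two relations on a shared key
joined = JOIN employees BY department_id, departments BY id;
-- Split one relation into two branches of the pipeline
SPLIT employees INTO juniors IF age < 30, seniors IF age >= 30;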
EXECUTION MODES OF PIG
• Local Mode:
• In Local Mode, Pig runs on a single machine without using Hadoop's
distributed processing capabilities.
• It is useful for testing and debugging Pig scripts on small datasets, as it
provides faster execution and a simpler development environment.
• Local Mode is not suitable for processing large datasets since it does not take
advantage of Hadoop's scalability.
• In local mode, the Pig engine reads its input from the local (Linux) file system and
writes its output to the same file system (see the example command below).
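For example, assuming a script file named myscript.pig, local mode can be invoked from the command line as:
pig -x local myscript.pig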
EXECUTION MODES OF PIG
• MapReduce Mode:
• MapReduce Mode is the default execution mode for Pig.
• In this mode, Pig scripts are translated into MapReduce jobs, which are then
executed on a Hadoop cluster.
• MapReduce Mode leverages Hadoop's distributed processing framework to
process large volumes of data in parallel across multiple nodes in the cluster.
• It is suitable for processing large-scale datasets stored in Hadoop's distributed
file system (HDFS) and provides scalability and fault tolerance (see the example command below).
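The same (assumed) script can be submitted in MapReduce mode from the command line:
pig -x mapreduce myscript.pig
# equivalent, since MapReduce mode is the default:
pig myscript.pig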
PIG V/S MAPREDUCE
Apache Pig | MapReduce
Less development effort is needed. | More development effort is required.
Code efficiency is lower than hand-written MapReduce. | Code efficiency is higher than Pig.
Provides built-in functions for ordering, sorting, and union. | Data operations are hard to perform.
Allows nested data types such as map, tuple, and bag. | Does not allow nested data types.
PIG V/S SQL
Difference | Pig | SQL
Definition | Pig is a scripting language used to interact with HDFS. | SQL is a query language used to interact with databases residing in the database engine.
Query style | Pig offers a step-by-step execution style. | SQL offers a single-block execution style.
LOAD: in Pig, the LOAD operator loads data from the file system (local/HDFS) into a relation.
Note that a traditional RDBMS uses SQL (Structured Query Language), whereas Hive uses HiveQL (Hive Query Language).
• The LOAD DATA statement loads data into the designated table from an HDFS path,
while the INSERT INTO TABLE statement adds a specific row of data to a table
such as "employees".
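A short HiveQL sketch of both statements; the HDFS path, column layout, and row values are assumptions for illustration (INSERT ... VALUES requires a reasonably recent Hive version).
-- Bulk-load a file from HDFS into the employees table
LOAD DATA INPATH '/user/hive/staging/employees.csv' INTO TABLE employees;
-- Insert a single row (assumes columns id, name, age, department_id)
INSERT INTO TABLE employees VALUES (101, 'Asha', 29, 2);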
HiveQL
Querying Data with HiveQL
• One of the core functions of Apache Hive is querying data with HiveQL. Using
HiveQL, a language comparable to SQL, you can retrieve, filter, transform, and
analyze data stored in Hive tables.
• Following are a few typical HiveQL querying operations:
1. Select All Records:
SELECT * FROM employees;
This Hadoop HiveQL command retrieves all records from the
"employees" table.
HiveQL
2. Filtering:
Example: Select employees older than 25
SELECT * FROM employees WHERE age > 25;
This selects only those records from the "employees" table whose "age" is greater
than 25.
3. Aggregation:
Example: Count the number of employees
SELECT COUNT(*) FROM employees;
Example: Calculate the average age
SELECT AVG(age) FROM employees;
These Hadoop HiveQL queries count the number of employees and determine the
average age using aggregation operations on the "employees" table.
HiveQL
4. Sorting:
Example: Sort by age in descending order
SELECT * FROM employees ORDER BY age DESC;
This query returns all records from the "employees" table sorted by "age" in
descending order.
5. Joining Tables:
Example: Join employees and departments based on department_id
SELECT e.id, e.name, d.department
FROM employees e
JOIN departments d ON e.department_id = d.id;
The "department_id" column is used to link the "employees" and "departments"
databases in order to access employee names and their related departments.
HiveQL
6. Grouping and Aggregation:
Example: Count employees in each department
SELECT department, COUNT(*) as employee_count
FROM employees
GROUP BY department;
This query counts the number of employees in each department and organizes employees
by department.
7. Limiting Results:
Example: Get the top 10 oldest employees
SELECT * FROM employees ORDER BY age DESC LIMIT 10;
This query returns the ten oldest employees, ordered by age in descending order.
HiveQL
Data Filtering and Sorting
1. Data Filtering: You can use the WHERE clause to filter rows based on specific conditions.
Example: Select employees with marks greater than 60.
SELECT * FROM employees WHERE marks > 60;
This query returns all rows from the "employees" table whose "marks" value is greater
than 60.
2. Sorting Data: You can use the ORDER BY clause to order the result set according to one or more
columns.
Example: Sort employees by marks in increasing order.
SELECT * FROM employees ORDER BY marks ASC;
3. Combining Filtering and Sorting: To obtain particular subsets of data in a specified order, you can
combine filtering and sorting.
Example: Select employees with marks greater than 60 and sort them in increasing order.
SELECT * FROM employees WHERE marks > 60 ORDER BY marks ASC;
HiveQL
Data Transformations and Aggregations
1. Data Transformations: HiveQL provides a number of built-in functions for changing the
data in your query.
Example: Change the case of names
SELECT UPPER(name) as upper_case_name FROM employees;
This Hadoop HiveQL query pulls the "name" column from the "employees" table and uses
the UPPER function to change the names to uppercase.
2. Aggregations: Using functions like COUNT, SUM, AVG, and others, aggregates let you
condense data.
Example: Calculate the average age of the workforce.
SELECT AVG(age) as average_age FROM employees;
Using the AVG function, this query determines the average age of every employee in the
"employees" table.
HiveQL
3. Grouping and Aggregating: To group data into categories, the GROUP BY clause is used with
aggregate functions.
Example: Count the employees in each department.
SELECT department, COUNT(*) as employee_count
FROM employees
GROUP BY department;
The COUNT function is used in this query to count the number of employees in each department
and group the employees by the "department" column.
4. Filtering Before Aggregating: Before doing aggregations, data transformations and filtering might
be used.
Example: Calculate the average age of employees who are over 35.
SELECT AVG(age) as average_age
FROM employees
WHERE age > 35;
This HiveQL query first filters the "employees" table to those older than 35 and then
calculates the average age of that subset.
HiveQL
Joins and Subqueries
1. Joins: With the use of joins, you can merge rows from various tables based on a
shared column. The INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN are
examples of common join types.
Example: Retrieve employees and their corresponding departments using an inner join.
SELECT e.id, e.name, d.department
FROM employees e
JOIN departments d ON e.department_id = d.id;
Based on the "department_id" column, this query combines information from the
"employees" and "departments" tables to retrieve employee names and their
related departments.
HiveQL
2. Subqueries: A subquery is a query that is nested inside another query. The
SELECT, WHERE, and FROM clauses can all use them.
Example: Determine the average age of employees in each department using
a subquery in the SELECT clause.
SELECT department, (
SELECT AVG(age)
FROM employees e
WHERE e.department_id = d.id
) as avg_age
FROM departments d;
This query uses a subquery to determine the average age of employees for each
department in the "departments" table.
HiveQL
3. Correlated Subqueries: An inner query that depends on results from the
outer query is referred to as a correlated subquery.
Example: Find employees who are older than the average age of their department.
SELECT id, name
FROM employees e
WHERE age > (
SELECT AVG(age)
FROM employees
WHERE department_id = e.department_id
);
This query uses a correlated subquery to find employees who are older than the
average age of employees in the same department.
HBase
• HBase is a column-oriented non-relational database management system
that runs on top of Hadoop Distributed File System (HDFS), a main
component of Apache Hadoop.
• HBase provides a fault-tolerant way of storing sparse data sets, which are
common in many big data use cases. It is well suited for real-time data
processing or random read/write access to large volumes of data.
• Unlike relational database systems, HBase does not support a structured
query language like SQL; in fact, HBase isn’t a relational data store at all.
• HBase applications are written in Java, much like a typical Apache
MapReduce application. HBase also supports writing applications through Apache
Avro, REST, and Thrift interfaces.
HBase
• An HBase system is designed to scale linearly. It comprises a set of standard
tables with rows and columns, much like a traditional database. Each table must
have an element defined as a primary key, and all access attempts to HBase
tables must use this primary key.
• Avro, as a component, supports a rich set of primitive data types including:
numeric, binary data and strings; and a number of complex types including
arrays, maps, enumerations and records. A sort order can also be defined for the
data.
• HBase relies on ZooKeeper for high-performance coordination. ZooKeeper is built
into HBase, but if you’re running a production cluster, it’s suggested that you have
a dedicated ZooKeeper cluster that’s integrated with your HBase cluster.
• HBase works well with Hive, a query engine for batch processing of big data, to
enable fault-tolerant big data applications (a brief HBase shell example follows this list).
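A minimal HBase shell sketch of creating a table and accessing it by row key; the table name, column family, and values are assumptions for illustration.
create 'employees', 'info'
put 'employees', 'emp-101', 'info:name', 'Asha'
put 'employees', 'emp-101', 'info:age', '29'
# random read by row key
get 'employees', 'emp-101'
# bounded scan over the table
scan 'employees', {LIMIT => 5}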
HBASE V/s RDBMS
S. No. | Parameters | RDBMS | HBase
1 | SQL | Requires SQL (Structured Query Language). | SQL is not required.
2 | Data retrieval | Slower retrieval of data. | Faster retrieval of data.
3 | Sparse data | Cannot handle sparse data. | Can handle sparse data.
A common HBase schema design pattern nests child entities inside the parent row. Here the
row key corresponds to the parent entity ID, the OrderId. There is one column family for
the order data and one column family for the order items. The order items are nested: the
Order Item IDs are put into the column qualifiers, and any non-identifying attributes are
put into the cell values.
This kind of schema design is appropriate when the only way you get at the child entities
is via the parent entity.
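A minimal HBase shell sketch of this Order/Order-Items design; the table name, column family names, order and item IDs, and values are assumptions for illustration.
create 'orders', 'order', 'items'
# order-level attributes go into the 'order' column family
put 'orders', 'ORDER-001', 'order:customer', 'Carol'
put 'orders', 'ORDER-001', 'order:total', '59.90'
# each order item is nested: the item ID is the column qualifier, non-identifying attributes are the value
put 'orders', 'ORDER-001', 'items:ITEM-17', 'qty=2'
put 'orders', 'ORDER-001', 'items:ITEM-42', 'qty=1'
# one get on the parent row key returns the order together with all of its items
get 'orders', 'ORDER-001'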
A self-join is a relationship in which both match fields are defined in the same table.
Consider a schema for Twitter relationships, where the queries are: which users does
userX follow, and which users follow userX? Here is a possible solution: the user IDs are
put in a composite row key with the relationship type as a separator. For example, Carol
follows Steve Jobs and Carol is followed by BillyBob. This allows row-key scans for
everyone Carol follows (carol:follows) and everyone Carol is followed by (carol:followedby).
An example of such a Twitter table design is sketched below.
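A minimal sketch in the HBase shell, assuming a table named 'follows', a single column family 'f', and illustrative attribute values.
# composite row keys: user ID, relationship type, and target user ID
put 'follows', 'carol:follows:stevejobs', 'f:since', '2013-01-15'
put 'follows', 'carol:followedby:billybob', 'f:since', '2013-02-02'
# row-key prefix scan: everyone Carol follows
scan 'follows', {ROWPREFIXFILTER => 'carol:follows:'}
# row-key prefix scan: everyone who follows Carol
scan 'follows', {ROWPREFIXFILTER => 'carol:followedby:'}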
Designing for reads means aggressively de-normalizing data so that the data that is read together
is stored together.
In HBase, the row key provides the same data retrieval benefits as a primary index. So, when you
create a secondary index, use elements that are different from the row key.
Secondary indexes allow you to have a secondary way to read an HBase table. They provide a
way to efficiently access records by means of some piece of information other than the primary
key.
Secondary indexes require additional cluster space and processing, because creating and
maintaining them consumes both storage and processing cycles on every update.
A method of index maintenance, called Diff-Index, can help IBM® Big SQL to create secondary
indexes for HBase, maintain those indexes, and use indexes to speed up queries.
CASE STUDY: IBM AND BIG DATA
In the 1950s, John Hancock Mutual Life Insurance Co. collected 600 Megabytes of corporate data. This
was the largest amount of corporate data collected till then. The company was one of the pioneers of
digitization. It collected and stored information of two million policy holders on a Univac computing
system. During the 1960s, American Airlines developed a flight reservation system using IBM computing
systems and stored around 807 Megabytes of data. Federal Express, with its scanning and tracking,
collected 80 Gigabytes of data during the 1970s. In the 1980s, with its focus on analyzing ATM
transactions, CitiCorp gathered 450 Gigabytes of data....
Smarter Planet was a corporate initiative of IBM, which sought to highlight how government and business
leaders were capturing the potential of smarter systems to achieve economic and sustainable growth and
societal progress. In November 2008, in his speech at the Council on Foreign Relations, IBM’s Chairman,
CEO and President Sam Palmisano, outlined an agenda for building a ‘Smarter Planet’. He emphasized
how the world’s various systems – like traffic, water management, communication technology, smart
grids, healthcare solutions, and rail transportation – were struggling to function effectively....
IBM committed itself to Big Data and Analytics through sustained investments and strategic acquisitions.
In 2011, it invested US$100 million in the research and development of services and solutions that
facilitated Big Data analytics. In addition, it had been bringing together as many Big Data technologies as
possible under its roof. The Big Data strategy of the company was to combine a wide array of the Big
Data analytic solutions and conquer the Big Data market. The company’s goal was to offer the broadest
portfolio of products and solutions with the depth and breadth that no other company could match.......
In 2013, IBM was awarded the contract to support Thames Water Utilities Limited’s (Thames Water) Big
Data project. The UK government planned to install smart meters in every home by 2020. Using these
meters, the company would be able to collect a lot of data about the consumption patterns of its
customers. As a part of its next five-year plan, Thames Water planned to invest in Big Data analytics to
improve its operations, customer communication, services, and customer satisfaction using this data. It
chose IBM as an alliance partner for the project to support technology and innovation......
Over the years, consumer attention had shifted from radio, print, and television to the digital media as it
facilitated real-time engagement of consumers. Brands competed for consumer attention through such
media and relied on them for data and analytics for customer acquisition and retention and to offer tailor-
made products and services to them. However, some analysts believed that relying on such data was a
mistake. They said the real challenge with such data was that it was mostly unstructured and the difficulty
lay in structuring it and filtering the genuine data....
Addressing Challenges
IBM had brought in new systems, software, and services to complement its Big Data platform. With these
products it helped its customers to access and analyze data and use it to make informed decisions for the
betterment of their businesses. The Big Data solutions were also meant to protect data and identify and
restrict suspicious activity and block access to company data....
Looking Ahead
In the past few years, Big Data had been the most hyped technology trend, and by 2013 it had started to
gain acceptance as it held promising opportunities for businesses. It was showing its impact on the
healthcare, industrial, retail, and financial sectors to name a few. It enabled companies to run live
simulations of trading strategies, geological and astronomical data, and stock brokers could analyze
public sentiment about a company from social media. Emerging technologies such as Hadoop, NoSQL,
and Storm made such analytics possible. According to a Gartner survey in 2013, 64% of organizations
had invested or planned to invest in the technology, but only 8% of them had actually begun deployment.
Many businesses were in the process of gathering information as to which business problems Big Data
could solve for them.
IBM INFOSPHERE STREAMS
• Parallel and high-performance streams processing software platform that can scale
over a range of hardware environments
• Automated deployment of streams processing applications on configured hardware
• Incremental deployment without restarting to extend streams processing
applications
• Secure and auditable run time environment
InfoSphere Streams offers the IBM® Streams Processing Language (SPL) interface for
users to operate on data streams. SPL provides a language and runtime framework to
support streams processing applications. Users can create applications without needing
to understand the lower-level stream-specific operations. SPL provides numerous
operators, the ability to import data from outside InfoSphere Streams and export results
outside the system, and a facility to extend the underlying system with user-defined
operators. Many of the SPL built-in operators provide powerful relational functions
such as Join and Aggregate.
Starting with InfoSphere Streams Version 4.1, users can also develop streams processing
applications in other supported languages, such as Java™ or Scala. The Java
Application API (Topology Toolkit) supports creating streaming applications
for InfoSphere Streams in these programming languages.
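A minimal sketch using the Java Application API (Topology Toolkit); the class names follow the streamsx.topology toolkit, and the embedded single-JVM execution context shown here is an assumption suitable only for local testing, not a production submission.
// Minimal "hello" topology using the Java Application API (streamsx.topology)
import com.ibm.streamsx.topology.TStream;
import com.ibm.streamsx.topology.Topology;
import com.ibm.streamsx.topology.context.StreamsContextFactory;

public class HelloStreams {
    public static void main(String[] args) throws Exception {
        // Declare a streams processing application
        Topology topology = new Topology("HelloStreams");
        // A finite source stream of string tuples
        TStream<String> source = topology.strings("Hello", "InfoSphere", "Streams");
        // Print each tuple to standard output (a simple sink)
        source.print();
        // Run the topology in the embedded context for local testing
        StreamsContextFactory.getEmbedded().submit(topology).get();
    }
}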
Results from the running applications can be made available to applications that are
running external to InfoSphere Streams by using Sink operators or edge adapters. For
example, an application might use a TCPSink operator to send its results to an
external application that visualizes the results on a map. Alternatively, it might alert an
administrator to unusual or interesting events. InfoSphere Streams also provides many
edge adapters that can connect to external data sources for consuming or storing data.
• External resource managers
Resource managers that run separately from InfoSphere Streams and allocate externally
managed resources.
• Resources
Resources are physical and logical entities that InfoSphere Streams uses to run
services.
• Streams processing applications
The main components of streams processing applications are tuples, data streams,
operators, processing elements (PEs), and jobs.
• Views, charts, and tables
A view defines the set of attributes that can be displayed in a chart or table for a
specific viewable data stream.
IBM BIG SQL
With Big SQL, your organization can derive significant value from your enterprise data. Big SQL provides:
• Elastic boost technology to support more granular resource usage and increase
performance without increasing memory or CPU
• High-performance scans, inserts, updates, and deletes
• Deeper integration with Spark 2.1 than other SQL-on-Hadoop technologies
• Machine learning or graph analytics with Spark with a single security model
• Open Data Platform initiative (ODPi) compliance
• Advanced, ANSI-compliant SQL queries (a short example appears at the end of this section)
• Best practices
Best practices articles are available for a wide variety of use cases. Check this list to
find the best practices for your environment and tasks.
• System and software compatibility
The system and software compatibility report provides a complete list of supported
operating systems, system requirements, prerequisites, and optional supported
software for Big SQL v5.0.2.
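To illustrate the ANSI-compliant SQL support listed above, the following is a hedged sketch of a Big SQL table definition and query; the table name, columns, and storage format are assumptions for illustration.
-- Define a Big SQL table over Hadoop storage (Parquet format assumed)
CREATE HADOOP TABLE sales (
  id INT,
  region VARCHAR(20),
  amount DECIMAL(10,2)
)
STORED AS PARQUETFILE;
-- Query it with standard SQL
SELECT region, SUM(amount) AS total_amount
FROM sales
GROUP BY region
ORDER BY total_amount DESC;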