
Unit 3 (Big Data Analytics)

Data formats
The HDFS file formats supported are JSON, Avro and Parquet. The format is specified by setting the storage format value, which can be found on the Storage tab of the data store. For all HDFS files, the storage type (JSON, Avro, Parquet) is defined in the data store. JSON, Avro and Parquet formats can contain complex data types, such as Array or Object. During the Reverse Engineer phase, the schema definition for these types is converted to Avro and stored in the data format column of the attribute that has the complex data type. This information is used when flattening this data in the mappings.
For JSON, Avro and Parquet, each format requires the location of a schema file to be entered. For Delimited files, you need to specify the record and field separator information and the number of heading lines.
If you are loading Avro files into Hive, then you will need to copy the Avro schema file (.avsc) into the same HDFS location as the HDFS files.

Table 9-1 HDFS File Formats


File Format   Complex Type Support   Reverse Engineer        Load into Hive          Load into Spark   Write from Spark
Avro          Yes                    Yes (Schema required)   Yes (Schema required)   Yes               Yes
Delimited     No                     No                      Yes                     Yes               Yes
JSON          Yes                    Yes (Schema required)   Yes                     Yes               Yes
Parquet       Yes                    Yes (Schema required)   Yes                     Yes               Yes

Separate KMs for each file format are not required. You can create just one or two KMs for each target (a standard LKM and, where appropriate, a Direct Load LKM). The file can be either delimited or fixed format. The new LKM HDFS File to Hive supports loading only HDFS files into Hive; the file can be in JSON, Avro, Parquet, Delimited, or other formats, including complex data.
analyzing data with Hadoop
Big data is mostly generated from social media websites, sensors, devices, video/audio, networks, log
files and web, and much of it is generated in real time and on a very large scale. Big data analytics is the
process of examining this large amount of different data types, or big data, in an effort to uncover
hidden patterns, unknown correlations and other useful information.

Advantages of Big Data Analysis

Big data analysis allows market analysts, researchers and business users to develop deep insights from the available data, resulting in numerous business advantages. Business users are able to make a precise analysis of the data, and the key early indicators that emerge from this analysis can mean fortunes for the business. Some exemplary use cases are as follows:

Whenever users browse travel portals or shopping sites, search for flights or hotels, or add a particular item to their cart, ad-targeting companies can analyze this wide variety of data and activity and provide better recommendations to the user regarding offers, discounts and deals based on the user's browsing and purchase history.
In the telecommunications space, if customers are moving from one service provider to another, then by analyzing huge volumes of call data records, the various issues faced by customers can be unearthed. Issues could be as wide-ranging as a significant increase in call drops or network congestion problems. Based on an analysis of these issues, it can be identified whether a telecom company needs to place a new tower in a particular urban area, or whether it needs to revise its marketing strategy for a particular region because a new player has come up there. In this way, customer churn can be proactively minimized.

analyzing data with Hadoop


If your organization has a huge workload related to Big Data, i.e. a huge volume of generated data, you can implement Hadoop tools for easy and quick data management. Many enterprises are using these tools for data management and for answering complex queries.

Benefits
Hadoop has gained immense popularity with the rise of Big Data platforms that are capable of managing
huge volumes of data.

Using Hadoop, you can avail of the following benefits for your enterprise.

Cloud or on-premise services: Choosing between deploying services on-premise or moving them to the cloud is the starting point. Technology advancement and skill development are crucial to deploying the necessary infrastructure for your Big Data projects. The sooner you deploy cloud services, the higher the business value you receive.
Advanced analytical tools: Open-source projects as well as larger vendors develop data integration tools that work well with Hadoop. These tools allow the integration of structured data as well as big data to derive valuable insights.
The evolution of predictive analytics presents a huge scope for Hadoop. Data visualization is already in use, but advanced tools such as Hadoop expand business value by extracting meaningful insights from big data. Textual analytical reports, extensive data mining, and data visualizations are beneficial for decision-making processes.
Improved efficiency: Hadoop provides enhanced capabilities with less programming compared to conventional platforms. Hadoop eases the process of big data analytics, reduces operational costs, and quickens the time to market.
Expertise: A new technology often results in a shortage of skilled experts to implement big data projects. Advanced Hadoop tools integrate several big data services to help the enterprise evolve on the technological front. Emerging trends and best practices are being integrated with big data platforms to achieve the desired results.
scaling out
Scaling is of two types:

Vertical scaling: adding more resources to a single machine when the load increases. For example, if you need 20 GB of RAM but your server currently has 10 GB, you add extra RAM to the same server to meet the need.
Horizontal scaling, or scaling out: adding more machines to match the required resources. So if I already have a machine with 10 GB of RAM, I add an extra machine with 10 GB of RAM.

Architecture of Hadoop distributed file system (HDFS)


The Hadoop Distributed File System (HDFS) is the primary data storage system used
by Hadoop applications. HDFS employs a NameNode and DataNode architecture to implement a
distributed file system that provides high-performance access to data across highly scalable Hadoop
clusters.

HDFS uses a primary/secondary architecture. The HDFS cluster's NameNode is the primary server that
manages the file system namespace and controls client access to files. As the central component of the
Hadoop Distributed File System, the NameNode maintains and manages the file system namespace and
provides clients with the right access permissions. The system's DataNodes manage the storage that's
attached to the nodes they run on.
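As a minimal sketch of how a client interacts with this architecture, the snippet below uses the standard org.apache.hadoop.fs.FileSystem API to read a file from HDFS: the NameNode resolves the path into block locations, while the DataNodes actually serve the bytes. The NameNode URI and the file path are placeholders, not values from this document.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; normally taken from fs.defaultFS in core-site.xml.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // The client asks the NameNode for block locations, then streams the data from DataNodes.
        try (FSDataInputStream in = fs.open(new Path("/user/hduser/sampledata/users.txt"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}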
fault tolerance
Fault tolerance in Hadoop HDFS refers to the working strength of a system in unfavorable conditions and how the system handles such situations.
HDFS is highly fault-tolerant. Before Hadoop 3, it handled faults through the process of replica creation: it creates replicas of users' data on different machines in the HDFS cluster, so if any machine in the cluster goes down, the data is still accessible from other machines that hold a copy of the same data.
HDFS also maintains the replication factor by creating a replica of the data on other available machines in the cluster if one machine suddenly fails.
Hadoop 3 introduced Erasure Coding to provide fault tolerance. Erasure Coding in HDFS improves storage efficiency while providing the same level of fault tolerance and data durability as traditional replication-based HDFS deployments.

Before Hadoop 3, fault tolerance in Hadoop HDFS was achieved by creating replicas. HDFS creates replicas of each data block and stores them on multiple machines (DataNodes).
The number of replicas created depends on the replication factor (3 by default).
If any of the machines fails, the data block is accessible from another machine containing the same copy of the data. Hence there is no data loss, because replicas are stored on different machines.
Data Replication
HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as
a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are
replicated for fault tolerance. The block size and replication factor are configurable per file. An
application can specify the number of replicas of a file. The replication factor can be specified at file
creation time and can be changed later. Files in HDFS are write-once and have strictly one writer at any
time.
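Because the replication factor is configurable per file, it can be changed from client code as well as from the dfs.replication configuration property. The sketch below, using a hypothetical file path, changes the replication factor of an existing HDFS file with FileSystem.setReplication:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/hduser/sampledata/users.txt"); // hypothetical file
        // Change the replication factor of this file from the default (3) to 2.
        boolean changed = fs.setReplication(file, (short) 2);
        System.out.println("Replication changed: " + changed);
    }
}

The same change can be made from the command line with the hdfs dfs -setrep command.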

High availability
High Availability was a new feature added in Hadoop 2.x to solve the single-point-of-failure problem in older versions of Hadoop.
Hadoop HDFS follows a master-slave architecture in which the NameNode is the master node that maintains the filesystem tree, so HDFS cannot be used without the NameNode. This makes the NameNode a single point of failure and a bottleneck. The HDFS High Availability feature addresses this issue.

Data Locality in Hadoop
In Hadoop, data locality is the process of moving the computation close to where the actual data resides on a node, instead of moving large volumes of data to the computation. This minimizes network congestion and increases the overall throughput of the system. In Hadoop MapReduce, data locality can be node-local, rack-local or off-rack, depending on whether a map task runs on the same node, the same rack, or a different rack from the node holding its data block.

MapReduce Architecture
MapReduce
MapReduce and HDFS are the two major components of Hadoop which make it so powerful and efficient to use. MapReduce is a programming model used for efficient parallel processing over large data sets in a distributed manner. The data is first split and then combined to produce the final result.
Libraries for MapReduce have been written in many programming languages, with various optimizations. The purpose of MapReduce in Hadoop is to map each job and then reduce it to equivalent tasks, which lowers the overhead on the cluster network and reduces the required processing power. A MapReduce job is mainly divided into two phases: the Map phase and the Reduce phase (a word-count example is sketched after the architecture components below).

MapReduce Architecture:

Components of MapReduce Architecture:
1. Client: The MapReduce client is the one who brings the job to MapReduce for processing. There can be multiple clients that continuously send jobs for processing to the Hadoop MapReduce manager.
2. Job: The MapReduce job is the actual work the client wants to do, which is composed of many smaller tasks that the client wants to process or execute.
3. Hadoop MapReduce Master: It divides the particular job into subsequent job-parts.
4. Job-Parts: The tasks or sub-jobs obtained after dividing the main job. The results of all the job-parts are combined to produce the final output.
5. Input Data: The data set that is fed to MapReduce for processing.
6. Output Data: The final result obtained after processing.
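To make the Map and Reduce phases concrete, here is the classic word-count job written against the org.apache.hadoop.mapreduce API: the mapper emits a (word, 1) pair for each word, and the reducer sums the counts per word. The input and output paths are taken from the command line; this is a sketch of the standard pattern rather than anything specific to this course's examples.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: split each input line into words and emit (word, 1).
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}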

Process Flow
A process flow is a composition of one or more activities. It is written in a DSL script that contains all the
activities that make a data flow from source to destination complete.
A process flow is a generic concept and is not limited to BDI. However, all the out-of-the-box process flows are for data transfers from a retail application to one or more retail applications.
A process flow encapsulates a sequence of activities. An activity can be synchronous or asynchronous. In
BDI some of these activities are invocations of batch jobs.
Figure 5-1 Process Flow

Java Interface
An interface in Java is a blueprint of a class. It has static constants and abstract methods.
The interface in Java is a mechanism to achieve abstraction. There can be only abstract methods in a Java interface, not a method body. It is used to achieve abstraction and multiple inheritance in Java.
In other words, you can say that interfaces can have abstract methods and variables, but cannot have a method body.
A Java interface also represents the IS-A relationship.
Like an abstract class, an interface cannot be instantiated.
Since Java 8, we can have default and static methods in an interface.
Since Java 9, we can have private methods in an interface.
Why use Java interface?
There are mainly three reasons to use interface. They are given below.
o It is used to achieve abstraction.
o By interface, we can support the functionality of multiple inheritance.
o It can be used to achieve loose coupling.

How to declare an interface?
An interface is declared by using the interface keyword. It provides total abstraction; that means all the methods in an interface are declared with an empty body, and all the fields are public, static and final by default. A class that implements an interface must implement all the methods declared in the interface.
Syntax:
interface <interface_name> {
    // declare constant fields
    // declare methods that are abstract by default
}
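A minimal example tying the points above together: the interface declares a constant and an abstract method (plus a default method, allowed since Java 8), and a class implements it. The names Drawable, Circle and InterfaceDemo are illustrative only.

interface Drawable {
    int MAX_SIZE = 100;          // public static final by default

    void draw();                 // public abstract by default

    // Since Java 8, an interface may also contain default methods with a body.
    default void describe() {
        System.out.println("A drawable shape, max size " + MAX_SIZE);
    }
}

class Circle implements Drawable {
    @Override
    public void draw() {
        System.out.println("Drawing a circle");
    }
}

public class InterfaceDemo {
    public static void main(String[] args) {
        Drawable d = new Circle();   // programming to the interface gives loose coupling
        d.draw();
        d.describe();
    }
}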
Data flow in Java
Local data flow
Local data flow is data flow within a single method or callable. Local data flow is usually easier, faster,
and more precise than global data flow, and is sufficient for many queries.
Using local data flow
The local data flow library is in the module DataFlow, which defines the class Node denoting any
element that data can flow through. Nodes are divided into expression nodes (ExprNode) and parameter
nodes (ParameterNode). You can map between data flow nodes and expressions/parameters using the
member predicates asExpr and asParameter:
class Node {
  /** Gets the expression corresponding to this node, if any. */
  Expr asExpr() { ... }

  /** Gets the parameter corresponding to this node, if any. */
  Parameter asParameter() { ... }

  ...
}
Global data flow
Global data flow tracks data flow throughout the entire program, and is therefore more powerful than
local data flow. However, global data flow is less precise than local data flow, and the analysis typically
requires significantly more time and memory to perform.
Using global data flow
You use the global data flow library by extending the class DataFlow::Configuration:
import semmle.code.java.dataflow.DataFlow

class MyDataFlowConfiguration extends DataFlow::Configuration {
  MyDataFlowConfiguration() { this = "MyDataFlowConfiguration" }

  override predicate isSource(DataFlow::Node source) {
    ...
  }

  override predicate isSink(DataFlow::Node sink) {
    ...
  }
}

These predicates are defined in the configuration:


 isSource—defines where data may flow from
 isSink—defines where data may flow to
 isBarrier—optional, restricts the data flow
 isAdditionalFlowStep—optional, adds additional flow steps
Hadoop I/O
Hadoop comes with a set of primitives for data I/O. Some of these are techniques that are more general
than Hadoop, such as data integrity and compression, but deserve special consideration when dealing
with multiterabyte datasets. Others are Hadoop tools or APIs that form the building blocks for
developing distributed systems, such as serialization frameworks and on-disk data structures.
Data Integrity
Users of Hadoop rightly expect that no data will be lost or corrupted during storage or processing.
However, since every I/O operation on the disk or network carries with it a small chance of introducing
errors into the data that it is reading or writing, when the volumes of data flowing through the system
are as large as the ones Hadoop is capable of handling, the chance of data corruption occurring is high.
Data Integrity in HDFS

HDFS transparently checksums all data written to it and by default verifies checksums when reading
data. A separate checksum is created for every io.bytes.per.checksum bytes of data. The default is 512
bytes, and since a CRC-32 checksum is 4 bytes long, the storage overhead is less than 1%.
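The 4-byte CRC per 512 data bytes works out to roughly 0.8% overhead. The checksum behaviour is also visible through the FileSystem API: verification happens automatically on read, and can be switched off, for example to salvage what is readable from a possibly corrupt file. A brief sketch, with a placeholder path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ChecksumDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/hduser/sampledata/users.txt"); // placeholder path

        // Default: checksums are verified while reading; a mismatch raises a ChecksumException.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }

        // Disable verification, e.g. to read whatever is recoverable from a corrupt file.
        fs.setVerifyChecksum(false);
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}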
Compression
File compression brings two major benefits: it reduces the space needed to store files, and it speeds up
data transfer across the network, or to or from disk. When dealing with large volumes of data, both of
these savings can be significant, so it pays to carefully consider how to use compression in
Hadoop. There are many different compression formats, tools and algorithms, each with different characteristics.
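As a small illustration, the org.apache.hadoop.io.compress API wraps any output stream with a codec. The sketch below gzips standard input to standard output using GzipCodec; any of the other codecs Hadoop ships with could be substituted.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class StreamCompressor {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Instantiate the codec; ReflectionUtils injects the Configuration it needs.
        CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

        // Wrap stdout in a compressing stream and copy stdin through it.
        CompressionOutputStream out = codec.createOutputStream(System.out);
        IOUtils.copyBytes(System.in, out, 4096, false);
        out.finish();   // flush compressed data without closing the underlying stream
    }
}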
Serialization
Serialization is the process of turning structured objects into a byte stream for transmission over a
network or for writing to persistent storage. Deserialization is the reverse process of turning a byte
stream back into a series of structured objects.
Serialization appears in two quite distinct areas of distributed data processing: for interprocess
communication and for persistent storage.
In Hadoop, interprocess communication between nodes in the system is implemented using remote
procedure calls (RPCs). The RPC protocol uses serialization to render the message into a binary stream to
be sent to the remote node, which then deserializes the binary stream into the original message. In
general, it is desirable that an RPC serialization format is:
Compact
A compact format makes the best use of network bandwidth, which is the most scarce resource
in a data center.
Fast
Interprocess communication forms the backbone for a distributed system, so it is essential that
there is as little performance overhead as possible for the serialization and deserialization
process.
Extensible
Protocols change over time to meet new requirements, so it should be straightforward to evolve
the protocol in a controlled manner for clients and servers. For example, it should be possible to
add a new argument to a method call, and have the new servers accept messages in the old
format (without the new argument) from old clients.
Interoperable

For some systems, it is desirable to be able to support clients that are written in different
languages to the server, so the format needs to be designed to make this possible.
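Hadoop's own serialization format, used for RPC and for MapReduce keys and values, is Writable. The short sketch below serializes an IntWritable to a byte array and reads it back, which is essentially what happens when a value crosses the network or is written to a SequenceFile. The value 163 is arbitrary.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.util.Arrays;

import org.apache.hadoop.io.IntWritable;

public class WritableDemo {
    public static void main(String[] args) throws Exception {
        // Serialize: Writable.write(DataOutput) turns the object into bytes.
        IntWritable writable = new IntWritable(163);
        ByteArrayOutputStream bytesOut = new ByteArrayOutputStream();
        try (DataOutputStream dataOut = new DataOutputStream(bytesOut)) {
            writable.write(dataOut);
        }
        byte[] bytes = bytesOut.toByteArray();
        System.out.println("Serialized form: " + Arrays.toString(bytes)); // 4 bytes for an int

        // Deserialize: readFields(DataInput) rebuilds the object from the byte stream.
        IntWritable copy = new IntWritable();
        try (DataInputStream dataIn = new DataInputStream(new ByteArrayInputStream(bytes))) {
            copy.readFields(dataIn);
        }
        System.out.println("Deserialized value: " + copy.get());
    }
}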
Introduction to Hive
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of
Hadoop to summarize Big Data, and makes querying and analyzing easy.
Hive was initially developed by Facebook; later the Apache Software Foundation took it up and developed it further as open source under the name Apache Hive. It is used by many different companies. For example, Amazon uses it in Amazon Elastic MapReduce.

Hive - Data Types


This chapter takes you through the different data types in Hive, which are involved in the table creation.
All the data types in Hive are classified into four types, given as follows:
 Column Types
 Literals
 Null Values
 Complex Types
Column Types
Column types are used as the column data types of Hive tables. They are as follows:
Integral Types
Integer type data can be specified using the integral data type INT. When the data range exceeds the range of INT, you need to use BIGINT, and if the data range is smaller than that of INT, you use SMALLINT. TINYINT is smaller than SMALLINT.
The following table depicts various INT data types:
Type       Postfix   Example
TINYINT    Y         10Y
SMALLINT   S         10S
INT        -         10
BIGINT     L         10L

String Types
String type data can be specified using single quotes (' ') or double quotes (" "). Hive provides two string data types, VARCHAR and CHAR, and follows C-style escape characters.
The following table depicts the string data types:
Data Type   Length
VARCHAR     1 to 65535
CHAR        255

Timestamp
It supports traditional UNIX timestamp with optional nanosecond precision. It supports
java.sql.Timestamp format “YYYY-MM-DD HH:MM:SS.fffffffff” and format “yyyy-mm-dd
hh:mm:ss.ffffffffff”.
Dates
DATE values are described in year/month/day format in the form YYYY-MM-DD.
Decimals
The DECIMAL type in Hive is the same as the Big Decimal format of Java. It is used for representing immutable arbitrary-precision values. The syntax and an example are as follows:
DECIMAL(precision, scale)
decimal(10,0)
Union Types
Union is a collection of heterogeneous data types. You can create an instance using create union. The
syntax and example is as follows:
UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>

{0:1}
{1:2.0}
{2:["three","four"]}
{3:{"a":5,"b":"five"}}
{2:["six","seven"]}
{3:{"a":8,"b":"eight"}}
{0:9}
{1:10.0}
Literals
The following literals are used in Hive:
Floating Point Types
Floating point types are nothing but numbers with decimal points. Generally, this type of data is
composed of DOUBLE data type.
Decimal Type
Decimal type data is nothing but a floating point value with a higher range than the DOUBLE data type. The range of the decimal type is approximately -10^-308 to 10^308.
Null Value

Missing values are represented by the special value NULL.
Complex Types
The Hive complex data types are as follows:
Arrays
Arrays in Hive are used the same way they are used in Java.
Syntax: ARRAY<data_type>
Maps
Maps in Hive are similar to Java Maps.
Syntax: MAP<primitive_type, data_type>
Structs
Structs in Hive is similar to using complex data with comment.
Syntax: STRUCT<col_name : data_type [COMMENT col_comment], ...>
Hive Different File Formats
Different file formats and compression codecs work better for different data sets in Apache Hive.
Following are the Apache Hive different file formats:
 Text File
 Sequence File
 RC File
 AVRO File
 ORC File
 Parquet File
Hive Text File Format
Hive Text file format is a default storage format. You can use the text format to interchange the data
with other client application. The text file format is very common most of the applications. Data is
stored in lines, with each line being a record. Each lines are terminated by a newline character (\n).
Create table textfile_table
(column_specs)
stored as textfile;

Hive Sequence File Format


Sequence files are Hadoop flat files that store data as binary key-value pairs. Sequence files are in a binary format and are splittable. One of the main advantages of using sequence files is the ability to merge two or more files into one file.
Create table sequencefile_table
(column_specs)
stored as sequencefile;

Hive RC File Format


RCFile is the Record Columnar File format. This is another Hive file format, one which offers high row-level compression rates. If you have a requirement to process multiple rows at a time, you can use the RCFile format.
RCFiles are very similar to the sequence file format; this file format also stores the data as key-value pairs.
Create table RCfile_table
(column_specs)
stored as rcfile;

Hive AVRO File Format


AVRO is an open source project that provides data serialization and data exchange services for Hadoop. You can exchange data between the Hadoop ecosystem and programs written in any programming language. Avro is one of the popular file formats in Big Data Hadoop-based applications.
Create an AVRO table by specifying the 'STORED AS AVRO' option at the end of a CREATE TABLE command.
Create table avro_table
(column_specs)
stored as avro;

Hive ORC File Format


ORC stands for Optimized Row Columnar file format. The ORC file format provides a highly efficient way to store data in Hive tables. This file format was designed to overcome limitations of the other Hive file formats. Using ORC files improves performance when Hive is reading, writing, and processing data from large tables.
Create table orc_table
(column_specs)
stored as orc;

Hive Parquet File Format


Parquet is a column-oriented binary file format. Parquet is highly efficient for large-scale queries, and is especially good for queries that scan particular columns within a table. Parquet tables use compression codecs such as Snappy and gzip; currently Snappy is the default.
Create table parquet_table
(column_specs)
stored as parquet;
HiveQL data definition
HiveQL is the Hive query language. Like all SQL dialects in widespread use, it doesn’t fully conform to any
particular revision of the ANSI SQL standard. It is perhaps closest to MySQL’s dialect, but with significant
differences. Hive offers no support for row-level inserts, updates, and deletes. Hive doesn’t support
transactions. Hive adds extensions to provide better performance in the context of Hadoop and to
integrate with custom extensions and even external programs.
HiveQL data manipulation
HiveQL Data Manipulation – Load, Insert, Export Data and Create Table
It is important to note that HiveQL data manipulation doesn't offer any row-level insert, update or delete operation. Therefore, data can be inserted into Hive tables either using "bulk" load operations or by writing the files into the correct directories by other methods.
HiveQL Load Data into Managed Tables
Loading data from input file (Schema on Read)
Hive>LOAD DATA LOCAL INPATH '/home/hduser/sampledata/users.txt'

OVERWRITE INTO TABLE users;


 ‘LOCAL’ indicates the source data is on local file system

 Local data will be copied into the final destination (HDFS file system) by Hive
 If ‘Local’ is not specified, the file is assumed to be on HDFS
 Hive does not do any data transformation while loading the data
Loading data into a partition requires the PARTITION clause:
Hive> LOAD DATA LOCAL INPATH '/home/hduser/sampledata/Employees.txt'
OVERWRITE INTO TABLE Employees
PARTITION (country = 'India', city = 'Delhi');


HDFS directory is created according to the partition values
Loading data from HDFS directory
Hive> LOAD DATA INPATH '/usr/hadoop/data' OVERWRITE INTO TABLE aliens;
 All the files in the directory are copied into Hive
 ‘OVERWRITE’ causes table to be purged and filled
 Leaving out ‘OVERWRITE’ adds data to existing folder (old data will exist under its name and new one
under a different name)
HiveQL Insert Data into Hive Tables from Queries
Hive> INSERT OVERWRITE TABLE Employee
PARTITION (country = 'IN', state = 'KA')
SELECT * FROM emp_stage ese
WHERE ese.country = 'IN' AND ese.state = 'KA';
Create table and load them from Hive Queries
Hive> CREATE TABLE Employees

AS SELECT eno,ename,sal,address

FROM emp

WHERE country = 'IN';
Exporting Data out of Hive
If LOCAL keyword is used, Hive will write the data to local directory
Hive>INSERT OVERWRITE LOCAL DIRECTORY '/home/hadoop/data'

SELECT name, age

FROM aliens

WHERE date_sighted > '2014-09-15';


HiveQL - JOIN
The HiveQL Join clause is used to combine the data of two or more tables based on a related column
between them. The various types of HiveQL joins are:
o Inner Join

o Left Outer Join

o Right Outer Join

o Full Outer Join

Explanation of windowing functions in hive


The windowing feature is used to create a window over a set of data, in order to apply aggregations over it. These can be standard aggregations, namely COUNT(), SUM(), MIN(), MAX(), or AVG(), as well as analytical functions such as LEAD, LAG, FIRST_VALUE and LAST_VALUE.
 COUNT() function:
The COUNT() function counts the rows in the window; for example, counting the employees in each department.
 MIN function:
MIN function is used to compute the minimum of the rows in the column or expression and on rows
within the group.
 MAX function:
MAX function is used to compute the maximum of the rows in the column or expression and on rows
within the group.
 AVG function:
The AVG function returns the average of the elements in the group, or the average of the distinct values of the column in the group.
 LEAD function:

The LEAD function, lead(value_expr[,offset[,default]]), is used to return data from the next row. The number of rows to lead (offset) can optionally be specified; if it is not specified, the lead is one row by default. When the lead for the current row extends beyond the end of the window, the function returns the specified default, or NULL when no default is specified.

 LAG function:
The LAG function, lag(value_expr[,offset[,default]]), is used to access data from a previous row. The number of rows to lag (offset) can optionally be specified; if it is not specified, the lag is one row by default. When the lag for the current row extends beyond the start of the window, the function returns the specified default, or NULL when no default is specified.
 FIRST_VALUE function:
This function returns the value from the first row in the window (based on the ordering clause) and assigns it to all the rows of the same group; simply put, it returns the first result from an ordered set.
 LAST_VALUE function:
In reverse of FIRST_VALUE, it returns the value from the last row in the window (based on the ordering clause) and assigns it to all the rows of the same group; simply put, it returns the last result from an ordered set. A query combining several of these functions is sketched below.
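The sketch below shows what these functions look like in a query, issued here through the standard Hive JDBC driver so the example stays in Java. The employee table and its columns (deptno, ename, sal) are hypothetical, and the connection URL assumes a local HiveServer2 with the default port and a placeholder user.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class WindowingDemo {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hduser", "");
             Statement stmt = con.createStatement()) {

            // COUNT() over a department window, plus LAG/LEAD/FIRST_VALUE over salaries
            // ordered within the same department.
            String sql =
                "SELECT deptno, ename, sal, "
                + "COUNT(*) OVER (PARTITION BY deptno) AS dept_size, "
                + "LAG(sal) OVER (PARTITION BY deptno ORDER BY sal) AS prev_sal, "
                + "LEAD(sal) OVER (PARTITION BY deptno ORDER BY sal) AS next_sal, "
                + "FIRST_VALUE(sal) OVER (PARTITION BY deptno ORDER BY sal) AS lowest_sal "
                + "FROM employee";

            try (ResultSet rs = stmt.executeQuery(sql)) {
                while (rs.next()) {
                    System.out.println(rs.getString("ename")
                            + " dept_size=" + rs.getInt("dept_size")
                            + " prev_sal=" + rs.getObject("prev_sal")
                            + " next_sal=" + rs.getObject("next_sal"));
                }
            }
        }
    }
}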

How do you optimize a Hive query?


Performance tuning is key to optimizing a Hive query. First, tweak your data through partitioning, bucketing, compression, and so on. Improving the execution of a Hive query is another optimization technique; you can do this by using Tez, avoiding skew, and increasing parallel execution. Lastly, sampling and unit testing can help optimize a query by allowing you to see (and solve) problems on a smaller scale first.
Partitioning Tables:
Hive partitioning is an effective method to improve query performance on larger tables. Partitioning allows you to store data in separate sub-directories under the table location. It dramatically helps queries that filter on the partition key(s). The selection of the partition key is always an important decision: it should always be a low-cardinality attribute. For example, if your data is associated with the time dimension, then the date would be a good partition key. Similarly, if data is associated with location, like a country or state, it is a good idea to have hierarchical partitions like country/state.
Bucketing
Bucketing provides flexibility to further segregate the data into more manageable sections called buckets or clusters. Bucketing is based on a hash function, which depends on the type of the bucketing column. Records that share the same value of the bucketing column will always be saved in the same bucket. The CLUSTERED BY clause is used to divide the table into buckets. It works well for columns having high cardinality. A combined example of partitioning and bucketing is sketched below.
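A sketch of both techniques together, again through the Hive JDBC driver to keep the example in Java: the table is partitioned by a low-cardinality country column and bucketed (CLUSTERED BY) on a high-cardinality id column. The table and column names, bucket count, and connection details are illustrative only.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class PartitionBucketDemo {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hduser", "");
             Statement stmt = con.createStatement()) {

            // Low-cardinality column (country) as the partition key,
            // high-cardinality column (customer_id) as the bucketing key.
            stmt.execute(
                "CREATE TABLE IF NOT EXISTS sales ( "
                + "  customer_id BIGINT, "
                + "  amount      DOUBLE, "
                + "  sale_date   DATE "
                + ") "
                + "PARTITIONED BY (country STRING) "
                + "CLUSTERED BY (customer_id) INTO 32 BUCKETS "
                + "STORED AS ORC");
        }
    }
}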


Hive Indexing
Hive Index – indexing is one of the best Hive optimization techniques. Indexing will definitely help to increase your query performance. Basically, indexing the original table creates a separate index table which acts as a reference.
As we know, a Hive table can have a great many rows and columns. Without indexing, it will take a large amount of time to perform queries that touch only some columns, because the query has to scan the whole table.

Optimize your joins

If you are using joins to fetch results, it's time to revise them. If you have large data in the tables, then it is not advisable to just use the normal joins we use in SQL. There are many other joins, like map joins and bucket joins, which can be used to improve Hive query performance.


 Use Map Join
Map join is highly beneficial when one table is small enough to fit into memory. Hive has a property which, when enabled, performs an auto map join: set the hive.auto.convert.join parameter to true to enable it.
 Use Skew Join
Skew join is also helpful when your table is skewed. Set the hive.optimize.skewjoin property to true to enable skew join.
 Bucketed Map Join
If tables are bucketed by a particular column, you can use a bucketed map join to improve Hive query performance. A sketch of enabling these join optimizations is shown below.
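A brief sketch of turning these join optimizations on for a session, issued here through Hive JDBC; the same SET commands work directly in the Hive CLI or Beeline. The property names are the standard Hive ones mentioned above; the joined tables (emp, dept) and connection details are hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class JoinTuningDemo {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hduser", "");
             Statement stmt = con.createStatement()) {

            stmt.execute("SET hive.auto.convert.join=true");      // auto map join for small tables
            stmt.execute("SET hive.optimize.skewjoin=true");      // skew join for skewed keys
            stmt.execute("SET hive.optimize.bucketmapjoin=true"); // bucketed map join for bucketed tables

            // The optimizer can now pick a map-side join for this query if dept is small enough.
            stmt.execute(
                "SELECT e.ename, d.dname "
                + "FROM emp e JOIN dept d ON e.deptno = d.deptno");
        }
    }
}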
