Unit 3 (Big Data Analytics)
Data formats
The HDFS file formats supported are JSON, Avro and Parquet. The format is specified by setting the storage
format value, which can be found on the Storage tab of the Data Store. For all HDFS files, the storage
format (JSON, Avro, Parquet) is defined in the data store. The JSON, Avro and Parquet formats support
complex data types, such as arrays and objects. During the Reverse Engineer phase, the schema definition for
these types is converted to Avro and stored in the data format column of the attribute with the
complex data type. This information is used when flattening the data in mappings.
For JSON, Avro and Parquet, each type requires the location of a schema file to be entered. For
Delimited, you need to specify the record and field separator information and the number of heading lines.
If you are loading Avro files into Hive, then you will need to copy the Avro Schema file (.avsc) into the
same HDFS location as the HDFS files.
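To make the role of the Avro schema concrete, here is a small, hedged Java sketch (not part of the ODI configuration described above) that uses the Apache Avro Java library, assumed to be on the classpath, to open an Avro data file and print the writer schema embedded in it. The file name employees.avro is hypothetical.

import java.io.File;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class ReadAvroSchema {
    public static void main(String[] args) throws Exception {
        // Avro container files carry the schema they were written with,
        // so a generic reader needs no separate .avsc file to decode records.
        File avroFile = new File("employees.avro");   // hypothetical local copy of the HDFS file
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<GenericRecord>(avroFile, new GenericDatumReader<GenericRecord>())) {
            System.out.println("Writer schema: " + reader.getSchema());
            for (GenericRecord record : reader) {
                System.out.println(record);           // each record follows the embedded schema
            }
        }
    }
}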
Separate KMs for each file format are not required. You can create just one or two KMs for each target
(a standard LKM and, where appropriate, a Direct Load LKM). The file can be either delimited or fixed
format. The new LKM HDFS File to Hive supports loading HDFS files into Hive; the file can be in
JSON, Avro, Parquet, Delimited or another format, including complex data.
Analyzing data with Hadoop
Big data is mostly generated from social media websites, sensors, devices, video/audio, networks, log
files and web, and much of it is generated in real time and on a very large scale. Big data analytics is the
process of examining this large amount of different data types, or big data, in an effort to uncover
hidden patterns, unknown correlations and other useful information.
Big data analysis allows market analysts, researchers and business users to develop deep insights from
the available data, resulting in numerous business advantages. Business users can make a precise
analysis of the data, and the key early indicators from this analysis can be decisive for the business.
Some of the exemplary use cases are as follows:
Whenever users browse travel portals or shopping sites, search for flights or hotels, or add a particular item to
their cart, ad-targeting companies can analyze this wide variety of data and activity and can
provide better recommendations to the user regarding offers, discounts and deals, based on the user's
browsing history and product history.
In the telecommunications space, if customers are moving from one service provider to another, then
by analyzing huge volumes of call data records, the various issues faced by the customers can be
unearthed. Issues could be as wide-ranging as a significant increase in call drops or network
congestion problems. By analyzing these issues, a telecom company can identify whether it needs
to place a new tower in a particular urban area or whether it needs to revise its marketing strategy for a
particular region where a new player has come up. In this way, customer churn can be proactively
minimized.
Benefits
Hadoop has gained immense popularity with the rise of Big Data platforms that are capable of managing
huge volumes of data.
Using Hadoop, an enterprise can realize benefits such as scalability, fault tolerance, high availability and
data locality, which are described in the sections below.
Vertical scaling (scaling up): adding more resources to a single machine as the load increases. For
example, if you need 20 GB of RAM but your server currently has 10 GB, you add extra RAM to the
same server to meet the need.
Horizontal scaling (scaling out): adding more machines to match the resources needed. So if one
machine already has 10 GB of RAM, you add an extra machine with 10 GB of RAM.
HDFS uses a primary/secondary architecture. The HDFS cluster's NameNode is the primary server that
manages the file system namespace and controls client access to files. As the central component of the
Hadoop Distributed File System, the NameNode maintains and manages the file system namespace and
provides clients with the right access permissions. The system's DataNodes manage the storage that's
attached to the nodes they run on.
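To see this division of labour in code, the hedged sketch below uses the standard Hadoop FileSystem API: the client contacts the NameNode for metadata (the directory listing and each file's replication), while the file blocks themselves are served by DataNodes. It assumes a Hadoop client configuration (core-site.xml/hdfs-site.xml) on the classpath; listing the root path is just an example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHdfsRoot {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up the cluster configuration
        FileSystem fs = FileSystem.get(conf);          // metadata operations go to the NameNode
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath() + "  replication=" + status.getReplication());
        }
        fs.close();
    }
}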
Fault tolerance
Fault tolerance in Hadoop HDFS refers to the working strength of a system in unfavorable conditions and
how the system handles such situations.
HDFS is highly fault-tolerant. Before Hadoop 3, it handled faults through replica creation. It
creates replicas of users' data on different machines in the HDFS cluster, so if any machine in
the cluster goes down, the data is still accessible from the other machines on which copies of that data
were created.
HDFS also maintains the replication factor by creating a replica of data on other available machines in
the cluster if suddenly one machine fails.
Hadoop 3 introduced Erasure Coding to provide Fault Tolerance. Erasure Coding in HDFS improves
storage efficiency while providing the same level of fault tolerance and data durability as traditional
replication-based HDFS deployment.
Before Hadoop 3, fault tolerance in Hadoop HDFS was achieved by creating replicas. HDFS creates a
replica of the data block and stores them on multiple machines (DataNode).
The number of replicas created depends on the replication factor (by default 3).
If any of the machines fails, the data block is accessible from another machine containing the same
copy of the data. Hence there is no data loss, because replicas are stored on different machines.
Data Replication
HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as
a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are
replicated for fault tolerance. The block size and replication factor are configurable per file. An
application can specify the number of replicas of a file. The replication factor can be specified at file
creation time and can be changed later. Files in HDFS are write-once and have strictly one writer at any
time.
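Because the replication factor is a per-file setting that can be changed after creation, here is a short hedged sketch using the Hadoop FileSystem API; the path /data/events.log is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChangeReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/events.log");               // hypothetical existing HDFS file
        boolean scheduled = fs.setReplication(file, (short) 3); // ask the NameNode for 3 replicas
        System.out.println("Replication change scheduled: " + scheduled);
        fs.close();
    }
}

The cluster-wide default comes from the dfs.replication property; the call above overrides it only for this one file.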
High availability
High Availability was a new feature added to Hadoop 2.x to solve the Single point of failure problem in
the older versions of Hadoop.
Hadoop HDFS follows a master-slave architecture in which the NameNode is the master node and
maintains the filesystem tree, so HDFS cannot be used without the NameNode. This makes the NameNode
a single point of failure and a bottleneck. The HDFS high availability feature addresses this issue.
Data Locality in Hadoop
In Hadoop, data locality is the process of moving the computation close to where the actual data resides
on a node, instead of moving large data to the computation. This minimizes network congestion and
increases the overall throughput of the system. The key aspects of data locality are its definition, how
Hadoop exploits it, why it is needed, the various types of data locality in Hadoop MapReduce, how it is
optimized, and its advantages.
MapReduce Architecture
MapReduce
MapReduce and HDFS are the two major components of Hadoop that make it so powerful and
efficient to use. MapReduce is a programming model used for efficient parallel processing over large
data sets in a distributed manner. The data is first split and then combined to produce the final result.
MapReduce libraries have been written in many programming languages, with various
optimizations. The purpose of MapReduce in Hadoop is to map each job and then reduce it to
equivalent tasks, which lowers overhead on the cluster network and reduces the processing power
required. A MapReduce task is mainly divided into two phases: the Map phase and the Reduce phase.
A minimal word-count sketch follows the component list below.
MapReduce Architecture:
Components of MapReduce Architecture:
1. Client: The MapReduce client is the one who brings the job to MapReduce for processing. There
   can be multiple clients that continuously send jobs for processing to the Hadoop
   MapReduce Master.
2. Job: The MapReduce job is the actual work that the client wants to do, which is comprised of
   many smaller tasks that the client wants to process or execute.
3. Hadoop MapReduce Master: It divides the particular job into subsequent job-parts.
4. Job-Parts: The tasks or sub-jobs that are obtained after dividing the main job. The results of all the
   job-parts are combined to produce the final output.
5. Input Data: The data set that is fed to MapReduce for processing.
6. Output Data: The final result obtained after processing.
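The classic word-count program makes these components concrete: the client submits a Job, the framework splits the input into job-parts, the map phase emits (word, 1) pairs, and the reduce phase combines them into the final counts. This is a standard, minimal sketch (class names and input/output paths are illustrative), not tied to any particular cluster.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Map phase: emit (word, 1) for every token in the input line.
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Reduce phase: sum the counts for each word.
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // combine locally to cut network traffic
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}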
Process Flow
A process flow is a composition of one or more activities. It is written in a DSL script that contains all the
activities that make a data flow from source to destination complete.
A process flow is a generic concept and is not limited to BDI. However, all the out-of-the-box process flows
are for data transfers from a retail application to one or more retail applications.
A process flow encapsulates a sequence of activities. An activity can be synchronous or asynchronous. In
BDI some of these activities are invocations of batch jobs.
Figure 5-1 Process Flow
Java Interface
An interface in Java is a blueprint of a class. It has static constants and abstract methods.
The interface in Java is a mechanism to achieve abstraction. An interface can contain only abstract
methods, not method bodies. It is used to achieve abstraction and multiple inheritance in Java.
In other words, interfaces can have abstract methods and variables, but they cannot have
method bodies.
A Java interface also represents the IS-A relationship.
Like an abstract class, an interface cannot be instantiated.
Since Java 8, we can have default and static methods in an interface.
Since Java 9, we can have private methods in an interface.
Why use Java interface?
There are mainly three reasons to use interface. They are given below.
o It is used to achieve abstraction.
o Through an interface, we can support the functionality of multiple inheritance.
o It can be used to achieve loose coupling.
How to declare an interface?
An interface is declared by using the interface keyword. It provides total abstraction; that is, all the
methods in an interface are declared with an empty body, and all the fields are public, static and final
by default. A class that implements an interface must implement all the methods declared in the
interface.
Syntax:
interface <interface_name> {
    // declare constant fields
    // declare methods, which are abstract
    // by default
}
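A short example (the names Printable and A6 are illustrative) showing a class implementing an interface:

// Printable declares the contract; A6 supplies the method body.
interface Printable {
    void print();                      // implicitly public and abstract
}

class A6 implements Printable {
    public void print() {
        System.out.println("Hello");
    }

    public static void main(String[] args) {
        Printable p = new A6();        // program to the interface type
        p.print();                     // prints: Hello
    }
}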
Data flow in Java
Local data flow
Local data flow is data flow within a single method or callable. Local data flow is usually easier, faster,
and more precise than global data flow, and is sufficient for many queries.
Using local data flow
The local data flow library is in the module DataFlow, which defines the class Node denoting any
element that data can flow through. Nodes are divided into expression nodes (ExprNode) and parameter
nodes (ParameterNode). You can map between data flow nodes and expressions/parameters using the
member predicates asExpr and asParameter:
class Node {
/** Gets the expression corresponding to this node, if any. */
Expr asExpr() { ... }
...
}
Global data flow
Global data flow tracks data flow throughout the entire program, and is therefore more powerful than
local data flow. However, global data flow is less precise than local data flow, and the analysis typically
requires significantly more time and memory to perform.
Using global data flow
You use the global data flow library by extending the class DataFlow::Configuration:
import semmle.code.java.dataflow.DataFlow
Data Integrity
HDFS transparently checksums all data written to it and by default verifies checksums when reading
data. A separate checksum is created for every io.bytes.per.checksum bytes of data. The default is 512
bytes, and since a CRC-32 checksum is 4 bytes long, the storage overhead is less than 1%.
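As a hedged illustration of this arithmetic, the snippet below computes a CRC-32 over one 512-byte chunk using the standard java.util.zip API (HDFS performs its checksumming internally; this only demonstrates the math): 4 bytes of checksum per 512 bytes of data is an overhead of roughly 0.8%.

import java.util.zip.CRC32;
import java.util.zip.Checksum;

public class ChecksumOverhead {
    public static void main(String[] args) {
        byte[] chunk = new byte[512];                 // one io.bytes.per.checksum chunk
        Checksum crc = new CRC32();
        crc.update(chunk, 0, chunk.length);
        long value = crc.getValue();                  // a 32-bit CRC, stored in 4 bytes
        System.out.printf("CRC-32 = %08x, overhead = %.2f%%%n", value, 4.0 / 512 * 100);
    }
}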
Compression
File compression brings two major benefits: it reduces the space needed to store files, and it speeds up
data transfer across the network, or to or from disk. When dealing with large volumes of data, both of
these savings can be significant, so it pays to carefully consider how to use compression in
Hadoop. There are many different compression formats, tools and algorithms, each with different
characteristics.
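A hedged sketch of the common Hadoop CompressionCodec usage pattern: it compresses standard input to standard output using the codec class named on the command line (for example org.apache.hadoop.io.compress.GzipCodec). It assumes the Hadoop client libraries are on the classpath and is meant only to show the shape of the API.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.util.ReflectionUtils;

public class StreamCompressor {
    public static void main(String[] args) throws Exception {
        String codecClassname = args[0];               // e.g. org.apache.hadoop.io.compress.GzipCodec
        Class<?> codecClass = Class.forName(codecClassname);
        Configuration conf = new Configuration();
        CompressionCodec codec =
            (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
        CompressionOutputStream out = codec.createOutputStream(System.out);
        IOUtils.copyBytes(System.in, out, 4096, false);  // stream stdin through the codec
        out.finish();                                    // flush the compressed trailer
    }
}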
Serialization
Serialization is the process of turning structured objects into a byte stream for transmission over a
network or for writing to persistent storage. Deserialization is the reverse process of turning a byte
stream back into a series of structured objects.
Serialization appears in two quite distinct areas of distributed data processing: for interprocess
communication and for persistent storage.
In Hadoop, interprocess communication between nodes in the system is implemented using remote
procedure calls (RPCs). The RPC protocol uses serialization to render the message into a binary stream to
be sent to the remote node, which then deserializes the binary stream into the original message. In
general, it is desirable that an RPC serialization format is:
Compact
A compact format makes the best use of network bandwidth, which is the most scarce resource
in a data center.
Fast
Interprocess communication forms the backbone for a distributed system, so it is essential that
there is as little performance overhead as possible for the serialization and deserialization
process.
Extensible
Protocols change over time to meet new requirements, so it should be straightforward to evolve
the protocol in a controlled manner for clients and servers. For example, it should be possible to
add a new argument to a method call, and have the new servers accept messages in the old
format (without the new argument) from old clients.
Interoperable
For some systems, it is desirable to be able to support clients that are written in different
languages to the server, so the format needs to be designed to make this possible.
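To make this concrete, the hedged helper below serializes a Hadoop Writable (Hadoop's own serialization interface, used throughout MapReduce) into a byte array; an IntWritable, for instance, serializes to exactly 4 bytes, which illustrates the compactness requirement. It assumes the Hadoop common library is on the classpath.

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;

public class WritableDemo {
    // Turn any Writable into its raw byte-stream form.
    public static byte[] serialize(Writable writable) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataOutputStream dataOut = new DataOutputStream(out);
        writable.write(dataOut);          // Writable.write(DataOutput) performs the serialization
        dataOut.close();
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        IntWritable value = new IntWritable(163);
        byte[] bytes = serialize(value);
        System.out.println("Serialized length: " + bytes.length + " bytes");  // prints 4
    }
}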
Introduction to Hive
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of
Hadoop to summarize Big Data, and makes querying and analyzing easy.
Initially Hive was developed by Facebook, later the Apache Software Foundation took it up and
developed it further as an open source under the name Apache Hive. It is used by different companies.
For example, Amazon uses it in Amazon Elastic MapReduce.
Integral Types
Data Type   Postfix   Example
TINYINT     Y         10Y
SMALLINT    S         10S
INT         -         10
BIGINT      L         10L
String Types
String type data can be specified using single quotes (' ') or double quotes (" "). Hive has two string
data types: VARCHAR and CHAR. Hive follows C-style escape characters.
The following table depicts the string data types:
Data Type   Length
VARCHAR     1 to 65535
CHAR        255
Timestamp
It supports traditional UNIX timestamp with optional nanosecond precision. It supports
java.sql.Timestamp format “YYYY-MM-DD HH:MM:SS.fffffffff” and format “yyyy-mm-dd
hh:mm:ss.ffffffffff”.
Dates
DATE values are described in year/month/day format in the form YYYY-MM-DD.
Decimals
The DECIMAL type in Hive is the same as the Big Decimal format of Java. It is used for representing
immutable arbitrary-precision decimal numbers. The syntax and an example are as follows:
DECIMAL(precision, scale)
decimal(10,0)
Union Types
Union is a collection of heterogeneous data types. You can create an instance using create_union. The
syntax and an example are as follows:
UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>
{0:1}
{1:2.0}
{2:["three","four"]}
{3:{"a":5,"b":"five"}}
{2:["six","seven"]}
{3:{"a":8,"b":"eight"}}
{0:9}
{1:10.0}
Literals
The following literals are used in Hive:
Floating Point Types
Floating-point types are numbers with decimal points. Generally, this type of data is of the DOUBLE
data type.
Decimal Type
Decimal type data is a floating-point value with a higher range than the DOUBLE data type. The
range of the decimal type is approximately -10^-308 to 10^308.
Null Value
Missing values are represented by the special value NULL.
Complex Types
The Hive complex data types are as follows:
Arrays
Arrays in Hive are used the same way they are used in Java.
Syntax: ARRAY<data_type>
Maps
Maps in Hive are similar to Java Maps.
Syntax: MAP<primitive_type, data_type>
Structs
Structs in Hive are similar to C structs: they group named fields, each of which can have an optional comment.
Syntax: STRUCT<col_name : data_type [COMMENT col_comment], ...>
Hive Different File Formats
Different file formats and compression codecs work better for different data sets in Apache Hive.
Following are the Apache Hive different file formats:
Text File
Sequence File
RC File
AVRO File
ORC File
Parquet File
Hive Text File Format
The Hive text file format is the default storage format. You can use the text format to interchange data
with other client applications. The text file format is very common in most applications. Data is
stored in lines, with each line being a record. Each line is terminated by a newline character (\n).
Create table textfile_table
(column_specs)
stored as textfile;
The same CREATE TABLE pattern applies to the other storage formats, for example:
Create table rcfile_table
(column_specs)
stored as rcfile;
Loading Data into Hive
Local data will be copied into the final destination (HDFS file system) by Hive
If ‘Local’ is not specified, the file is assumed to be on HDFS
Hive does not do any data transformation while loading the data
Loading data into partition requires PARTITION clause
Hive>LOAD DATA LOCAL INPATH '/home/hduser/sampledata/Employees.txt' OVERWRITE INTO TABLE
Employees;
Creating a table from a query (CTAS):
Hive>CREATE TABLE <new_table_name>
AS SELECT eno,ename,sal,address
FROM emp
WHERE country='IN';
Exporting Data out of Hive
If LOCAL keyword is used, Hive will write the data to local directory
Hive>INSERT OVERWRITE LOCAL DIRECTORY '/home/hadoop/data'
SELECT * FROM aliens;
Hive supports several types of joins, for example:
o Left Outer Join
Hive Analytic Functions
LEAD function:
The LEAD function, lead(value_expr[,offset[,default]]), is used to return data from the next row. The
number of rows to lead (offset) can optionally be specified. If the offset is not specified, the lead is one
row by default. It returns the default value, or null when no default is specified, if the lead for the
current row extends beyond the end of the window.
LAG function:
The LAG function, lag(value_expr[,offset[,default]]), is used to access data from a previous row. The
number of rows to lag (offset) can optionally be specified. If the offset is not specified, the lag is one
row by default. It returns the default value, or null when no default is specified, if the lag for the
current row extends beyond the scope of the window.
FIRST_VALUE function:
This function returns the value from the first row in the window, based on the window clause, and
assigns it to all the rows of the same group; simply put, it returns the first result from an ordered set.
LAST_VALUE function:
In contrast to FIRST_VALUE, it returns the value from the last row in the window, based on the window
clause, and assigns it to all the rows of the same group; simply put, it returns the last result from an ordered set.
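Here is a hedged end-to-end sketch of LEAD, LAG and FIRST_VALUE issued through the Hive JDBC driver. The HiveServer2 URL, the credentials and the sales(product, sale_date, amount) table are assumptions made only for illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class WindowFunctionDemo {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");        // Hive JDBC driver on the classpath
        String url = "jdbc:hive2://localhost:10000/default";     // hypothetical HiveServer2
        String query =
            "SELECT product, sale_date, amount, "
          + "       LAG(amount, 1, 0)   OVER (PARTITION BY product ORDER BY sale_date) AS prev_amount, "
          + "       LEAD(amount)        OVER (PARTITION BY product ORDER BY sale_date) AS next_amount, "
          + "       FIRST_VALUE(amount) OVER (PARTITION BY product ORDER BY sale_date) AS first_amount "
          + "FROM sales";
        try (Connection conn = DriverManager.getConnection(url, "hive", ""); // credentials depend on the cluster
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(query)) {
            while (rs.next()) {
                System.out.printf("%s %s %.2f prev=%.2f next=%.2f first=%.2f%n",
                        rs.getString(1), rs.getString(2), rs.getDouble(3),
                        rs.getDouble(4), rs.getDouble(5), rs.getDouble(6));
            }
        }
    }
}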
Bucketing in Hive
Bucketing divides table data into a number of buckets or clusters. Bucketing is based on a hash function,
which depends on the type of the bucketing column. Records that share the same bucketing-column value
will always be saved in the same bucket. The CLUSTERED BY clause is used to divide the table into
buckets. It works well for columns with high cardinality.
Indexing in Hive
The index data is stored in a separate table, called the index table, which acts as a reference.
As we know, a Hive table has many rows and columns. Without indexing, queries that touch only some
columns can take a large amount of time, because the query is executed against all the columns present
in the table.
If you are using joins to fetch results, it is time to revisit how you write them. If you have large data in
the tables, then it is not advisable to use only the normal joins we use in SQL. There are many other
joins, such as map join and bucket map join. Map join is highly beneficial when one table is small, so
that it can fit into memory. Hive has a property which, when enabled, performs an auto map join. Set the
following parameter to true to enable auto map join:
set hive.auto.convert.join=true;
Use Skew Join
Skew join is also helpful when your table is skewed. Set the hive.optimize.skewjoin property to true to
enable skew join optimization.
If tables are bucketed by a particular column, you can use a bucketed map join to improve Hive query
performance.
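The hedged JDBC sketch below ties these hints together: it creates a table bucketed on the join key with CLUSTERED BY and switches on the session properties discussed above. The server URL, table name and column names are illustrative; the property names are the standard Hive settings for auto map join, skew join and bucketed map join.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class JoinOptimizationSetup {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "hive", "");  // hypothetical HiveServer2
             Statement stmt = conn.createStatement()) {
            // Bucket the table on the join key so a bucketed map join is possible.
            stmt.execute("CREATE TABLE IF NOT EXISTS orders_bucketed "
                    + "(order_id INT, customer_id INT, amount DOUBLE) "
                    + "CLUSTERED BY (customer_id) INTO 32 BUCKETS STORED AS ORC");
            // Session-level switches discussed above.
            stmt.execute("SET hive.auto.convert.join=true");      // auto map join
            stmt.execute("SET hive.optimize.skewjoin=true");      // skew join
            stmt.execute("SET hive.optimize.bucketmapjoin=true"); // bucketed map join
        }
    }
}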