Unit 6-1: Big Data
Introduction to Big Data and Hadoop:
• Definition of Big Data: Big Data refers to large and complex datasets that
exceed the processing capabilities of traditional database systems.
• Challenges of Big Data: Volume, Velocity, Variety, and Veracity.
• Hadoop is an open-source framework for distributed storage and
processing of large-scale datasets across clusters of commodity hardware.
Hadoop Ecosystem Components:
• Hive:
• Overview: Hive is a data warehouse infrastructure built on top of Hadoop for
querying and analyzing large datasets using a SQL-like language called HiveQL
(a short example appears after this list).
• Use Cases: Data warehousing, ad-hoc querying, analytics.
• Pig:
• Overview: Pig is a high-level data flow scripting language and execution
framework for analyzing large datasets on Hadoop.
• Use Cases: Data transformation, ETL (Extract, Transform, Load), data
processing pipelines.
• Spark:
• Overview: Apache Spark is a fast and general-purpose cluster computing
system for big data processing. It provides in-memory computation and
supports multiple programming languages.
• Use Cases: Batch processing, real-time stream processing, machine learning,
graph processing.
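To make the HiveQL mention above concrete, here is a minimal sketch of a HiveQL
query. The table and column names (web_logs, view_date, status_code) are
hypothetical, not defined anywhere in these notes:

  -- Count successful page views per day from a hypothetical log table.
  SELECT view_date, COUNT(*) AS page_views
  FROM web_logs
  WHERE status_code = 200
  GROUP BY view_date
  ORDER BY view_date;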
Hive Architecture:
• Hive Client:
• The Hive client is the interface through which users interact with the Hive
system. It can be a command-line interface (CLI), web-based interface (Hue),
or JDBC/ODBC-based client application.
• Users submit HiveQL queries to the Hive client for processing.
• Hive Driver:
• The Hive Driver receives queries from the Hive client and coordinates their
execution within the Hive system.
• It parses, compiles, optimizes, and executes HiveQL queries, generating an
execution plan that specifies the sequence of tasks required to execute the
query.
• Hive Compiler:
• The Hive Compiler is responsible for translating HiveQL queries into a series of
MapReduce, Tez, or Spark jobs for execution on the Hadoop cluster.
• It generates an execution plan (also known as the query plan) based on the
query semantics and optimization rules.
• Hive Metastore:
• The Hive Metastore is a centralized repository that stores metadata about
Hive tables, partitions, columns, storage location, and other schema-related
information.
• It maintains a mapping between logical Hive tables and their corresponding
physical storage locations in HDFS or other storage systems.
• The Metastore is typically backed by a relational database such as MySQL,
PostgreSQL, or Derby.
• Hive Server:
• The Hive Server provides a Thrift or JDBC/ODBC interface for external clients
to submit HiveQL queries and interact with the Hive system.
• It manages connections from multiple clients and coordinates query
execution across the Hadoop cluster.
• Hadoop Distributed File System (HDFS):
• HDFS is the primary storage system used by Hive for storing large volumes of
data in a distributed fashion across nodes in the Hadoop cluster.
• Hive tables are typically stored as files in HDFS, with each table represented
as a directory containing one or more data files.
• Execution Engine (MapReduce, Tez, Spark):
• The Execution Engine is responsible for executing the tasks generated by the
Hive Compiler on the Hadoop cluster.
• Hive supports different execution engines, including MapReduce (default),
Tez, and Spark, depending on the underlying data processing framework and
configuration.
• The execution engine distributes tasks across nodes in the cluster, manages
task execution, and aggregates results for query processing.
• Hive UDFs (User-Defined Functions):
• Hive UDFs are custom functions developed by users to extend the
functionality of Hive and perform specialized data processing tasks.
• UDFs can be implemented in Java, Scala, Python, or other programming
languages and registered with Hive for use in HiveQL queries (see the sketch
after the summary below).
• In summary, the Hive architecture consists of components such as the
client, driver, compiler, metastore, server, storage system (HDFS),
execution engine, and UDFs, working together to enable querying,
analyzing, and processing large datasets stored in Hadoop.
• Each component plays a specific role in the Hive ecosystem,
facilitating the execution of HiveQL queries and providing a scalable
and efficient platform for Big Data analytics.
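As a rough illustration of how the client, compiler, execution engine, and UDFs
meet in practice, the following HiveQL sketch registers a hypothetical Java UDF
and asks the compiler for its execution plan. The jar path, class name,
function name, and the students table are all assumptions for illustration:

  -- Register a hypothetical Java UDF packaged in my_udfs.jar.
  ADD JAR /tmp/my_udfs.jar;
  CREATE TEMPORARY FUNCTION normalize_name AS 'com.example.hive.NormalizeName';

  -- EXPLAIN shows the plan the compiler generates for the execution
  -- engine (MapReduce, Tez, or Spark) without running the query.
  EXPLAIN
  SELECT normalize_name(student_name) FROM students;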
Data types in Hive
• In Apache Hive, data types define the type of values that can be
stored in columns of Hive tables.
• Hive supports a wide range of primitive and complex data types to
accommodate various data formats and use cases.
• Here are the common data types supported by Hive:
Primitive Data Types:
• Numeric: TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, DECIMAL.
• String: STRING, VARCHAR, CHAR.
• Date/time: TIMESTAMP, DATE, and INTERVAL (time spans).
• Other: BOOLEAN, BINARY.
Complex (Collection) Data Types:
• ARRAY, MAP, STRUCT, and UNIONTYPE, which nest multiple values inside a single
column.
Custom and Semi-Structured Types:
• Hive allows users to define custom data types using SerDe (Serializer/Deserializer) libraries
and use them in Hive tables.
• JSON: Hive supports storing and querying data in JSON format, typically held in a STRING
column and accessed with JSON functions or a JSON SerDe.
• These data types provide flexibility and versatility for storing and processing various types of
data in Hive tables.
• Users can choose appropriate data types based on the nature of the data, storage
requirements, and query processing needs.
• Additionally, Hive provides typecasting functions to convert data between different types
when necessary.
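A minimal sketch tying these together, assuming a hypothetical
employee_profiles table: it mixes primitive and complex types, keeps a JSON
document in a STRING column, and converts between types with CAST:

  -- Hypothetical table mixing primitive, complex, and JSON-as-STRING columns.
  CREATE TABLE employee_profiles (
    emp_id   INT,
    name     STRING,
    salary   DECIMAL(10,2),
    skills   ARRAY<STRING>,                    -- collection type
    phone    MAP<STRING, STRING>,              -- e.g. 'home' -> number
    address  STRUCT<city:STRING, zip:STRING>,
    raw_json STRING                            -- JSON document kept as text
  );

  -- Typecasting with CAST, and JSON access with get_json_object.
  SELECT CAST(salary AS DOUBLE),
         get_json_object(raw_json, '$.department')
  FROM employee_profiles;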
Database Operations in Hive (refer to slide 93 and onward)
• https://www.geeksforgeeks.org/database-operations-in-hive-using-cloudera-vmware-work-station/
• https://www.tutorialspoint.com/hive/hive_introduction.htm
• https://www.simplilearn.com/tutorials/hadoop-tutorial/hive
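The linked tutorials walk through these operations in full; as a quick sketch,
the usual database-level HiveQL commands look like this (the database name
college is hypothetical):

  CREATE DATABASE IF NOT EXISTS college;    -- create a database
  SHOW DATABASES;                           -- list available databases
  USE college;                              -- make it the current database
  DESCRIBE DATABASE college;                -- show its properties
  DROP DATABASE IF EXISTS college CASCADE;  -- drop it, tables included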
Partitioning in Hive
• Partitioning in Hive means dividing a table into parts based on the values of
a particular column, such as date, course, city, or country.
• The advantage of partitioning is that, since the data is stored in slices, a
query that filters on the partition column only scans the relevant slices, so
response time improves.
• Because Hadoop is used to handle huge amounts of data, it is important to lay
that data out well, and partitioning in Hive is a prime example of this.
• For example, assume we have data on 10 million students studying at an
institute, and we have to fetch the students of a particular course.
• With an unpartitioned table, the query must scan the entire dataset, which
degrades performance.
• In such a case, the better approach is to partition the table in Hive,
dividing the data into separate datasets based on particular columns.
Partitioning in Hive can be performed in two ways:
• Static partitioning
• Dynamic partitioning
• Static Partitioning
• In static or manual partitioning, it is required to pass the values of
partitioned columns manually while loading the data into the table.
Hence, the data file doesn't contain the partitioned columns.
• Dynamic Partitioning
• In dynamic partitioning, the values of partitioned columns exist within
the table. So, it is not required to pass the values of partitioned
columns manually.
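A minimal sketch of both styles, assuming a students table partitioned by
course and a hypothetical staging table and file path:

  -- 'course' is a partition column: stored as HDFS subdirectories,
  -- not inside the data files themselves.
  CREATE TABLE students (id INT, name STRING)
  PARTITIONED BY (course STRING);

  -- Static partitioning: the partition value is passed manually.
  LOAD DATA LOCAL INPATH '/tmp/java_students.txt'
  INTO TABLE students PARTITION (course = 'java');

  -- Dynamic partitioning: Hive reads the course value from the data;
  -- the partition column comes last in the SELECT list.
  SET hive.exec.dynamic.partition = true;
  SET hive.exec.dynamic.partition.mode = nonstrict;
  INSERT INTO TABLE students PARTITION (course)
  SELECT id, name, course FROM students_staging;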
HBase
Limitations of Hadoop
• Hadoop performs only batch processing, and data is accessed only in a
sequential manner.
• That means the entire dataset must be scanned even for the simplest of jobs.
• At this point, a new solution is needed that can access any point of the data
in a single unit of time (random access).
Hadoop Random Access Databases
• Applications such as HBase, Cassandra, and MongoDB store huge amounts of data
and access it in a random manner.
What Hive Is Not
• A relational database
• A design for OnLine Transaction Processing (OLTP)
• A language for real-time queries and row-level updates
Hive Data Type Categories
• All the data types in Hive are classified into four categories, given as
follows:
1. Column Types
2. Literals
3. Null Values
4. Complex Types
Column Types
1. Integral Types
• Integer data is specified using the integral data types; INT is the
default. When the data range exceeds the range of INT, use BIGINT, and
if the data range is smaller than INT, use SMALLINT. TINYINT is smaller
than SMALLINT.
2. String Types
• String literals can be specified using single quotes (' ') or double
quotes (" ").
• Besides STRING itself, the string family contains two bounded types:
VARCHAR and CHAR.
3. Timestamp
• It supports the traditional UNIX timestamp with optional nanosecond
precision.
• It supports java.sql.Timestamp values written in the string format
"yyyy-mm-dd hh:mm:ss.fffffffff".
4. Dates
• DATE values are described in year/month/day format, in the form
YYYY-MM-DD.
5. Decimals
• The DECIMAL type in Hive is the same as Java's BigDecimal format. It
is used for representing immutable arbitrary-precision decimal numbers.
The syntax and an example are as follows:
• DECIMAL(precision, scale), e.g. DECIMAL(10,0)
6. Union Types
• A union is a collection of heterogeneous data types; each value holds
exactly one of the declared types. An instance is created with the
create_union function. The syntax and an example are as follows:
• UNIONTYPE&lt;data_type, data_type, ...&gt;, e.g.
UNIONTYPE&lt;INT, DOUBLE, ARRAY&lt;STRING&gt;&gt;
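A sketch of a table declaring one column of each type discussed above (the
table name type_demo is hypothetical):

  CREATE TABLE type_demo (
    tiny_col  TINYINT,
    small_col SMALLINT,
    int_col   INT,
    big_col   BIGINT,
    str_col   STRING,
    vchar_col VARCHAR(50),
    char_col  CHAR(10),
    ts_col    TIMESTAMP,    -- 'yyyy-mm-dd hh:mm:ss.fffffffff'
    date_col  DATE,         -- 'YYYY-MM-DD'
    dec_col   DECIMAL(10,0),
    union_col UNIONTYPE<INT, DOUBLE, STRING>
  );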
Literals
1. Floating Point Types
• Floating-point literals are numbers with decimal points; by default,
such literals are treated as the DOUBLE data type.
2. Decimal Type
• Decimal-type data is a floating-point value with a higher range than
the DOUBLE data type; the range of the decimal type is approximately
-10^-308 to 10^308.
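As a small illustration (assuming a Hive version that allows SELECT without a
FROM clause), a plain floating-point literal defaults to DOUBLE, while the BD
postfix marks a DECIMAL literal:

  SELECT 3.14   AS double_literal,   -- plain literal: DOUBLE
         3.14BD AS decimal_literal;  -- BD postfix: DECIMAL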
Null Value
• Missing values are represented by the special value NULL.