

UNIT 6 - Processing in Big Data
Introduction to Big Data and Hadoop:

• Definition of Big Data: Big Data refers to large and complex datasets that
exceed the processing capabilities of traditional database systems.
• Challenges of Big Data: Volume, Velocity, Variety, and Veracity.
• Hadoop is an open-source framework for distributed storage and
processing of large-scale datasets across clusters of commodity hardware.
Hadoop Architecture:

• Hadoop Distributed File System (HDFS):


• Overview: HDFS is the primary storage system used by Hadoop. It stores data
across multiple nodes in a cluster, providing high throughput and fault
tolerance.
• Components: NameNode (stores filesystem metadata), DataNode (stores data blocks),
Secondary NameNode (performs periodic checkpoints of the NameNode metadata; it is not a hot backup).
• MapReduce:
• Overview: MapReduce is a programming model and processing
engine for parallel data processing on Hadoop clusters.
• Phases: Map Phase (data processing), Shuffle and Sort Phase (data
exchange), Reduce Phase (aggregation).
• Programming Model: Map (key, value) → (key', value'); Reduce (key',
list of values') → (key'', value'') (a word-count sketch in Java appears after this list).
• YARN (Yet Another Resource Negotiator):
• Overview: YARN is a resource management layer in Hadoop that
manages resources and schedules tasks across the cluster.
• Components: ResourceManager (global resource allocation),
NodeManager (per-node resource management), ApplicationMaster
(per-application resource negotiation).
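• As an illustration of this model, a minimal sketch of the classic word-count job is shown below, written against the standard org.apache.hadoop.mapreduce API; the input and output paths come from the command line and are placeholders.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Classic word count: the Map phase emits (word, 1), the Reduce phase sums the counts per word.
public class WordCount {

  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);               // emit (key', value') = (word, 1)
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();                          // aggregate the list of values'
      }
      context.write(key, new IntWritable(sum));  // emit (key'', value'') = (word, total)
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

• The job is packaged as a JAR and submitted to the cluster, where YARN schedules the map and reduce tasks.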
Hadoop Ecosystem:

• Hive:
• Overview: Hive is a data warehouse infrastructure built on top of Hadoop for
querying and analyzing large datasets using a SQL-like language called HiveQL.
• Use Cases: Data warehousing, ad-hoc querying, analytics.
• Pig:
• Overview: Pig is a high-level data flow scripting language and execution
framework for analyzing large datasets on Hadoop.
• Use Cases: Data transformation, ETL (Extract, Transform, Load), data
processing pipelines.
• Spark:
• Overview: Apache Spark is a fast and general-purpose cluster computing
system for big data processing. It provides in-memory computation and
supports multiple programming languages.
• Use Cases: Batch processing, real-time stream processing, machine learning,
graph processing.
Hadoop Deployment and Administration:

• Deployment Options: On-premises, cloud (AWS, Azure, Google Cloud), hybrid.
• Cluster Configuration: Master nodes (NameNode, ResourceManager),
worker nodes (DataNode, NodeManager).
• Monitoring and Management: Hadoop ecosystem tools (Ambari,
Cloudera Manager, Apache Ranger) for cluster monitoring, resource
management, and security.
Best Practices and Optimization:

• Data Partitioning and Replication: Properly partition data to optimize
processing and ensure fault tolerance.
• Compression: Use compression techniques (e.g., Snappy, Gzip) to
reduce storage and improve processing efficiency.
• Tuning Parameters: Adjust Hadoop configuration parameters (e.g.,
memory allocation, block size) for optimal performance.
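• To make the compression and tuning points concrete, the sketch below sets a few common parameters through Hadoop's Configuration API; the values are illustrative only, and in practice they normally live in hdfs-site.xml and mapred-site.xml.

import org.apache.hadoop.conf.Configuration;

// Illustrative tuning values; adjust to the workload and cluster size.
public class TuningSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.setLong("dfs.blocksize", 256L * 1024 * 1024);        // 256 MB HDFS block size
    conf.setInt("dfs.replication", 3);                        // replication factor for fault tolerance
    conf.setBoolean("mapreduce.map.output.compress", true);   // compress intermediate map output
    conf.set("mapreduce.map.output.compress.codec",
             "org.apache.hadoop.io.compress.SnappyCodec");    // Snappy codec
    conf.setInt("mapreduce.map.memory.mb", 2048);             // memory per map task, in MB
    System.out.println("dfs.blocksize = " + conf.get("dfs.blocksize"));
  }
}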
Challenges and Future Directions:

• Scalability: Managing and processing increasingly larger datasets efficiently.
• Real-time Processing: Supporting real-time analytics and stream
processing.
• Integration with AI and ML: Incorporating machine learning and
artificial intelligence capabilities into the Hadoop ecosystem.
Conclusion:

• Hadoop revolutionized big data processing by providing scalable and
cost-effective solutions for storing and analyzing large datasets.
• Understanding Hadoop architecture, ecosystem components,
deployment options, and best practices is essential for building robust
big data solutions.
Ways to execute MapReduce

• MapReduce, the processing paradigm introduced by Hadoop, can be
executed in various ways, depending on the specific requirements,
preferences, and constraints of the organization.
Methods to execute MapReduce:

• Using Hadoop MapReduce Framework:


• Hadoop provides a built-in MapReduce framework for processing large
datasets across distributed clusters.
• Developers can write MapReduce programs using Java, the native language
for Hadoop MapReduce, and submit them to the Hadoop cluster for
execution.
• Hadoop manages job scheduling, task distribution, fault tolerance, and
resource management automatically, making it a convenient option for
executing MapReduce jobs.
• Apache Spark:
• Apache Spark is a fast and general-purpose cluster computing system that
provides an alternative to Hadoop MapReduce for processing large-scale data.
• Spark offers a more flexible and efficient execution engine compared to
MapReduce, with support for in-memory processing, iterative algorithms, and
interactive analytics.
• Developers can write MapReduce-style jobs using Spark's RDD (Resilient
Distributed Dataset) API in languages such as Java, Scala, Python, and R
(a short Java sketch appears after this list).
• Spark can run on top of Hadoop YARN or standalone mode, providing
compatibility with existing Hadoop clusters or standalone deployments.
• Apache Flink:
• Apache Flink is a stream processing framework that supports both batch and
real-time processing of data.
• Flink provides an alternative to Hadoop MapReduce for executing MapReduce
jobs with low-latency and high throughput.
• Developers can write MapReduce programs using Flink's DataSet API or
DataStream API in languages such as Java and Scala.
• Flink offers advanced features such as stateful computations, event time
processing, and exactly-once semantics for data consistency.
• Amazon EMR (Elastic MapReduce):
• Amazon EMR is a managed Big Data platform offered by AWS (Amazon Web
Services) that provides Hadoop MapReduce as a service.
• Organizations can use Amazon EMR to launch and manage Hadoop clusters
on AWS infrastructure, without the need to provision or manage hardware.
• EMR supports other processing engines such as Apache Spark, Apache Hive,
and Apache HBase, providing flexibility for executing MapReduce jobs and
other Big Data workloads.
• Google Cloud Dataproc:
• Google Cloud Dataproc is a managed Big Data service provided by Google
Cloud Platform (GCP) for running Apache Hadoop, Apache Spark, and other
Big Data frameworks.
• Organizations can use Dataproc to create and manage Hadoop clusters on
GCP infrastructure, with support for executing MapReduce jobs and other
data processing tasks.
• Dataproc integrates with other GCP services such as BigQuery, Cloud Storage,
and Dataflow, enabling seamless data ingestion, storage, and analysis in the
cloud.
• These are some of the common methods for executing MapReduce
jobs, each offering its own advantages in terms of performance,
scalability, ease of use, and integration with existing infrastructure.
Organizations can choose the approach that best fits their
requirements and infrastructure environment.
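• As a sketch of the Spark option above, the same word-count logic can be written on the RDD API in Java (Spark 2.x or later assumed); the input and output paths are placeholders, and the master URL is normally supplied by spark-submit.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

// Word count on Spark's RDD API: flatMap/mapToPair play the Map role, reduceByKey the Reduce role.
public class SparkWordCount {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("spark-word-count");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      JavaRDD<String> lines = sc.textFile(args[0]);                       // e.g. an HDFS path
      JavaPairRDD<String, Integer> counts = lines
          .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())  // split lines into words
          .mapToPair(word -> new Tuple2<>(word, 1))                       // emit (word, 1)
          .reduceByKey(Integer::sum);                                     // sum counts per word
      counts.saveAsTextFile(args[1]);                                     // output directory
    }
  }
}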
Introduction to Apache Hive

• Apache Hive is an open-source data warehousing infrastructure built
on top of Hadoop for querying and analyzing large datasets using a
SQL-like language called HiveQL. It provides a familiar interface for
interacting with the Hadoop Distributed File System (HDFS) and other
storage systems compatible with Hadoop.
Key Components of Hive:

• Hive Metastore: The Hive Metastore is a centralized repository that
stores metadata about Hive tables, partitions, columns, and other
schema-related information.
• It maintains a mapping between Hive tables and their corresponding
physical storage locations in HDFS or other storage systems.
• Hive Query Language (HiveQL):
• HiveQL is a SQL-like language used to query and analyze data stored
in Hive tables.
• It supports standard SQL operations such as SELECT, INSERT, UPDATE,
DELETE, JOIN, GROUP BY, ORDER BY, and WHERE clauses.
• HiveQL queries are translated into MapReduce, Tez, or Spark jobs for
execution on the Hadoop cluster (see the sample query after this list).
• Hive Execution Engine:
• Hive can execute queries using different execution engines, including
MapReduce, Tez, and Spark, depending on the underlying data
processing framework and configuration.
• MapReduce is the default execution engine for Hive, but Tez and
Spark offer better performance and optimization for certain types of
queries.
• Hive SerDe (Serializer/Deserializer):
• Hive SerDe is a library that allows Hive to process various data
formats such as CSV, JSON, Avro, Parquet, ORC, and others.
• SerDe enables Hive to read and write data in different formats by
providing serialization and deserialization capabilities.
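• For example, a simple aggregation over a hypothetical sales table (the table and column names are illustrative) is written in plain SQL style; Hive compiles it into MapReduce, Tez, or Spark jobs, and the table's SerDe handles the underlying file format:

hive> SELECT region, SUM(amount) AS total_sales
      FROM sales
      WHERE sale_date >= '2024-01-01'
      GROUP BY region
      ORDER BY total_sales DESC;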
Key Features of Hive:

• SQL-Like Interface: Hive provides a familiar SQL-like interface for
querying and analyzing data, making it accessible to users with SQL skills.
• Schema on Read: Hive supports schema-on-read, allowing users to define
the structure of data at query time rather than during data ingestion.
This flexibility enables Hive to work with diverse and evolving data formats.
• Integration with Hadoop Ecosystem: Hive seamlessly integrates with
other Hadoop ecosystem tools and frameworks such as HDFS,
MapReduce, Tez, Spark, HBase, and others, enabling organizations to
build comprehensive Big Data solutions.

• Data Warehousing Capabilities: Hive is designed for data warehousing
and analytics use cases, allowing organizations to perform complex
queries, aggregations, joins, and transformations on large datasets
stored in Hadoop.
• Partitioning and Buckets: Hive supports partitioning and bucketing of
data, enabling efficient data organization and retrieval based on
specific partition keys and bucketing columns.
• Extensibility: Hive is extensible and customizable, allowing users to
define custom functions (UDFs), custom SerDe libraries, and custom
data formats to meet specific business requirements.
Use Cases of Hive:

• Data Warehousing: Hive is commonly used for building data warehouses and
data lakes on Hadoop, enabling organizations to store, manage, and analyze
large volumes of structured and semi-structured data.

• Ad-Hoc Querying: Hive is suitable for ad-hoc querying and exploratory
data analysis, allowing users to interactively query and explore data
using a SQL-like language without the need for complex programming.
• ETL (Extract, Transform, Load): Hive can be used for data
transformation and ETL processes, enabling organizations to extract
data from various sources, transform it into a suitable format, and
load it into Hadoop for analysis.
• Batch Processing: Hive is well-suited for batch processing of large
datasets using MapReduce, Tez, or Spark, allowing organizations to
perform batch analytics, reporting, and data processing tasks.
• In summary, Apache Hive provides a powerful and versatile platform
for querying and analyzing large datasets stored in Hadoop, offering
SQL-like interface, scalability, extensibility, and integration with the
Hadoop ecosystem.
• It is widely used in various industries for data warehousing, ad-hoc
querying, ETL, and batch processing tasks.
Hive Architecture
• The architecture of Apache Hive comprises several components that
work together to enable querying and analyzing large datasets stored
in Hadoop Distributed File System (HDFS) or other compatible storage
systems.
The key components and their roles in the Hive architecture are:

• Hive Client:
• The Hive client is the interface through which users interact with the Hive
system. It can be a command-line interface (CLI), web-based interface (Hue),
or JDBC/ODBC-based client application.
• Users submit HiveQL queries to the Hive client for processing.
• Hive Driver:
• The Hive Driver receives queries from the Hive client and coordinates their
execution within the Hive system.
• It parses, compiles, optimizes, and executes HiveQL queries, generating an
execution plan that specifies the sequence of tasks required to execute the
query.
• Hive Compiler:
• The Hive Compiler is responsible for translating HiveQL queries into a series of
MapReduce, Tez, or Spark jobs for execution on the Hadoop cluster.
• It generates an execution plan (also known as the query plan) based on the
query semantics and optimization rules.
• Hive Metastore:
• The Hive Metastore is a centralized repository that stores metadata about
Hive tables, partitions, columns, storage location, and other schema-related
information.
• It maintains a mapping between logical Hive tables and their corresponding
physical storage locations in HDFS or other storage systems.
• The Metastore is typically backed by a relational database such as MySQL,
PostgreSQL, or Derby.
• Hive Server:
• The Hive Server provides a Thrift or JDBC/ODBC interface for external clients
to submit HiveQL queries and interact with the Hive system (a JDBC sketch
appears at the end of this section).
• It manages connections from multiple clients and coordinates query
execution across the Hadoop cluster.
• Hadoop Distributed File System (HDFS):
• HDFS is the primary storage system used by Hive for storing large volumes of
data in distributed fashion across nodes in the Hadoop cluster.
• Hive tables are typically stored as files in HDFS, with each table represented
as a directory containing one or more data files.
• Execution Engine (MapReduce, Tez, Spark):
• The Execution Engine is responsible for executing the tasks generated by the
Hive Compiler on the Hadoop cluster.
• Hive supports different execution engines, including MapReduce (default),
Tez, and Spark, depending on the underlying data processing framework and
configuration.
• The execution engine distributes tasks across nodes in the cluster, manages
task execution, and aggregates results for query processing.
• Hive UDFs (User-Defined Functions):
• Hive UDFs are custom functions developed by users to extend the
functionality of Hive and perform specialized data processing tasks.
• UDFs can be implemented in Java, Scala, Python, or other programming
languages and registered with Hive for use in HiveQL queries.
• In summary, the Hive architecture consists of components such as the
client, driver, compiler, metastore, server, storage system (HDFS),
execution engine, and UDFs, working together to enable querying,
analyzing, and processing large datasets stored in Hadoop.
• Each component plays a specific role in the Hive ecosystem,
facilitating the execution of HiveQL queries and providing a scalable
and efficient platform for Big Data analytics.
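• As a sketch of the client/server path described above, an external Java program can submit HiveQL through the HiveServer2 JDBC interface; the host, port, database, and credentials below are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Connects to HiveServer2 over JDBC and runs a HiveQL statement.
public class HiveJdbcClient {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");        // HiveServer2 JDBC driver
    try (Connection con = DriverManager.getConnection(
             "jdbc:hive2://localhost:10000/default", "hiveuser", "");
         Statement stmt = con.createStatement();
         ResultSet rs = stmt.executeQuery("SHOW TABLES")) {  // compiled and executed by Hive
      while (rs.next()) {
        System.out.println(rs.getString(1));
      }
    }
  }
}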
Data types in Hive

• In Apache Hive, data types define the type of values that can be
stored in columns of Hive tables.
• Hive supports a wide range of primitive and complex data types to
accommodate various data formats and use cases.
• Here are the common data types supported by Hive:
Primitive Data Types:

• BOOLEAN: Represents a boolean value (true or false).


• TINYINT: Represents a 1-byte signed integer (-128 to 127).
• SMALLINT: Represents a 2-byte signed integer (-32,768 to 32,767).
• INT (INTEGER): Represents a 4-byte signed integer (-2^31 to 2^31 - 1).
• BIGINT: Represents an 8-byte signed integer (-2^63 to 2^63 - 1).
• FLOAT: Represents a single-precision floating-point number.
• DOUBLE: Represents a double-precision floating-point number.
• STRING: Represents variable-length character strings.
• VARCHAR: Represents variable-length character strings with a specified
maximum length.
• CHAR: Represents fixed-length character strings with a specified length.
• DATE: Represents a date in the format 'YYYY-MM-DD'.
• TIMESTAMP: Represents a timestamp in the format 'YYYY-MM-DD
HH:MM:SS.FFFFFF'.
• BINARY: Represents binary data (arbitrary byte array).
• DECIMAL: Represents fixed-point decimal numbers with configurable
precision and scale.
Complex Data Types:

• ARRAY: Represents an ordered collection of elements of the same type.


• MAP: Represents an associative array (key-value pairs) where keys and values can
be of different types.
• STRUCT: Represents a complex type consisting of multiple named fields (similar to
a struct in programming languages).
• UNIONTYPE: Represents a union of multiple data types, allowing a column to
contain values of different types.
User-Defined Types (UDTs):

• Hive allows users to define custom data types using SerDe (Serializer/Deserializer) libraries
and use them in Hive tables.
• Complex Data Types in JSON Format:
• JSON: Hive supports storing and querying data in JSON format using the STRING data type.
• Interval Data Type:
• INTERVAL: Represents intervals of time or time spans (a date/time type added in later Hive releases).
• These data types provide flexibility and versatility for storing and processing various types of
data in Hive tables.
• Users can choose appropriate data types based on the nature of the data, storage
requirements, and query processing needs.
• Additionally, Hive provides typecasting functions to convert data between different types
when necessary.
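• As a sketch, a single table can combine several of these types; the table name, column names, and delimiters below are illustrative.

hive> CREATE TABLE IF NOT EXISTS employee_profile (
        id       INT,
        name     STRING,
        salary   DECIMAL(10,2),
        skills   ARRAY<STRING>,
        contact  MAP<STRING,STRING>,
        address  STRUCT<city:STRING, zip:STRING>
      )
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      COLLECTION ITEMS TERMINATED BY '|'
      MAP KEYS TERMINATED BY ':';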
Database Operations in Hive (refer slide 93 & ahead)
• https://www.geeksforgeeks.org/database-operations-in-hive-using-cloudera-vmware-work-station/
• https://www.tutorialspoint.com/hive/hive_introduction.htm
• https://www.simplilearn.com/tutorials/hadoop-tutorial/hive
Partitioning in Hive
• The partitioning in Hive means dividing the table into some parts based on the
values of a particular column like date, course, city or country.
• The advantage of partitioning is that since the data is stored in slices, the query
response time becomes faster.
• As we know that Hadoop is used to handle the huge amount of data, it is
always required to use the best approach to deal with it.
• The partitioning in Hive is the best example of it.
• Let's assume we have data on 10 million students studying in an
institute. Now, we have to fetch the students of a particular course.
• If we use a traditional approach, we have to go through the entire
data. This leads to performance degradation.
• In such a case, we can adopt the better approach i.e., partitioning in
Hive and divide the data among the different datasets based on
particular columns.
The partitioning in Hive can be executed in two
ways -
• Static partitioning
• Dynamic partitioning
• Static Partitioning
• In static or manual partitioning, it is required to pass the values of
partitioned columns manually while loading the data into the table.
Hence, the data file doesn't contain the partitioned columns.
• Dynamic Partitioning
• In dynamic partitioning, the values of partitioned columns exist within
the table. So, it is not required to pass the values of partitioned
columns manually.
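• A minimal sketch of both approaches, assuming a student table partitioned by course (table names, column names, and file paths are illustrative):

-- Static partitioning: the partition value is supplied explicitly at load time
hive> CREATE TABLE student (id INT, name STRING) PARTITIONED BY (course STRING);
hive> LOAD DATA LOCAL INPATH '/tmp/java_students.txt'
      INTO TABLE student PARTITION (course = 'java');

-- Dynamic partitioning: the partition value is taken from the data itself
hive> SET hive.exec.dynamic.partition = true;
hive> SET hive.exec.dynamic.partition.mode = nonstrict;
hive> INSERT INTO TABLE student PARTITION (course)
      SELECT id, name, course FROM student_staging;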
HBASE
Limitations of Hadoop

• Hadoop can perform only batch processing, and data will be accessed
only in a sequential manner.
• That means one has to search the entire dataset even for the simplest
of jobs.
• At this point, a new solution is needed to access any point of data in a
single unit of time (random access).
Hadoop Random Access Databases

• Applications such as HBase, Cassandra, CouchDB, Dynamo, and
MongoDB are some of the databases that store huge amounts of data
and access the data in a random manner.
What is HBase?

• HBase is a distributed column-oriented database built on top of the
Hadoop file system.
• It is an open-source project and is horizontally scalable.
• HBase's data model is similar to Google's Bigtable and is designed to
provide quick random access to huge amounts of structured data.
• It leverages the fault tolerance provided by the Hadoop File System
(HDFS).
• It is a part of the Hadoop ecosystem that provides random real-time
read/write access to data in the Hadoop File System.
• One can store the data in HDFS either directly or through HBase.
• Data consumer reads/accesses the data in HDFS randomly using HBase.
• HBase sits on top of the Hadoop File System and provides read and write
access.
HBase and HDFS
Storage Mechanism in HBase
• HBase is a column-oriented database and the tables in it are sorted by
row.
• The table schema defines only column families, which are the key value
pairs.
• A table can have multiple column families, and each column family can have
any number of columns.
• Subsequent column values are stored contiguously on the disk. Each cell
value of the table has a timestamp.
• In short, in an HBase:
Table is a collection of rows.
Row is a collection of column families.
Column family is a collection of columns.
Column is a collection of key value pairs.
Given below is an example schema of a table in HBase
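• An illustrative layout (the row keys, column families, and columns are hypothetical):

Row key | personal:name | personal:city | professional:designation | professional:salary
emp1    | raju          | hyderabad     | manager                  | 50000
emp2    | ravi          | chennai       | sr. engineer             | 30000

• Here "personal" and "professional" are column families, and each column is addressed as family:qualifier.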
HBase and RDBMS
Features of HBase
• HBase is linearly scalable.
• It has automatic failure support.
• It provides consistent reads and writes.
• It integrates with Hadoop, both as a source and a destination.
• It has an easy Java API for clients.
• It provides data replication across clusters.
HBase - Architecture
• In HBase, tables are split into regions and are served by the region
servers.
• Regions are vertically divided by column families into “Stores”.
• Stores are saved as files in HDFS.
• HBase has three major components: the client library, a master
server, and region servers.
• Region servers can be added or removed as per requirement.
MasterServer
• The master server -
• Assigns regions to the region servers and takes the help of Apache
ZooKeeper for this task.
• Handles load balancing of the regions across region servers. It unloads the
busy servers and shifts the regions to less occupied servers.
• Maintains the state of the cluster by negotiating the load balancing.
• Is responsible for schema changes and other metadata operations such as
creation of tables and column families.
Regions
• Regions are nothing but tables that are split up and spread across the
region servers.
• Region server:
The region servers have regions that -
Communicate with the client and handle data-related operations.
Handle read and write requests for all the regions under it.
Decide the size of the region by following the region size thresholds.
• When we take a deeper look into a region server, we see that it contains
regions and stores, as shown below:
• Each store contains a MemStore and HFiles. The MemStore acts like a cache:
anything written to HBase is stored in it first.
• Later, the data is flushed and saved to HFiles as blocks, and the MemStore
is cleared.
Zookeeper
• Zookeeper is an open-source project that provides services like
maintaining configuration information, naming, providing distributed
synchronization, etc.
• Zookeeper has ephemeral nodes representing different region servers.
Master servers use these nodes to discover available servers.
• In addition to availability, the nodes are also used to track server failures
or network partitions.
• Clients locate region servers via ZooKeeper before communicating with them.
• In pseudo-distributed and standalone modes, HBase itself takes care of ZooKeeper.
HBase - Shell
• HBase contains a shell using which you can communicate with HBase.
• HBase uses the Hadoop File System to store its data.
• It will have a master server and region servers.
• The data storage will be in the form of regions (tables).
• These regions will be split up and stored in region servers.
• The master server manages these region servers and all these tasks
take place on HDFS.
• Given below are some of the commands supported by HBase Shell.
General Commands
• status - Provides the status of HBase, for example, the number of
servers.
• version - Provides the version of HBase being used.
• table_help - Provides help for table-reference commands.
• whoami - Provides information about the user.
Data Definition Language
• create - Creates a table.
• list - Lists all the tables in HBase.
• disable - Disables a table.
• is_disabled - Verifies whether a table is disabled.
• enable - Enables a table.
• is_enabled - Verifies whether a table is enabled.
• describe - Provides the description of a table.
• alter - Alters a table.
• exists - Verifies whether a table exists.
• drop - Drops a table from HBase.
• drop_all - Drops the tables matching the ‘regex’ given in the command.
Data Manipulation Language
• put - Puts a cell value at a specified column in a specified row in a particular
table.
• get - Fetches the contents of row or a cell.
• delete - Deletes a cell value in a table.
• deleteall - Deletes all the cells in a given row.
• scan - Scans and returns the table data.
• count - Counts and returns the number of rows in a table.
• truncate - Disables, drops, and recreates a specified table.
• Java client API - In addition to the shell commands above, Java provides a client
API to achieve DML functionality, CRUD (Create, Read, Update, Delete) operations,
and more through programming, under the org.apache.hadoop.hbase.client package.
HTable, Put, and Get are important classes in this package.
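• A minimal sketch of that Java client API, using the connection-based classes found in recent HBase releases (older code used HTable directly); the table name, column family, and values are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Writes one cell with put and reads it back with get.
public class HBaseCrudSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();        // reads hbase-site.xml from the classpath
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("employee"))) {
      Put put = new Put(Bytes.toBytes("row1"));
      put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"), Bytes.toBytes("raju"));
      table.put(put);                                        // write the cell

      Get get = new Get(Bytes.toBytes("row1"));
      Result result = table.get(get);                        // read the row back
      byte[] value = result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("name"));
      System.out.println("name = " + Bytes.toString(value));
    }
  }
}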
HIVE
• Hive is a data warehouse infrastructure tool to process structured data in
Hadoop.
• It resides on top of Hadoop to summarize Big Data, and makes querying and
analyzing easy.
• Hive was initially developed by Facebook; later the Apache Software Foundation
took it up and developed it further as an open-source project under the name
Apache Hive.
• It is used by different companies. For example, Amazon uses it in Amazon
Elastic MapReduce.
Hive is not

• A relational database
• A design for OnLine Transaction Processing (OLTP)
• A language for real-time queries and row-level updates
Features of Hive

• It stores the schema in a database (the metastore) and the processed data in HDFS.
• It is designed for OLAP.
• It provides SQL type language for querying called HiveQL or HQL.
• It is familiar, fast, scalable, and extensible.
Architecture of Hive
Working of Hive
Hive - Data Types

• All the data types in Hive are classified into four types, given as
follows:
1. Column Types
2. Literals
3. Null Values
4. Complex Types
Column Types
1. Integral Types
• Integer type data can be specified using integral data types,
INT. When the data range exceeds the range of INT, you need
to use BIGINT and if the data range is smaller than the INT, you
use SMALLINT. TINYINT is smaller than SMALLINT.
2. String Types
• String type data types can be specified using single quotes (' ')
or double quotes (" ").
• It contains two data types: VARCHAR and CHAR.
3. Timestamp
• It supports traditional UNIX timestamp with optional
nanosecond precision.
• It supports java.sql.Timestamp format “YYYY-MM-DD
HH:MM:SS.fffffffff” and format “yyyy-mm-dd
hh:mm:ss.ffffffffff”.
4. Dates
• DATE values are described in year/month/day format
in the form YYYY-MM-DD.
5. Decimals
• The DECIMAL type in Hive is the same as Java's BigDecimal
format. It is used for representing immutable arbitrary-precision
numbers. The syntax and an example are as follows:
• DECIMAL(precision, scale), e.g. DECIMAL(10,0)
6. Union Types
• A union is a collection of heterogeneous data types. An instance
is created using the create_union function. The syntax and an
example are as follows:
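• A representative column declaration (the column name is illustrative):

col1 UNIONTYPE<INT, DOUBLE, ARRAY<STRING>, STRUCT<a:INT, b:STRING>>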
Literals
1. Floating Point Types
• Floating point types are nothing but numbers with
decimal points. Generally, this type of data is
composed of DOUBLE data type.
2. Decimal Type
• Decimal type data is nothing but a floating point value
with a higher range than the DOUBLE data type. The range
of the decimal type is approximately -10^-308 to 10^308.
Null Value

• Missing values are represented by the special value NULL.


Complex Types
1. Arrays
• Arrays in Hive are used the same way they are used in Java.
• Syntax: ARRAY<data_type>
2. Maps
• Maps in Hive are similar to Java Maps.
• Syntax: MAP<primitive_type, data_type>
3. Structs
• A STRUCT in Hive groups multiple named fields into a single value,
each field optionally carrying a comment.
• Syntax: STRUCT<col_name : data_type [COMMENT col_comment], ...>
Hive - Create Database
• Create Database Statement
• Create Database is a statement used to create a database in
Hive. A database in Hive is a namespace or a collection of
tables. The syntax for this statement is as follows:
• CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] <database name>
• Here, IF NOT EXISTS is an optional clause that suppresses the
error if a database with the same name already exists.
• We can use SCHEMA in place of DATABASE in this command.
The following queries create a database named userdb:
• hive> CREATE DATABASE IF NOT EXISTS userdb;
• hive> CREATE SCHEMA userdb;
Hive - Drop Database
• Drop Database Statement
• Drop Database is a statement that drops all the tables and
deletes the database. Its syntax is as follows:
• DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];
• The following queries are used to drop a database. Let us
assume that the database name is userdb.
• hive> DROP DATABASE IF EXISTS userdb;
• hive> DROP SCHEMA userdb;
Hive - Create Table
• Create Table Statement
• Create Table is a statement used to create a table in Hive. The
syntax and example are as follows:
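• A representative form of the statement, followed by an illustrative employee table (all names are placeholders):

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
  (col_name data_type [COMMENT col_comment], ...)
  [COMMENT table_comment]
  [ROW FORMAT row_format]
  [STORED AS file_format];

hive> CREATE TABLE IF NOT EXISTS employee (
        eid INT, name STRING, salary STRING, destination STRING)
      COMMENT 'Employee details'
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\t'
      LINES TERMINATED BY '\n'
      STORED AS TEXTFILE;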
Load Data Statement
• Generally, after creating a table in SQL, we can insert data using the Insert
statement. But in Hive, we can insert data using the LOAD DATA
statement.
• While inserting data into Hive, it is better to use LOAD DATA to store bulk
records.
• There are two ways to load data: one is from local file system and second
is from Hadoop file system.
• Syntax
• The syntax for load data is as follows:
• LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE
tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
• LOCAL is identifier to specify the local path. It is optional.
• OVERWRITE is optional to overwrite the data in the table.
• PARTITION is optional.
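• An illustrative load into the employee table above (the file path is a placeholder):
• hive> LOAD DATA LOCAL INPATH '/home/user/sample.txt' OVERWRITE INTO TABLE employee;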
Hive - Alter Table
• It is used to alter a table in Hive.
Change Statement
• The CHANGE clause is used to rename a column of the employee table,
change its data type, or both, as illustrated below.
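• For example, renaming a column and changing another column's type might look like this (a sketch; column names follow the employee example):
• hive> ALTER TABLE employee CHANGE name ename STRING;
• hive> ALTER TABLE employee CHANGE salary salary DOUBLE;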
Hive - Drop Table
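• Drop Table removes the table definition (and, for managed tables, its data). A representative statement (the table name is a placeholder):
• hive> DROP TABLE IF EXISTS employee;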
Hive - Partitioning
• Hive organizes tables into partitions. It is a way of dividing a table into
related parts based on the values of partitioned columns such as date,
city, and department.
• Using partition, it is easy to query a portion of the data.
• For example, a table named Tab1 contains employee data such as id,
name, dept, and yoj (i.e., year of joining).
• Suppose you need to retrieve the details of all employees who joined in
2012.
• A query searches the whole table for the required information.
• However, if you partition the employee data with the year and store it in
a separate file, it reduces the query processing time.
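• A sketch of that example (names and the file path are illustrative): the yoj partition column keeps each joining year in its own directory, so the last query scans only the 2012 partition.

hive> CREATE TABLE Tab1 (id INT, name STRING, dept STRING)
      PARTITIONED BY (yoj STRING);
hive> LOAD DATA LOCAL INPATH '/tmp/emp_2012.txt'
      INTO TABLE Tab1 PARTITION (yoj = '2012');
hive> SELECT * FROM Tab1 WHERE yoj = '2012';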
