Hive

 Hadoop is an open source framework that is used to efficiently store and process large datasets ranging in size from gigabytes to petabytes.
 Hadoop is an open source distributed processing
framework that manages data processing and
storage for big data applications in scalable clusters
of computer servers. It's at the center of an
ecosystem of big data technologies that are primarily
used to support advanced analytics initiatives, data
mining and machine learning. Hadoop systems can
handle various forms of structured and unstructured
data, giving users more flexibility for collecting,
processing, analyzing and managing data than
relational databases and data warehouses provide.
 Hadoop's ability to process and store different types of data makes it a particularly good fit for big data environments. These environments typically involve not only large amounts of data, but also a mix of structured transaction data and semi-structured and unstructured information, such as internet clickstream records, web server and mobile application logs, social media posts, customer emails and sensor data from the internet of things (IoT).
 Formally known as Apache Hadoop, the technology is developed as part of an open source project within the Apache Software Foundation. Multiple vendors offer commercial Hadoop distributions, although the number of Hadoop vendors has declined because of an overcrowded market and competitive pressures driven by the increased deployment of big data systems in the cloud. The shift to the cloud also enables users to store data in lower-cost cloud object storage services instead of Hadoop's namesake file system; as a result, Hadoop's role is being reduced in some big data architectures.
Hive
Motivation
 Yahoo worked on Pig to facilitate application deployment on Hadoop.
◦ Their need was mainly focused on unstructured data
 Simultaneously, Facebook started working on deploying warehouse solutions on Hadoop, which resulted in Hive.
◦ The size of data being collected and analyzed in industry for business intelligence (BI) is growing rapidly, making traditional warehousing solutions prohibitively expensive.

05/31/2024 7
Hive architecture (from the paper)

Data model
 Hive structures data into well-understood database concepts such as tables, rows, columns, and partitions
 It supports primitive types: integers, floats, doubles, and strings
 Hive also supports:
◦ Associative arrays: map<key-type, value-type>
◦ Lists: list<element-type>
◦ Structs: struct<field-name: field-type, …>
 SerDe: a serialization/deserialization API used to move data in and out of tables
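
As a sketch of how these types appear in practice (the table and column names here are illustrative, not from the slides), a table combining primitives, a list, a map, and a struct might be declared as:

```sql
-- Hypothetical table illustrating Hive's primitive and complex types.
-- ROW FORMAT DELIMITED uses Hive's built-in default SerDe.
CREATE TABLE user_events (
  user_id    INT,
  score      DOUBLE,
  tags       ARRAY<STRING>,                    -- list type
  properties MAP<STRING, STRING>,              -- associative array
  device     STRUCT<os: STRING, version: INT>  -- struct type
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
```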
Query Language (HiveQL)
 Subset of SQL
 Meta-data queries
 No inserts into existing tables
◦ Can overwrite an entire table

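The overwrite-only semantics look like this in HiveQL (the table names `daily_summary` and `sales` are illustrative):

```sql
-- Hive, as described here, has no row-level INSERT or UPDATE on an
-- existing table; instead the whole table (or a partition) is replaced:
INSERT OVERWRITE TABLE daily_summary
SELECT ds, COUNT(*)
FROM sales
GROUP BY ds;
```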
Data Model
 Tables
 Basic type columns (int, float, boolean)
 Complex types: list / map (associative array)
 Partitions
 Buckets
 CREATE TABLE sales (
    id INT,
    items ARRAY<STRUCT<id:INT, name:STRING>>
  )
  PARTITIONED BY (ds STRING)
  CLUSTERED BY (id) INTO 32 BUCKETS;

 SELECT id FROM sales
  TABLESAMPLE (BUCKET 1 OUT OF 32);
Introduction to Hive
Apache Hive
 Run HiveQL, an SQL-like language, to interact with data stored in Hadoop.
 Demo: create a table, load the wordcount results from the Pig script into it, and retrieve the data.
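
The demo could be sketched as follows. This is an assumption about the setup, not the actual demo script: the HDFS path is hypothetical, and the Pig wordcount output is assumed to be tab-separated (word, count) pairs.

```sql
-- Expose the Pig wordcount output directory as an external Hive table.
-- Path and column names are illustrative assumptions.
CREATE EXTERNAL TABLE wordcount (
  word STRING,
  cnt  INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION '/user/demo/wordcount-output';

-- Retrieve the ten most frequent words.
SELECT word, cnt
FROM wordcount
ORDER BY cnt DESC
LIMIT 10;
```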
