05b Hive

HIVE is a data warehousing system designed to manage and query unstructured data as if it were structured, utilizing Hadoop's file system for storage and Map-Reduce for execution. It was developed at Facebook to handle the exponential growth of data, providing a familiar SQL interface and extensibility through user-defined functions and types. Key components include a shell for interactive queries, a driver for session management, a compiler for query optimization, and a metastore for schema management.


HIVE

 
Why Another Data Warehousing System?
— Problem: data, data, and more data
— Several TBs of new data every day

— The Hadoop experiment:
— Uses the Hadoop File System (HDFS)
— Scalable/available

— Problem
— Lacked expressiveness
— Map-Reduce is hard to program

— Solution: HIVE
Copyright Ellis Horowitz, 2011 - 2012
What is HIVE?
— A system for managing and querying unstructured data as if it were structured
— Uses Map-Reduce for execution
— HDFS for storage
— Key building principles
— SQL as a familiar data warehousing tool
— Extensibility (pluggable map/reduce scripts in the language of your choice; rich and user-defined data types; user-defined functions)
— Interoperability (extensible framework to support different file and data formats)
— Performance



Hive: Background  
— Started at Facebook
— Data was collected by nightly cron jobs into an Oracle DB
— "ETL" via hand-coded Python
— Grew from 10s of GBs (2006) to 1 TB/day of new data (2007), now 10x that



Source: cc-licensed slide by Cloudera
Hive Components  
— Shell: allows interactive queries
— Driver: session handles, fetch, execute
— Compiler: parse, plan, optimize
— Execution engine: DAG of stages (MR, HDFS,
metadata)
— Metastore: schema, location in HDFS, etc



Source: cc-licensed slide by Cloudera
Data Model  
— Tables
— Typed columns (int, float, string, boolean)
— Also, list: map (for JSON-like data)
— Partitions
— For example, range-partition tables by date
— Buckets
— Hash partitions within ranges (useful for sampling, join
optimization)



Source: cc-licensed slide by Cloudera
Type System
— Primitive types
– Integers: TINYINT, SMALLINT, INT, BIGINT
– Boolean: BOOLEAN
– Floating point numbers: FLOAT, DOUBLE
– String: STRING
— Complex types
– Structs: {a INT; b INT}
– Maps: M['group']
– Arrays: ['a', 'b', 'c']; A[1] returns 'b'
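As a sketch (the table and column names below are made up for illustration), complex types can be declared and queried like this:

```sql
-- Hypothetical table mixing primitive and complex types
CREATE TABLE user_events (
  userid    BIGINT,
  props     MAP<STRING, STRING>,           -- accessed as props['group']
  tags      ARRAY<STRING>,                 -- accessed as tags[1]
  location  STRUCT<city:STRING, zip:INT>   -- accessed as location.city
);

-- Accessing complex-typed columns in a query
SELECT userid, props['group'], tags[1], location.city
FROM user_events;
```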



Data Model - Tables
— Tables
— Analogous to tables in relational DBs
— Each table has a corresponding directory in HDFS
— Example
— Page view table name – pvs
— HDFS directory
— /wh/pvs

— Example:
 CREATE TABLE t1(ds string, ctry float, li list<map<string, struct<p1:int, p2:int>>>);


Data Model - Partitions
— Partitions
— Analogous to dense indexes on partition columns
— Nested sub-directories in HDFS for each combination of partition column values
— Allows users to efficiently retrieve rows
— Example
— Partition columns: ds, ctry
— HDFS for ds=20120410, ctry=US
— /wh/pvs/ds=20120410/ctry=US

— HDFS for ds=20120410, ctry=IN
— /wh/pvs/ds=20120410/ctry=IN



Hive Query Language – Contd.
— Partitioning – creating partitions (note that partition columns are declared separately from the data columns, not repeated among them)

CREATE TABLE test_part(c1 string, c2 int)
PARTITIONED BY (ds string, hr int);

— INSERT OVERWRITE TABLE
test_part PARTITION(ds='2009-01-01', hr=12)
SELECT * FROM t;

— ALTER TABLE test_part
ADD PARTITION(ds='2009-02-02', hr=11);



Partitioning - Contd.
SELECT * FROM test_part WHERE ds='2009-01-01';

— will only scan the files within the
/user/hive/warehouse/test_part/ds=2009-01-01 directory

SELECT * FROM test_part
WHERE ds='2009-02-02' AND hr=11;

— will only scan the files within the
/user/hive/warehouse/test_part/ds=2009-02-02/hr=11 directory



Data Model
— Buckets
— Split data based on the hash of a column – mainly for parallelism
— Data in each partition may in turn be divided into buckets based on the value of a hash function of some column of the table
— Example
— Bucket column: user, into 32 buckets
— HDFS file for user hash bucket 0
— /wh/pvs/ds=20120410/cntr=US/part-00000

— HDFS file for user hash bucket 20
— /wh/pvs/ds=20120410/cntr=US/part-00020
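A bucketed table is declared with CLUSTERED BY; the sketch below assumes illustrative column names for the page-view table:

```sql
-- Hash-partition rows on userid into 32 buckets within each partition
CREATE TABLE pvs (userid BIGINT, page STRING)
PARTITIONED BY (ds STRING, ctry STRING)
CLUSTERED BY (userid) INTO 32 BUCKETS;
```

Each bucket then maps to one part-file per partition directory, which is what enables efficient sampling and join optimization.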



Data Model
— External Tables
— Point to existing data directories in HDFS
— Can create tables and partitions
— Data is assumed to be in a Hive-compatible format
— Dropping an external table drops only the metadata
— Example: create external table
 CREATE EXTERNAL TABLE test_extern(c1 string, c2 int)
 LOCATION '/user/mytables/mydata';
 
 
 



Serialization/Deserialization
— Generic (de)serialization interface: SerDe
— Uses LazySerDe
— Flexible interface to translate unstructured data into structured data
— Designed to read data separated by different delimiter characters
— Additional SerDes are located in 'hive_contrib.jar'
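As a sketch of using a contrib SerDe (the jar path, table, and regex here are illustrative and would need adapting to your data):

```sql
-- Load the contrib jar so its SerDe classes are on the classpath
ADD JAR hive_contrib.jar;

-- Parse patterned text lines into columns with RegexSerDe
CREATE TABLE apache_log (host STRING, request STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) \"([^\"]*)\""
)
STORED AS TEXTFILE;
```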



Hive Tables
— Two types of tables
— External table
— Table created on top of existing data
— delete the table ⇒ data still persists
— Normal (managed) table
— Table's location is in Hive's default location

— delete the table ⇒ data is gone



Create Table
— Example row: Employee1 | Name1 | Address1 | Phone1
— CREATE EXTERNAL TABLE employee(Key1 String, Name String, Address String, Phone String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '/….';



Hive File Formats
— Hive lets users store data in different file formats
— Helps with performance improvements
— SQL example:
CREATE TABLE dest1(key INT, value STRING)
STORED AS
INPUTFORMAT
'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.mapred.SequenceFileOutputFormat';



System  Architecture  and  Components  



System Architecture and Components

[Diagram: CLI, Web Interface, and JDBC/ODBC clients (via the Thrift Server)
connect to the Driver (Compiler, Optimizer, Executor), which consults the
Metastore]

• Metastore
The component that stores the system catalog and metadata about tables, columns, partitions, etc.
Stored in a traditional RDBMS.
• Driver
The component that manages the lifecycle of a HiveQL statement as it moves through Hive. The driver also maintains a session handle and any session statistics.
• Query Compiler
The component that compiles HiveQL into a directed acyclic graph of map/reduce tasks.
• Optimizer
Consists of a chain of transformations such that the operator DAG resulting from one transformation is passed as input to the next. Performs tasks like column pruning, partition pruning, and repartitioning of data.
 
 
• Execution Engine
The component that executes the tasks produced by the compiler in proper dependency order. The execution engine interacts with the underlying Hadoop instance.
• HiveServer
The component that provides a Thrift interface and a JDBC/ODBC server, and provides a way of integrating Hive with other applications.
• Client Components
Client components include the Command Line Interface (CLI), the web UI, and the JDBC/ODBC driver.
Hive Query Language
— Basic SQL
— FROM clause sub-queries
— ANSI JOIN (equi-join only)
— Multi-table insert
— Multi group-by
— Sampling
— Object traversal
— Extensibility
— Pluggable map-reduce scripts using TRANSFORM
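Sampling, for instance, can be expressed with TABLESAMPLE over a bucketed table (the table and column names below are illustrative):

```sql
-- Read only bucket 1 of 32, hashing rows on the userid column;
-- on a table bucketed by userid this touches a single part-file
SELECT *
FROM pvs TABLESAMPLE(BUCKET 1 OUT OF 32 ON userid);
```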



Hive Query Language
— JOIN

SELECT t1.a1 AS c1, t2.b1 AS c2
FROM t1 JOIN t2 ON (t1.a2 = t2.b2);

— INSERTION

INSERT OVERWRITE TABLE t1
SELECT * FROM t2;



Hive Query Language – Contd.
— Insertion

INSERT OVERWRITE TABLE sample1 SELECT * FROM sample
WHERE ds='2012-02-24';

INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT * FROM sample
WHERE ds='2012-02-24';

INSERT OVERWRITE LOCAL DIRECTORY '/tmp/hive-sample-out' SELECT *
FROM sample;



Hive Query Language – Contd.
— Map Reduce

FROM (MAP doctext USING 'python wc_mapper.py' AS (word, cnt)
FROM docs
CLUSTER BY word
)
REDUCE word, cnt USING 'python wc_reduce.py';

— FROM (FROM session_table
SELECT sessionid, tstamp, data
DISTRIBUTE BY sessionid SORT BY tstamp
)
REDUCE sessionid, tstamp, data USING 'session_reducer.sh';



Hive Query Language
— Example of a multi-table insert query and its optimization
FROM (SELECT a.status, b.school, b.gender
      FROM status_updates a JOIN profiles b
      ON (a.userid = b.userid AND a.ds='2009-03-20')) subq1

INSERT OVERWRITE TABLE gender_summary
      PARTITION(ds='2009-03-20')
SELECT subq1.gender, COUNT(1)
GROUP BY subq1.gender

INSERT OVERWRITE TABLE school_summary
      PARTITION(ds='2009-03-20')
SELECT subq1.school, COUNT(1)
GROUP BY subq1.school



Hive:  Example  
— Hive looks similar to an SQL database
— Relational join on two tables:
— Table of word counts from Shakespeare collection
— Table of word counts from Homer
SELECT s.word, s.freq, k.freq FROM shakespeare s
JOIN homer k ON (s.word = k.word) WHERE s.freq >= 1 AND k.freq >= 1
ORDER BY s.freq DESC LIMIT 10;

word   s.freq  k.freq
the    25848   62394
I      23031    8854
and    19671   38985
to     18038   13526
of     16700   34654
a      14170    8057
you    12702    2720
my     11297    4135
in     10797   12445
is      8882    6884
Source: Material drawn from Cloudera training VM
Hive:  Behind  the  Scenes  
SELECT s.word, s.freq, k.freq FROM shakespeare s
JOIN homer k ON (s.word = k.word) WHERE s.freq >= 1 AND k.freq >= 1
ORDER BY s.freq DESC LIMIT 10;

(Abstract Syntax Tree)


(TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF shakespeare s) (TOK_TABREF homer k) (= (. (TOK_TABLE_OR_COL
s) word) (. (TOK_TABLE_OR_COL k) word)))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT
(TOK_SELEXPR (. (TOK_TABLE_OR_COL s) word)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL s) freq)) (TOK_SELEXPR (.
(TOK_TABLE_OR_COL k) freq))) (TOK_WHERE (AND (>= (. (TOK_TABLE_OR_COL s) freq) 1) (>= (. (TOK_TABLE_OR_COL k)
freq) 1))) (TOK_ORDERBY (TOK_TABSORTCOLNAMEDESC (. (TOK_TABLE_OR_COL s) freq))) (TOK_LIMIT 10)))

(compiles to one or more MapReduce jobs)


Metastore  
— Database: namespace containing a set of tables
— Holds table definitions (column types, physical
layout)
— Holds partitioning information
— Can be stored in Derby, MySQL, and many other
relational databases
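As a sketch, pointing the metastore at an external RDBMS is done in hive-site.xml; the host, database name, and driver below are placeholders for your own deployment:

```xml
<!-- Illustrative hive-site.xml fragment: back the metastore with MySQL -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://dbhost/metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
```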

Source: cc-licensed slide by Cloudera


Physical  Layout  
— Warehouse directory in HDFS
— E.g., /user/hive/warehouse
— Tables stored in subdirectories of warehouse
— Partitions form subdirectories of tables
— Actual data stored in flat files
— Control char-delimited text, or SequenceFiles
— With custom SerDe, can use arbitrary format
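The delimited-text layout above can be made explicit in the DDL; this sketch uses made-up table and column names, with Hive's default Ctrl-A field delimiter spelled out:

```sql
-- Plain text storage with an explicit field delimiter
-- ('\001' / Ctrl-A is Hive's default for delimited text)
CREATE TABLE clicks (userid BIGINT, url STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\001'
STORED AS TEXTFILE;
```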

Source: cc-licensed slide by Cloudera


Hive Usage @ Facebook
¬ Statistics per day:
¬ 4 TB of compressed new data added per day
¬ 135TB of compressed data scanned per day
¬ 7500+ Hive jobs on per day
¬ Hive simplifies Hadoop:
¬ ~200 people/month run jobs on Hadoop/Hive
¬ Analysts (non-engineers) use Hadoop through
Hive
¬ 95% of jobs are Hive Jobs

http://www.slideshare.net/cloudera/hw09-hadoop-
7/20/2010
development-at-facebook-hive-and-hdfs
Introduction to Hive 36
Conclusion
— Pros
— Good explanation of Hive and HiveQL with proper examples
— Architecture is well explained
— Usage of Hive is properly covered
— Cons
— Accepts only a subset of SQL queries
— Performance comparisons with other systems would have been welcome

