Hive on Google Cloud
Introduction
Google Cloud Dataproc is Google's implementation of the Hadoop ecosystem, which includes the
Hadoop Distributed File System (HDFS) and the MapReduce processing framework. In addition, the
Google Cloud Dataproc system includes a number of applications built on top of Hadoop, such as
Hive, Mahout, Pig, Spark, and Hue.
Apache Hive (https://hive.apache.org/) is a SQL interface that operates on data stored in the
Hadoop Distributed File System (HDFS). The query language used with Hive is HiveQL, a
simplified dialect of SQL. Queries submitted through Hive are converted into MapReduce jobs
that access data stored in HDFS. The results are then aggregated and returned to the user or
application.
Part 1: Enabling the Google Cloud Compute Engine API and Dataproc API
If this is the first time the Google Compute Engine or Dataproc service is being used on
this account, the Compute Engine API (Application Programming Interface) and the Dataproc
API must be enabled. If both APIs have already been enabled, skip to the next part.
1. Pull down the Products & Services menu (three horizontal bars in the upper left corner)
and select the API Manager menu item.
2. When the API Manager screen appears, click on the blue Enable API button at the top of
the page.
3. Select the Compute Engine API from the list that pops up.
4. Click on the Enable button next to the Google Compute Engine API title at the top of
the page.
It may take a few minutes to enable this API.
5. Repeat this set of steps to enable the Dataproc API. If you do not see Dataproc in any of
the lists, search for Dataproc and enable it from the search results.
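If the Google Cloud SDK (the gcloud command line tool) is installed and authenticated on your
own machine, the same two APIs can typically be enabled from a terminal instead of the console.
This is an optional alternative, and it assumes the current project has already been set with
gcloud config:

# Enable the Compute Engine and Dataproc APIs for the current project
gcloud services enable compute.googleapis.com
gcloud services enable dataproc.googleapis.com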
Part 2: Creating a Storage Bucket and Loading Data
This section of the tutorial covers how to create a storage bucket and upload files to the Google
Cloud Storage system.
1. Log in to the Google Cloud account at cloud.google.com
2. Click on the Console link in the upper right corner.
3. Click on the Products & Services icon (three horizontal bars) in the upper left corner.
4. Scroll down to the Storage group and select Storage.
5. Click on the Create Bucket button.
6. Fill in a name for the new bucket. Bucket names must be globally unique, so you may wish to
incorporate your initials or some other unique number. Make a note of the bucket name as it will
be used in later steps. Click on the Create button.
7. The new bucket should now be selected. Click on the Create Folder button to create a new
folder within this new bucket.
8. Name this new folder "data" and then click the CREATE button.
9. Repeat these steps to create two more folders named "output" and "logs". When completed,
the bucket and its three folders will appear as shown in the figure below:
10. Navigate to the data folder and click the UPLOAD FILES button.
11. Download the dataset from the following URL and then upload it via the UPLOAD FILES button:
http://optionsdata.baruch.cuny.edu/data1/delivery/data/trades_sample.csv
12. The file will be uploaded to Google Cloud Storage into the market-data-bucket/data folder.
If the file was saved under a different name, rename it to trades_sample.csv. At this point we
now have a new storage bucket with three folders, and a file named trades_sample.csv stored in
the data folder.
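As an optional alternative to the console steps above, the bucket and the data upload can be
handled with the gsutil tool from the Cloud SDK. The sketch below assumes the example bucket
name market-data-bucket; substitute the unique bucket name you chose:

# Create the bucket
gsutil mb gs://market-data-bucket/
# Download the sample file, then copy it into the data folder
# (Cloud Storage folders are just name prefixes, so the copy creates the folder)
curl -O http://optionsdata.baruch.cuny.edu/data1/delivery/data/trades_sample.csv
gsutil cp trades_sample.csv gs://market-data-bucket/data/trades_sample.csv

The output and logs folders can still be created in the console as described above.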
Part 3: Creating a Dataproc Cluster
Now that the data is stored in Google Cloud Storage, a Hadoop cluster can be created using
the Google Cloud Dataproc services.
13. Log in to the Google Cloud system.
14. Click on the Console link in the upper right corner.
15. Click on the Products & Services icon (three horizontal bars) in the upper left corner.
16. Scroll down to the Big Data group and select Dataproc.
17. Click on the blue Create cluster button in the middle of the screen.
18. Fill in the settings for the new cluster. Examples are provided:
– Name for the cluster: market-data-cluster
– Zone (the region of the world where the cluster runs): us-east1-c, or pick the zone that is
closest to you, least expensive, etc.
19. Set up the Master Node, which will host the YARN scheduler and the Hadoop Distributed File
System (HDFS) master node (NameNode). The settings are:
– Machine Type: n1-standard-4 (4 vCPU, 15 GB RAM). This is powerful enough for the
examples used in this tutorial.
– Cluster Mode: Standard (1 master)
– Primary Disk Size: 100 GB. Increase this if your data size will be any larger.
20. Now set up the worker nodes. Each worker node runs the YARN NodeManager and acts as an
HDFS storage node (DataNode). Select a Machine Type, Number of Nodes, and Primary Disk size
to match the planned workload. The choices given below are more than enough for the
sample data used in this tutorial:
– Machine Type: n1-standard-4 (4 vCPU, 15 GB RAM)
– Number of Nodes: 3
– Primary Disk Size: 100 GB
– Local Solid State Drives (SSD): 0
– YARN Cores: 12
– YARN Memory: 36 GB
21. The rest of the options under “Preemptible workers, bucket, network, version, initialization, &
access options” can be left at the default settings. The complete set of parameters is shown in
the figure below:
22. At this point, click the Create button to create and launch the cluster. Note that as long as the
cluster is running, your Google Cloud account will be charged.
23. Initially, the Status of the cluster will be "Provisioning". When the cluster is ready, the
Status will change to "Running".
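The cluster can also be created from the command line with the Cloud SDK. The following is a
sketch that matches the settings chosen above (cluster name, zone, machine types, and disk
sizes); newer SDK versions may also require a --region flag:

gcloud dataproc clusters create market-data-cluster \
    --zone us-east1-c \
    --master-machine-type n1-standard-4 \
    --master-boot-disk-size 100GB \
    --num-workers 3 \
    --worker-machine-type n1-standard-4 \
    --worker-boot-disk-size 100GB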
Part 4: Connecting to the Master Node using Secure Shell (ssh)
Now that the cluster is running, connect to the cluster using secure shell. Google provides a
great web browser-based secure shell client so there are no keys to manage or extra software
to install.
24. Click on the name of the cluster and then click on the VM Instances tab.
25. Click on the down arrow next to the SSH icon and select "Open in browser window" from the
drop-down menu.
26. Once the SSH connection is established, the shell prompt will appear.
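If you prefer a local terminal, roughly the same connection can be made with the Cloud SDK. The
instance name below assumes the cluster name from Part 3; Dataproc names the master node by
adding an -m suffix to the cluster name:

gcloud compute ssh market-data-cluster-m --zone us-east1-c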
Part 5: Run the Beeline Command Line Interface to Hive and Issue SQL Statements
An example of running the Beeline command line is shown below:
myusername@market-data-cluster-m:~$ beeline \
    -u jdbc:hive2://localhost:10000/default \
    -n myusername@market-data-cluster-m \
    -d org.apache.hive.jdbc.HiveDriver
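In this command, -u supplies the JDBC URL of the HiveServer2 service listening on port 10000 of
the master node, -n gives the user name to connect as, and -d names the JDBC driver class. When
finished with a Beeline session, exit with:

!quit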
Loading Data into Hive
The sample data file trades_sample.csv is already loaded in Google Cloud Storage (gs:) in the
market-data-bucket/data folder.
There are two main ways of working with data under Hive:
● Use CREATE TABLE to move the data from the HDFS (gs: in this case) into the Hive
warehouse as a table. Once the data is moved into the Hive table, the file is removed
from its original location in the HDFS.
● Use CREATE EXTERNAL TABLE, which leaves the source data file in the HDFS. With
this approach the data stays where it is (the gs: bucket in this example), but the table can
still be queried and manipulated using the same SQL statements.
For this example, create an EXTERNAL table with a given column structure that will use the
existing data file in place. The syntax is:
CREATE EXTERNAL TABLE trades_sample
(trading_date_time TIMESTAMP,
network CHAR(1),
message_category CHAR(1),
message_type CHAR(1),
message_sequence BIGINT,
market_exchange CHAR(1),
symbol VARCHAR(10),
trade_price DOUBLE,
trade_size BIGINT,
trade_conditions VARCHAR(6),
trade_conditions2 VARCHAR(6) )
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'gs://market-data-bucket/data/';
Note that the LOCATION clause in the CREATE TABLE statement must point to the bucket and
folder containing the data. It is also possible to point to a specific file. For example, to reference
a specific file in a folder use this syntax:
LOCATION 'gs://market-data-bucket/data/trades_sample.csv';
To use all of the files in a folder as part of the same table, just specify the name of the bucket
and folder as in:
LOCATION 'gs://market-data-bucket/data/';
Bucket names and folder names are case-sensitive. The complete path should be enclosed in
plain single quotes (some text editors produce "fancy quotes", so be careful when copying and
pasting commands). Be sure to use the bucket name chosen in Part 2 of the tutorial when the
bucket was first created. Do not use the TAB key to indent; just use spaces.
Type the CREATE EXTERNAL TABLE command shown above into the Beeline prompt and run it.
Now that the table has been created, use the DESCRIBE command to describe the structure of
the table:
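For example:

DESCRIBE trades_sample;

The output lists each column name and data type, matching the CREATE EXTERNAL TABLE statement
above.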
With the table in place, the usual SQL SELECT statements can be issued against it. Each query
will be turned into a MapReduce job that will operate over the data and return the results.
For example, see how many records there are in the trades_sample table:
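SELECT COUNT(*) FROM trades_sample;

Beeline shows the progress of the job and then returns a single row containing the record count.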
Some query examples:
1) Retrieve a list of trades that happened during the first minute of trading, at 9:30 am:
SELECT symbol, trade_price, trade_size
FROM trades_sample
WHERE hour(trading_date_time) = 9
AND minute(trading_date_time) = 30;
2) Find the total trading volume (trade_size) for each stock before 12:00 noon:
SELECT symbol, SUM(trade_size) AS total_volume
FROM trades_sample
WHERE hour(trading_date_time) < 12
GROUP BY symbol
ORDER BY symbol;
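The bucket created in Part 2 also contains an output folder. To save a query's results there as
comma-separated files instead of only displaying them in Beeline, a statement along the following
lines can be used. This is a sketch: the subfolder name volume_by_symbol is an arbitrary choice,
and any existing contents of that path will be overwritten:

INSERT OVERWRITE DIRECTORY 'gs://market-data-bucket/output/volume_by_symbol'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT symbol, SUM(trade_size) AS total_volume
FROM trades_sample
WHERE hour(trading_date_time) < 12
GROUP BY symbol;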
Part 6: Shutting down the Cluster
In the Dataproc Clusters list, select the cluster and click on the blue DELETE button to remove
it. A confirmation screen will appear; confirm the deletion to shut the cluster down and stop
further charges.
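The cluster can also be deleted from the command line, which is a convenient way to confirm that
billing has stopped. The command below assumes the cluster name from Part 3; newer SDK versions
may also ask for a --region flag:

gcloud dataproc clusters delete market-data-cluster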