Hive on Google Cloud
Introduction
Google Cloud Dataproc is Google's implementation of the Hadoop ecosystem, which includes the
Hadoop Distributed File System (HDFS) and the MapReduce processing framework. In addition, the
Google Cloud Dataproc system includes a number of applications built on top of Hadoop, such as
Hive, Mahout, Pig, Spark, and Hue.
Apache Hive (https://hive.apache.org/) is a SQL interface that operates on data stored in the
Hadoop Distributed File System (HDFS). The query language used with Hive is HiveQL, a
simplified dialect of SQL. Queries submitted through Hive are converted into MapReduce jobs
that access data stored in HDFS. The results are then aggregated and returned to the user or
application.
Part 1: Enabling the Google Cloud Compute Engine API and Dataproc API
If this is the first time the Google Compute Engine or Dataproc service is being used on
this account, the Compute Engine API (Application Programming Interface) and the Dataproc
API must be enabled. If both APIs have already been enabled, skip to the next part.
1. Pull down the Products & Services menu (three horizontal bars in the upper left corner)
and select the API Manager menu item.
2. When the API Manager screen appears, click on the blue Enable API button at the top of
the page.
3. Select the Compute Engine API from the list that pops up.
4. Click on the Enable button next to the Google Compute Engine API title at the top of
the page.
It may take a few minutes to enable this API.
5. Repeat this set of steps to enable the Dataproc API. If you do not see Dataproc in any of
the lists, search for Dataproc and enable it from the search results.
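If the Google Cloud SDK (the gcloud command line tool) is installed and authenticated on your
own machine, the same two APIs can typically be enabled from a terminal instead of the console.
This is an optional alternative, and it assumes the current project has already been set with
gcloud config:

# Enable the Compute Engine and Dataproc APIs for the current project
gcloud services enable compute.googleapis.com
gcloud services enable dataproc.googleapis.com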
Part 2: Creating a Storage Bucket and Loading Data
This section of the tutorial covers how to create a storage bucket and upload files to the Google
Cloud Storage system.
1. Log in to the Google Cloud account at cloud.google.com
2. Click on the Console link in the upper right corner.
3. Click on the Products & Services icon (three horizontal bars) in the upper left corner.
4. Scroll down to the Storage group and select Storage.
5. Click on the Create Bucket button.
6. Fill in a name for the new bucket. Bucket names must be globally unique, so you may wish to
incorporate your initials or some other unique number. Make a note of the bucket name as it will
be used in later steps. Click on the Create button.
7. The new bucket should now be selected. Click on the Create Folder button to create a new
folder within this new bucket.
8. Name this new folder "data" and then click the CREATE button.
9. Repeat these steps to create two more folders named "output" and "logs". When completed,
the bucket and its three folders will appear as shown in the figure below:
10. Navigate to the data folder and click the UPLOAD FILES button.
11. Download the dataset from the following URL and then upload it via the UPLOAD FILES button:
http://optionsdata.baruch.cuny.edu/data1/delivery/data/trades_sample.csv
12. The file will be uploaded to Google Cloud Storage into the market-data-bucket/data folder.
If the file was saved under a different name, rename it to trades_sample.csv. At this point we
now have a new storage bucket with three folders, and a file named trades_sample.csv stored in
the data folder.
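As an optional alternative to the console steps above, the bucket and the data upload can be
handled with the gsutil tool from the Cloud SDK. The sketch below assumes the example bucket
name market-data-bucket; substitute the unique bucket name you chose:

# Create the bucket
gsutil mb gs://market-data-bucket/
# Download the sample file, then copy it into the data folder
# (Cloud Storage folders are just name prefixes, so the copy creates the folder)
curl -O http://optionsdata.baruch.cuny.edu/data1/delivery/data/trades_sample.csv
gsutil cp trades_sample.csv gs://market-data-bucket/data/trades_sample.csv

The output and logs folders can still be created in the console as described above.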
Part 3: Creating a Dataproc Cluster
Now that the data is stored in Google Cloud Storage, a Hadoop cluster can be created using
the Google Cloud Dataproc services.
13. Log in to the Google Cloud system.
14. Click on the Console link in the upper right corner.
15. Click on the Products & Services icon (three horizontal bars) in the upper left corner.
16. Scroll down to the Big Data group and select Dataproc.
17. Click on the blue Create cluster button in the middle of the screen.
18. Fill in the settings for the new cluster. Examples are provided:
– Name for the cluster: market-data-cluster
– Zone (the region of the world where the cluster runs): us-east1-c, or pick the zone that is
closest to you, least expensive, etc.
19. Set up the Master Node, which will host the YARN scheduler and the Hadoop Distributed File
System (HDFS) master node (NameNode). The settings are:
– Machine Type: n1-standard-4 (4 vCPU, 15 GB RAM). This is powerful enough for the
examples used in this tutorial.
– Cluster Mode: Standard (1 master)
– Primary Disk Size: 100 GB. Increase this if your data size will be any larger.
20. Now set up the worker nodes. Each worker node runs the YARN NodeManager and acts as an
HDFS storage node (DataNode). Select a Machine Type, Number of Nodes, and Primary Disk size
to match the planned workload. The choices given below are more than enough for the
sample data used in this tutorial:
– Machine Type: n1-standard-4 (4 vCPU, 15 GB RAM)
– Number of Nodes: 3
– Primary Disk Size: 100 GB
– Local Solid State Drives (SSD): 0
– YARN Cores: 12
– YARN Memory: 36 GB
21. The rest of the options under “Preemptible workers, bucket, network, version, initialization, &
access options” can be left at the default settings. The complete set of parameters is shown in
the figure below:
22. At this point, click the Create button to create and launch the cluster. Note that as long as the
cluster is running, your Google Cloud account will be charged.
23. Initially, the Status of the cluster will be "Provisioning". When the cluster is ready, the
Status will change to "Running".
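The cluster can also be created from the command line with the Cloud SDK. The following is a
sketch that matches the settings chosen above (cluster name, zone, machine types, and disk
sizes); newer SDK versions may also require a --region flag:

gcloud dataproc clusters create market-data-cluster \
    --zone us-east1-c \
    --master-machine-type n1-standard-4 \
    --master-boot-disk-size 100GB \
    --num-workers 3 \
    --worker-machine-type n1-standard-4 \
    --worker-boot-disk-size 100GB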
Part 4: Connecting to the Master Node using Secure Shell (ssh)
Now that the cluster is running, connect to the cluster using secure shell. Google provides a
great web browser-based secure shell client so there are no keys to manage or extra software
to install.
24. Click on the name of the cluster and then click on the VM Instances tab.
25. Click on the down arrow next to the SSH icon and select "Open in browser window" from the
drop-down menu.
26. Once the SSH connection is established, the shell prompt will appear.
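If you prefer a local terminal, roughly the same connection can be made with the Cloud SDK. The
instance name below assumes the cluster name from Part 3; Dataproc names the master node by
adding an -m suffix to the cluster name:

gcloud compute ssh market-data-cluster-m --zone us-east1-c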
Part 5: Run the Beeline Command Line Interface to Hive and Issue SQL Statements
An example of running the Beeline command line is shown below:
myusername@market-data-cluster-m:~$ beeline \
    -u jdbc:hive2://localhost:10000/default \
    -n myusername@market-data-cluster-m \
    -d org.apache.hive.jdbc.HiveDriver
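In this command, -u supplies the JDBC URL of the HiveServer2 service listening on port 10000 of
the master node, -n gives the user name to connect as, and -d names the JDBC driver class. When
finished with a Beeline session, exit with:

!quit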
Loading Data into Hive
The sample data file trades_sample.csv is already loaded in Google Cloud Storage (gs:) in the
market-data-bucket/data folder.
There are two main ways of working with data under Hive:
● Use CREATE TABLE to move the data from the HDFS (gs: in this case) into the Hive
warehouse as a table. Once the data is moved into the Hive table, the file is removed
from its original location in the HDFS.
● Use CREATE EXTERNAL TABLE, which leaves the source data file in the HDFS. With
this approach the data stays where it is (the gs: bucket in this example), but the table can
still be queried and manipulated using the same SQL statements.
For this example, create an EXTERNAL table with a given column structure that will use the
existing data file in place. The syntax is:
CREATE EXTERNAL TABLE trades_sample
(trading_date_time TIMESTAMP,
network CHAR(1),
message_category CHAR(1),
message_type CHAR(1),
message_sequence BIGINT,
market_exchange CHAR(1),
symbol VARCHAR(10),
trade_price DOUBLE,
trade_size BIGINT,
trade_conditions VARCHAR(6),
trade_conditions2 VARCHAR(6) )
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'gs://market-data-bucket/data/';
Note that the LOCATION clause in the CREATE TABLE statement must point to the bucket and
folder containing the data. It is also possible to point to a specific file. For example, to reference
a specific file in a folder use this syntax:
LOCATION 'gs://market-data-bucket/data/trades_sample.csv';
To use all of the files in a folder as part of the same table, just specify the name of the bucket
and folder as in:
LOCATION 'gs://market-data-bucket/data/';
Bucket names and folder names are case-sensitive. The complete path should be enclosed in
plain single quotes (some text editors produce "fancy quotes", so be careful when copying and
pasting commands). Be sure to use the bucket name chosen in Part 2 of the tutorial when the
bucket was first created. Do not use the TAB key to indent; just use spaces.
Type the CREATE EXTERNAL TABLE command shown above into the Beeline prompt and run it.
Now that the table has been created, use the DESCRIBE command to describe the structure of
the table:
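For example:

DESCRIBE trades_sample;

The output lists each column name and data type, matching the CREATE EXTERNAL TABLE statement
above.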
With the table in place, the usual SQL SELECT statements can be issued against it. Each query
will be turned into a MapReduce job that will operate over the data and return the results.
For example, see how many records there are in the trades_sample table:
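SELECT COUNT(*) FROM trades_sample;

Beeline shows the progress of the job and then returns a single row containing the record count.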
Some query examples:
1) Retrieve a list of trades that happened during the first minute of trading, at 9:30 am:
SELECT symbol, trade_price, trade_size
FROM trades_sample
WHERE hour(trading_date_time) = 9
AND minute(trading_date_time) = 30;
2) Find the total trading volume (trade_size) for each stock before 12:00 noon:
SELECT symbol, SUM(trade_size) AS total_volume
FROM trades_sample
WHERE hour(trading_date_time) < 12
GROUP BY symbol
ORDER BY symbol;
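The bucket created in Part 2 also contains an output folder. To save a query's results there as
comma-separated files instead of only displaying them in Beeline, a statement along the following
lines can be used. This is a sketch: the subfolder name volume_by_symbol is an arbitrary choice,
and any existing contents of that path will be overwritten:

INSERT OVERWRITE DIRECTORY 'gs://market-data-bucket/output/volume_by_symbol'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT symbol, SUM(trade_size) AS total_volume
FROM trades_sample
WHERE hour(trading_date_time) < 12
GROUP BY symbol;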
Part 6: Shutting down the Cluster
In the Dataproc Clusters list, select the cluster and click on the blue DELETE button to remove
it. A confirmation screen will appear; confirm the deletion to shut the cluster down and stop
further charges.
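The cluster can also be deleted from the command line, which is a convenient way to confirm that
billing has stopped. The command below assumes the cluster name from Part 3; newer SDK versions
may also ask for a --region flag:

gcloud dataproc clusters delete market-data-cluster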