Apache Hive Cookbook

Easy, hands-on recipes to help you understand Hive and its integration with frameworks that are used widely in today's big data world

Table of Contents

Apache Hive Cookbook
Credits
About the Authors
About the Reviewer
www.PacktPub.com (eBooks, discount offers, and more; Why Subscribe?)
Preface (What this book covers; What you need for this book; Who this book is for; Sections; Conventions; Reader feedback; Customer support: Downloading the example code, Downloading the color images of this book, Errata, Piracy, Questions)
1. Developing Hive
    Introduction
    Deploying Hive on a Hadoop cluster
    Deploying Hive Metastore
    Installing Hive (Hive with an embedded metastore, Hive with a local metastore, Hive with a remote metastore)
    Configuring HCatalog
    Understanding different components of Hive (HiveServer, Hive clients, Hive CLI, Beeline)
    Compiling Hive from source
    Hive packages
    Debugging Hive
    Running Hive
    Changing configurations at runtime
2. Services in Hive
    Introducing HiveServer2
    Understanding HiveServer2 properties
    Configuring HiveServer2 high availability
    Using HiveServer2 clients (Beeline, Beeline command options, JDBC, JDBC client sample code using Eclipse, Running the JDBC sample code from the command-line, JDBC datatypes, Other clients)
    Introducing the Hive metastore service
    Configuring high availability of metastore service
    Introducing Hue (Prepare dependencies, Downloading and installing Hue, Configuring Hive with Hue, Starting Hue, Accessing Hive with Hue)
3. Understanding the Hive Data Model
    Introduction
    Introducing data types (Primitive data types, Complex data types)
    Using numeric data types
    Using string data types
    Using Date/Time data types
    Using miscellaneous data types
    Using complex data types
    Using operators (relational, arithmetic, logical, complex)
    Partitioning (Partitioning a managed table, Adding new partitions, Renaming partitions, Exchanging partitions, Dropping the partitions, Loading data in a managed partitioned table, Partitioning an external table)
    Bucketing
4. Hive Data Definition Language
    Introduction
    Creating a database schema
    Dropping a database schema
    Altering a database schema
    Using a database schema
    Showing database schemas
    Describing a database schema
    Creating tables (Create table LIKE)
    Truncating tables
    Renaming tables
    Altering table properties
    Creating views
    Dropping views
    Altering the view properties
    Altering the view as select
    Showing tables
    Showing partitions
    Show the table properties
    Showing create table
    HCatalog (HCatalog DMLs)
    WebHCat
5. Hive Data Manipulation Language
    Introduction
    Loading files into tables
    Inserting data into Hive tables from queries
    Writing data into files from queries
    Enabling transactions in Hive
    Inserting values into tables from SQL
    Updating data
    Deleting data
6. Hive Extensibility Features
    Introduction
    Serialization and deserialization formats and data types (LazySimpleSerDe, RegexSerDe, JSONSerDe, CSVSerDe)
    Exploring views
    Exploring indexes
    Hive partitioning (Static partitioning, Dynamic partitioning)
    Creating buckets in Hive (Metastore view of bucketing)
    Analytics functions in Hive
    Windowing in Hive (LEAD, LAG, FIRST_VALUE, LAST_VALUE)
    File formats
7. Joins and Join Optimization
    Understanding the joins concept
    Using a left semi join
    Using a cross join
    Using a bucket map join
    Using a bucket sort merge map join
    Using a skew join
8. Statistics in Hive
    Bringing statistics in to Hive
    Table and partition statistics in Hive (Statistics for a partitioned table)
    Column statistics in Hive
9. Functions in Hive
    Using built-in functions (Mathematical functions, Collection functions, Type conversion functions, Date functions, String functions, Conditional functions, Miscellaneous functions)
    Using the built-in User-defined Aggregation Function (UDAF)
    Using the built-in User Defined Table Function (UDTF)
    Creating custom User-Defined Functions (UDF)
10. Hive Tuning
    Enabling predicate pushdown optimizations in Hive
    Optimizations to reduce the number of maps
    Sampling (Sampling bucketed table, Block sampling, Length literal, Row count)
11. Hive Security
    Giving read and write access to user mike
    Revoking the access of the user mike
    Default authorization (legacy mode)
    SQL standards-based authorization
    Configuring the SQL standards-based authorization (creating a role, deleting a role, showing the list of roles, setting a role, granting a role, revoking a role, checking roles of a user/role, checking principals of a role, granting privileges, revoking privileges, checking privileges of a user or role)
    Authenticating Hive (Anonymous with SASL (default no authentication), Anonymous without SASL, Kerberos, Configuring the JDBC client for Kerberos, Access Hive using the Hive JDBC client in Java, LDAP, Pluggable Authentication Modules, Custom)
12. Hive Integration with Other Frameworks
    Working with Apache Spark
    Working with Accumulo
    Working with HBase
    Working with Google Drill
Index

Apache Hive Cookbook

Copyright © 2016 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: April 2016
Production reference: 1260416
Published by Packt Publishing Ltd.
Livery Place, 35 Livery Street, Birmingham B3 2PB, UK.
ISBN 978-1-78216-108-0
www.packtpub.com

Credits

Authors: Hanish Bansal, Saurabh Chauhan, Shrey Mehrotra
Reviewer: Aristides Villarreal Bravo
Commissioning Editor: Wilson D'souza
Acquisition Editor: Tushar Gupta
Content Development Editor: Anish Dhurat
Technical Editor: Vishal K. Mewada
Copy Editor: Dipti Mankame
Project Coordinator: Bijal Patel
Proofreader: Safis Editing
Indexer: Priya Sane
Graphics: Kirk D'Penha
Production Coordinator: Shantanu N. Zagade
Cover Work: Shantanu N. Zagade

About the Authors

Hanish Bansal is a software engineer with over 4 years of experience in developing big data applications. He loves to study emerging solutions and applications mainly related to big data processing, NoSQL, natural language processing, and neural networks. He has worked on various technologies such as Spring Framework, Hibernate, Hadoop, Hive, Flume, Kafka, Storm, and NoSQL databases, which include HBase, Cassandra, MongoDB, and search engines such as Elasticsearch.

In 2012, he completed his graduation in the Information Technology stream from Jaipur Engineering College and Research Center, Jaipur, India. He was also the technical reviewer of the book Apache Zookeeper Essentials. In his spare time, he loves to travel and listen to music. You can read his blog and follow him on Twitter.

I would like to thank my parents for their love, support, encouragement and the amazing chances they've given me over the years.

Saurabh Chauhan is a module lead with close to 8 years of experience in data warehousing and big data applications.
He has worked on multiple Extract, Transform and Load tools, such as Oracle Data Integrator and Informatica, as well as on big data technologies such as Hadoop, Hive, Pig, Sqoop, and Flume. He completed his bachelor of technology in 2007 from Vishveshwarya Institute of Engineering and Technology.

In his spare time, he loves to travel and discover new places. He also has a keen interest in sports.

I would like to thank everyone who has supported me throughout my life.

Shrey Mehrotra has 6 years of IT experience and, for the past 4 years, has been designing and architecting cloud and big data solutions for the governance and financial domains. Having worked with big data R&D labs and Global Data and Analytical Capabilities, he has gained insights into Hadoop, focusing on HDFS, MapReduce, and YARN. His technical strengths also include Hive, Pig, Spark, Elasticsearch, Sqoop, Flume, Kafka, and Java.

He likes spending time performing R&D on different big data technologies. He is the co-author of the book Learning YARN, a certified Hadoop developer, and has also written various technical papers. In his free time, he listens to music, watches movies, and spends time with friends.

I would like to thank my mom and dad for giving me the support to accomplish anything I wanted. Also, I would like to thank my friends, who bear with me while I am busy writing.

About the Reviewer

Aristides Villarreal Bravo is a Java developer, a member of the NetBeans Dream Team, and a Java User Groups leader. He has organized and participated in various conferences and seminars related to Java, JavaEE, NetBeans, the NetBeans Platform, free software, and mobile devices, nationally and internationally.

He has written tutorials and blogs about Java, NetBeans, and web development, and has participated in several interviews on sites such as NetBeans, NetBeans Dzone, and JavaHispano. He has developed plugins for NetBeans and has been a technical reviewer for the book PrimeFaces Blueprints.

Aristides is the CEO of Javscaz Software Developers. He lives in Panama.

To my mother, father, and all family and friends.

www.PacktPub.com

eBooks, discount offers, and more

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at
for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

https://www2.packtpub.com/books/subscription/packtlib

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here you can search, access, and read Packt's entire library of books.

Why Subscribe?

* Fully searchable across every book published by Packt
* Copy and paste, print, and bookmark content
* On demand and accessible via a web browser

Preface

Hive is an open source big data framework in the Hadoop ecosystem. It provides an SQL-like interface to query data stored in HDFS. Under the hood, it runs MapReduce programs corresponding to the SQL query. Hive was initially developed by Facebook and later added to the Hadoop ecosystem.

Hive is currently the most preferred framework to query data in Hadoop. Because most of the historical data is stored in RDBMS data stores, including Oracle and Teradata, it is convenient for developers to run similar SQL statements in Hive to query data. Along with simple SQL statements, Hive supports a wide variety of windowing and analytical functions, including rank, row number, dense rank, lead, and lag.

Hive is considered the de facto big data warehouse solution. It provides a number of techniques to optimize storage and processing of terabytes or petabytes of data in a cost-effective way. Hive can be easily integrated with a majority of other frameworks, including Spark and HBase. Hive allows developers or analysts to execute SQL on it, and it also supports querying data stored in different formats such as JSON.

What this book covers

Chapter 1, Developing Hive, helps you configure Hive on a Hadoop platform. This chapter explains the different modes of Hive installation. It also provides pointers for debugging Hive and brief information about compiling the Hive source code and the different modules in the Hive source code.

Chapter 2, Services in Hive, gives a detailed description of the configuration and usage of the different services provided by Hive, such as HiveServer2. This chapter also explains the different clients of Hive, including the Hive CLI and Beeline.

Chapter 3, Understanding the Hive Data Model, takes you through the details of the different data types provided by Hive in order to be helpful in data modeling.

Chapter 4, Hive Data Definition Language, helps you understand the syntax and semantics of creating, altering, and dropping different objects in Hive, including databases, tables, functions, views, indexes, and roles.

Chapter 5, Hive Data Manipulation Language, gives you a complete understanding of Hive interfaces for data manipulation. This chapter also includes some of the latest features in Hive related to CRUD operations. It explains insert, update, and delete at the row level, available in Hive 0.14 and later versions.

Chapter 6, Hive Extensibility Features, covers a majority of the advanced concepts in Hive. This chapter explains concepts such as SerDes, partitions, bucketing, windowing and analytics, and file formats in Hive with detailed examples.

Chapter 7, Joins and Join Optimization, gives you a detailed explanation of the types of join supported by Hive. It also provides detailed information about the different types of join optimizations available in Hive.

Chapter 8, Statistics in Hive, allows you to capture and analyze tables, partitions, and column-level statistics. This chapter covers the configurations and commands used to capture these statistics.

Chapter 9, Functions in Hive, gives you a detailed overview of the extensive set of inbuilt functions supported by Hive, which can be used directly in queries. This chapter also covers how to create a custom User-Defined Function and register it in Hive.

Chapter 10, Hive Tuning, helps you optimize complex queries to reduce the throughput time. It covers different optimization techniques using predicate pushdown, reducing the number of maps, and sampling.

Chapter 11, Hive Security, covers concepts to secure the data from any unauthorized access. It explains the different mechanisms of authentication and authorization that can be implemented in Hive for security purposes. In the case of critical or sensitive data, security is the first thing that needs to be considered.

Chapter 12, Hive Integration with Other Frameworks, takes you through the integration mechanism of Hive with some other popular frameworks such as Spark, HBase, Accumulo, and Google Drill.

What you need for this book

To practice in parallel with reading the book, you need a machine or set of machines on which Hadoop is installed in either pseudo-distributed or clustered mode. To have a better understanding of the metastore concept, you should have configured Hive with a local or remote metastore using MySQL at the backend. You also need a sample dataset to practice the different windowing and analytical functions available in Hive and to optimize queries using concepts such as partitions and bucketing.

Who this book is for

This book covers almost all concepts of Hive. So, if you are a beginner in the big data Hadoop domain, you can start with installing Hive, understanding Hive services and clients, and using Hive data modeling concepts to design your data model. If you have basic knowledge of Hive, you can deep dive into some of the advanced concepts covered in the book, such as partitions, bucketing, file formats, security, and windowing and analytics.

In a nutshell, this book is helpful for both a Hadoop developer and a Hadoop analyst who want to explore Hive.

Sections

In this book, you will find several headings that appear frequently (Getting ready, How to do it, How it works, There's more, and See also). To give clear instructions on how to complete a recipe, we use these sections as follows:

Getting ready

This section tells you what to expect in the recipe, and describes how to set up any software or any preliminary settings required for the recipe.

How to do it...

This section contains the steps required to follow the recipe.

How it works...

This section usually consists of a detailed explanation of what happened in the previous section.

There's more...

This section consists of additional information about the recipe in order to make the reader more knowledgeable about the recipe.

See also

This section provides helpful links to other useful information for the recipe.

Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: By default, this location is set to metastore_db in the conf/hive-default.xml file.

A block of code is set as follows:
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/Hive/warehouse</value>
  <description>The directory relative to fs.default.name where managed tables are stored.</description>
</property>
Any command-line input or output is written as follows:

hive --service metastore &

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: Create a Maven project in Eclipse by going to File | New | Project.

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book, what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail
, and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

1. Log in or register to our website using your e-mail address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.

You can also download the code files by clicking on the Code Files button on the book's webpage at the Packt Publishing website. This page can be accessed by entering the book's name in the Search box. Please note that you need to be logged in to your Packt account.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

* WinRAR / 7-Zip for Windows
* Zipeg / iZip / UnRarX for Mac
* 7-Zip / PeaZip for Linux

Downloading the color images of this book

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from the book's page on the Packt Publishing website.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books, maybe a mistake in the text or the code, we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at
with a link to the suspected pirated material. We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at
, and we will do our best to address the problem.

Chapter 1. Developing Hive

In this chapter, we will cover the following recipes:

* Deploying Hive on a Hadoop cluster
* Deploying Hive Metastore
* Installing Hive
* Configuring HCatalog
* Understanding different components of Hive
* Compiling Hive from source
* Hive packages
* Debugging Hive
* Running Hive
* Changing configurations at runtime

Introduction

Hive, an Apache Hadoop ecosystem component, was developed by Facebook to query the data stored in the Hadoop Distributed File System (HDFS). Here, HDFS is the data storage layer of Hadoop that, at a very high level, divides the data into small blocks (default 128 MB) and stores these blocks on different nodes. Hive provides a SQL-like query model named Hive Query Language (HQL) to access and analyze big data. It is also termed the data warehousing framework of Hadoop and provides various analytical features, such as windowing and partitioning.

Deploying Hive on a Hadoop cluster

Hive is supported by a wide variety of platforms. GNU/Linux and Windows are commonly used as the production environment, whereas Mac OS X is commonly used as the development environment.

Getting ready

In this book, we will assume a GNU/Linux-based installation of Apache Hive for installation and other instructions. Before installing Hive, the first step is to make sure that a Java SE environment is installed properly. Hive requires version 6 or later, which can be downloaded from http://www.oracle.com/technetwork/java/javase/downloads/index.html.

How to do it...

To install Hive, just download it from http://Hive.apache.org/downloads.html and unpack it. Choose the latest stable version.

Note

At the time of writing this book, Hive 1.2.1 was the latest stable version available.

How it works...

By default, Hive is configured to use an embedded Derby database whose disk storage location is determined by the Hive configuration variable named javax.jdo.option.ConnectionURL. By default, this location is set to metastore_db in the conf/hive-default.xml file. Hive with Derby as the metastore in embedded mode allows at most one user at a time.

The other modes of installation are Hive with a local metastore and Hive with a remote metastore, which will be discussed later.

Deploying Hive Metastore

Apache Hive is a client-side library that provides a table-like abstraction on top of the data in HDFS for data processing. Hive jobs are converted into a MapReduce plan, which is then submitted to the Hadoop cluster. The Hadoop cluster is the set of nodes or machines with HDFS, MapReduce, and YARN deployed on these machines. MapReduce works on the distributed data stored in HDFS and processes a large dataset in parallel, as compared with traditional processing engines that process the whole task on a single machine and wait for hours or days for a single query. Yet Another Resource Negotiator (YARN) is used to manage the RAM and CPU cores of the whole cluster, which are critical for running any process on a node.

The Hive table and database definitions, and their mapping to the data in HDFS, are stored in a metastore. A metastore is a central repository for Hive metadata. A metastore consists of two main components, which are really important for working on Hive. Let's take a look at these components:

* Services to which the client connects and queries the metastore
* A backing database to store the metadata

Getting ready

In this book, we will assume a GNU/Linux-based installation of Apache Hive for installation and other instructions.
Before installing Hive, the first step is to make sure that a Java SE environment is installed properly. Hive requires version 6 or later, which can be downloaded from http://www.oracle.com/technetwork/java/javase/downloads/index.html.

How to do it...

In Hive, a metastore (service and RDBMS database) could be configured in one of the following ways:

* An embedded metastore
* A local metastore
* A remote metastore

When we install Hive on a preinstalled Hadoop cluster, Hive, by default, gets the embedded database. This means that we need not configure any database as a Hive metastore. Let's check out what these configurations are and why we call them the embedded and remote metastore.

By default, the metastore service and the Hive service run in the same JVM. Hive needs a database to store metadata. In default mode, it uses an embedded Derby database stored on the local file system. The embedded mode of Hive has the limitation that only one session can be opened at a time from the same location on a machine, as only one embedded Derby database can get the lock and access the database files on disk.

[Figure: the embedded, local, and remote metastore configurations, showing the Hive service JVM, the metastore service, and the backing database (embedded Derby or MySQL) in each mode.]

An embedded metastore has a single service and a single JVM that cannot work with multiple nodes at a time.

To solve this limitation, a separate RDBMS database runs on the same node. The metastore service and Hive service still run in the same JVM. This configuration mode is named local metastore. Here, local means the same environment of the JVM machine as well as the service running on the same node.

There is one more configuration, where one or more metastore servers run in a separate JVM process to the Hive service, connecting to a database on a remote machine. This configuration is named remote metastore. The Hive service is configured to use a remote metastore by setting hive.metastore.uris to the metastore server URIs, separated by commas.

The Hive metastore could be configured using the properties specified in the following sections. In the following diagram, the pictorial representation of the metastore and driver is given:

[Figure: the driver and metastore running on separate JVMs on separate servers.]
hive.metastore.warehouse.dir (value: /user/Hive/warehouse)
The directory relative to fs.default.name where managed tables are stored.

hive.metastore.uris
The URIs specifying the remote metastore servers to connect to. If there are multiple remote servers, clients connect in a round-robin fashion.

javax.jdo.option.ConnectionURL (value: jdbc:derby:;databaseName=hivemetastore;create=true)
The JDBC connection URL of the database.

javax.jdo.option.ConnectionDriverName (value: org.apache.derby.jdbc.EmbeddedDriver)
The JDBC driver class name.

javax.jdo.option.ConnectionUserName (value: username)
The metastore username to connect with.

javax.jdo.option.ConnectionPassword (value: password)
The metastore password to connect with.
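As an illustration of the remote case, a Hive client that should talk to a standalone metastore service points hive.metastore.uris at it in hive-site.xml. The following is a minimal sketch only; the host name metastore-host is a placeholder, and 9083 is the port the metastore service listens on by default:

<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore-host:9083</value>
  <description>Comma-separated list of remote metastore server URIs</description>
</property>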
Installing Hive

We will now take a look at installing Hive along with all the prerequisites.

Getting ready

Let's download the stable version from one of the mirrors:

$ wget http://a.mbbsindia.com/hive/hive-1.2.1/apache-hive-1.2.1-bin.tar.gz

How to do it...

This can be achieved in three ways.

Hive with an embedded metastore

Once you have downloaded the Hive tarball file, installing and setting up Hive is pretty simple and straightforward. Extract the compressed tar:

$ tar -xzvf apache-hive-1.2.1-bin.tar.gz

Export the location where Hive is extracted as the environment variable HIVE_HOME:

$ cd apache-hive-1.2.1-bin
$ export HIVE_HOME=$(pwd)

Hive has all its installation scripts in the $HIVE_HOME/bin directory. Export this location to the PATH environment variable so that you can run all scripts from any location directly from a command-line:

$ export PATH=$HIVE_HOME/bin:$PATH

Alternatively, if you want to set the Hive path permanently for the user, then make the entry of the Hive environment variables in the .bashrc or .bash_profile file available, or create it in the user's home folder:

1. Add the following to ~/.bash_profile:

export HIVE_HOME=/home/hduser/apache-hive-1.2.1-bin
export PATH=$PATH:$HIVE_HOME/bin

Here, hduser is the name of the user with which you have logged in, and apache-hive-1.2.1-bin is the Hive directory extracted from the tar file.

2. Run Hive from a terminal:

hive

3. Make sure that the Hive node has a connection to the Hadoop cluster, which means Hive should be installed on one of the Hadoop nodes, or the Hadoop configurations should be available in the node's classpath.

4. This installation uses the embedded Derby database and stores the data on the local filesystem. Only one Hive session can be open on the node.

5. If different users try to run the Hive shell, the second one would get the Failed to start database 'metastore_db' error.

6. Run Hive queries for the datastore to test the installation:

hive> SHOW TABLES;
hive> CREATE TABLE sales(id INT, product String, age INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

7. Logs are generated on a per-user basis in the /tmp/<username> folder.
Hive with a local metastore

Follow these steps to configure Hive with a local metastore. Here, we are using the MySQL database as a metastore:

1. Add the following to ~/.bash_profile:

export HIVE_HOME=/home/hduser/apache-hive-1.2.1-bin
export PATH=$PATH:$HIVE_HOME/bin

Here, hduser is the user name, and apache-hive-1.2.1-bin is the Hive directory extracted from the tar file.

2. Install a SQL database such as MySQL on the same machine where you want to run Hive.

3. For Ubuntu, MySQL could be installed by running the following command on the node's terminal:

sudo apt-get install mysql-server

4. In the case of MySQL, Hive needs the mysql-connector jar. Download the latest mysql-connector jar from http://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.35.tar.gz and copy it to the lib folder of your Hive home directory.

5. Create a file, hive-site.xml, in the conf folder of Hive and add the following entries to it:
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/metastore_db?createDatabaseIfNotExist=true</value>
  <description>metadata is stored in a MySQL server</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>MySQL JDBC driver class</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hduser</value>
  <description>user name for connecting to mysql server</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>passwd</value>
  <description>password for connecting to mysql server</description>
</property>
6. Run Hive from the terminal:

hive

Note

There is a known JLine jar conflict issue with Hadoop 2.6.0 and Hive 1.2.1. If you are getting the error "unable to load class jline.Terminal," you need to remove the older version of the jline jar from the YARN lib folder using the following command:

sudo rm -r $HADOOP_PREFIX/share/hadoop/yarn/lib/jline-0.9.94.jar

Hive with a remote metastore

Follow these steps to configure Hive with a remote metastore:

1. Download the latest version of Hive from http://a.mbbsindia.com/hive/hive-1.2.1/apache-hive-1.2.1-bin.tar.gz.

2. Extract the package:

tar -xzvf apache-hive-1.2.1-bin.tar.gz

3. Add the following to ~/.bash_profile:

sudo nano ~/.bash_profile
export HIVE_HOME=/home/hduser/apache-hive-1.2.1-bin
export PATH=$PATH:$HIVE_HOME/bin

Here, hduser is the user name and apache-hive-1.2.1-bin is the Hive directory extracted from the tar file.

4. Install a SQL database such as MySQL on a remote machine to be used for the metastore.

5. For Ubuntu, MySQL can be installed with the following command:

sudo apt-get install mysql-server

6. In the case of MySQL, Hive needs the mysql-connector jar file. Download the latest mysql-connector jar from http://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.35.tar.gz and copy it to the lib folder of your Hive home directory.

7. Add the following entries to hive-site.xml:
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://<mysql_host>:3306/metastore_db?createDatabaseIfNotExist=true</value>
  <description>metadata is stored in a MySQL server</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>MySQL JDBC driver class</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hduser</value>
  <description>user name for connecting to mysql server</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>passwd</value>
  <description>password for connecting to mysql server</description>
</property>
8. Start the Hive metastore interface:

bin/hive --service metastore &

9. Run Hive from the terminal:

hive

10. The Hive metastore interface by default listens at port 9083:

netstat -an | grep 9083

11. Start the Hive shell and make sure that the Hive Data Definition Language and Data Manipulation Language (DDL or DML) operations are working by creating tables in Hive.

Note

There is a known JLine jar conflict issue with Hadoop 2.6.0 and Hive 1.2.1. If you are getting the error "unable to load class jline.Terminal," you need to remove the older version of the jline jar from the YARN lib folder using the following command:

sudo rm -r $HADOOP_PREFIX/share/hadoop/yarn/lib/jline-0.9.94.jar

Configuring HCatalog

Assuming that Hive has been configured with the remote metastore, let's look into how to install and configure HCatalog.

Getting ready

The HCatalog CLI supports the following command-line options:

hcat -g mygrp: The HCatalog table that needs to be created must have the group "mygrp".
hcat -p rwxrwxr-x: The HCatalog table that needs to be created must have the permissions "rwxrwxr-x".
hcat -f myscript.hcat: Tells HCatalog that myscript.hcat is a file containing DDL commands to execute.
hcat -e 'create table mytable(a int);': Treats the given string as a DDL command and executes it.
hcat -Dkey=value: Passes the key-value pair to HCatalog as a Java system property.
hcat: Prints a usage message.

How to do it...

From Hive 0.11.0 onward, HCatalog is packaged with the Hive binaries. Because we have already configured Hive, we can access the HCatalog command-line tool with the hcat command on the shell. The script is available in the hcatalog/bin directory.

Understanding different components of Hive

Besides the Hive metastore, Hive components can be broadly classified as Hive clients and Hive servers. Hive servers provide interfaces to make the metastore available to external applications and check for users' authorization and authentication, while Hive clients are the various applications used to access and execute Hive queries on the Hadoop cluster.

HiveServer

Let's take a look at its various components.

Hive metastore

The Hive metastore URIs start a metastore service on the specified port. The metastore provides APIs to query the databases, tables, schema, and other entities stored in the RDBMS datastore.

How to do it...

The metastore service starts as a Java process in the backend. You can start the Hive metastore service with the following command:

hive --service metastore &

HiveServer2

HiveServer2 is an interface that allows clients to execute Hive queries and get the results. It is based on Thrift RPC and supports multiple clients, as against a single client in HiveServer. It also provides for the authentication and authorization of the user.

How to do it...

The HiveServer2 service also starts as a Java process in the backend. You can start HiveServer2 with the following command:

hive --service hiveserver2 &

Hive clients

The following are the different clients available in Hive to query metastore data or to submit Hive queries to Hive servers.

Hive CLI

The following are the various sections included in Hive CLI.

Getting ready

The Hive Command-line Interface (CLI) can be used to run Hive queries in either interactive or batch mode.

How to do it...

To run the Hive CLI, use the following command:

$ $HIVE_HOME/bin/hive

Queries are submitted by the username of the user logged in to the UNIX system.
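For batch mode, the same binary accepts a query string or a script file. The two invocations below are a small sketch of this; the sales table is the one created earlier in this chapter, and the script path is only an illustration:

# run a single query in batch mode and exit
$ $HIVE_HOME/bin/hive -e "SELECT * FROM sales;"

# run all statements in a script file
$ $HIVE_HOME/bin/hive -f /opt/hivescript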
Beeline

The following are the various sections included in Beeline.

Getting ready

If you have configured HiveServer2, then a Beeline client can be used to interact with Hive.

How to do it...

To run Beeline, use the following command:

$ $HIVE_HOME/bin/beeline

Using Beeline, a connection can be made to any HiveServer2 instance with any username and password.

Compiling Hive from source

In this recipe, we will see how to compile Hive from source.

Getting ready

Apache Hive is an open source framework available for compilation and modification by any user. The Hive source code is a maven project. The source has intermittent scripts executed on a UNIX platform during compilation. The following prerequisites need to be installed:

* UNIX OS: UNIX is preferable for Hive source compilation. Although the source could also be compiled on Windows, you need to comment out the intermittent scripts execution.
* Maven: The following are the steps to configure maven:

1. Download the Apache maven binaries for Linux (.tar.gz) from https://maven.apache.org/download.cgi:

wget http://mirror.olnevhost.net/pub/apache/maven/maven-3/3.3.3/binaries/apache-maven-3.3.3-bin.tar.gz

2. Extract the tar file:

tar -xzvf apache-maven-3.3.3-bin.tar.gz

3. Create a folder and move the maven binaries to that folder:

sudo mkdir -p /usr/lib/maven
mv apache-maven-3.3.3-bin /usr/lib/maven/

4. Open /etc/environment:

sudo nano /etc/environment

5. Add the following variables for the environment PATH:

export M2_HOME=/usr/lib/maven/apache-maven-3.3.3-bin
export M2=$M2_HOME/bin
export PATH=$M2:$PATH

6. Use the command source /etc/environment to add the variables to PATH without restarting:

source /etc/environment

7. Check whether maven is properly installed or not:

mvn -version

How to do it...

Follow these steps to compile Hive on a UNIX OS:

1. Download the latest version of the Hive source tar file:

sudo wget http://a.mbbsindia.com/hive/hive-1.2.1/apache-hive-1.2.1-src.tar.gz

2. Extract the source folder:

tar -xzvf apache-hive-1.2.1-src.tar.gz

3. Move to the Hive directory:

cd apache-hive-1.2.1-src

4. To import Hive packages in Eclipse, run the following command:

mvn eclipse:eclipse

5. To compile Hive with Hadoop 2 binaries, run the following command:

mvn clean install -Phadoop-2,dist

6. In case you want to skip the tests execution, run the earlier command with the following switch:

mvn clean install -DskipTests -Phadoop-2,dist

7. To generate a tarball file from the source code, run the following command:

mvn clean package -DskipTests -Phadoop-2 -Pdist

Hive packages

The following are the various sections included in Hive packages.

Getting ready

The Hive source consists of different modules categorized by the features they provide or as a submodule of some other module.

How to do it...

The following is the list of Hive modules and their usage in Hive:

* accumulo-handler: Apache Accumulo is a distributed key-value datastore based on Google Big Table. This package includes the components responsible for mapping the Hive table to the Accumulo table. AccumuloStorageHandler and AccumuloPredicateHandler are the main classes responsible for mapping tables. For more information, refer to the official integration documentation available at https://cwiki.apache.org/confluence/display/Hive/AccumuloIntegration.
* ant: This tool is used to build earlier versions of the Hive source. Ant is also needed to configure the Hive Web Interface server.
* beeline: A Hive client used to connect with HiveServer2 and run Hive queries.
* bin: This package includes scripts to start Hive clients and services.
* cli: This is the Hive Command-line Interface implementation.
* common: These are utility classes used by other modules.
* conf: This contains default configurations and user-defined configuration objects.
* contrib: This contains SerDes, generic UDFs, and file formats contributed by third parties to Hive.
* hbase-handler: This module allows Hive SQL statements to access HBase tables for SELECT and INSERT commands. It also provides interfaces to access HBase and Hive tables for join and union in a single query. More information is available at https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration.
* hcatalog: This is a table management framework that helps other frameworks such as Pig or MapReduce to access the Hive metastore and table schema.
* hwi: This module provides an implementation of a web interface to run Hive queries. Also, the WebHCat APIs provide REST APIs to access the Hive metastore.
* jdbc: This is a connector that accepts JDBC connections and calls to execute Hive queries on the cluster.
* metastore: This is the API that provides access to metastore entities including database, table, schema, and serdes.
* odbc: This module implements the Open Database Connectivity (ODBC) API, enabling ODBC applications to connect and execute queries over Hive.
* ql: This module provides an interface to clients that checks for query semantics and provides an implementation for the driver, parser, and query planner.
* serde: This module has an implementation of the serializers and deserializers used by Hive to read and write data. It helps in validating and parsing record and field types.
* shims: This is the module that transparently intercepts and modifies calls to the Hive API, usually for compatibility purposes.
* spark-client: This module provides an interface to execute Hive SQL on a Spark framework.

Debugging Hive

Here, we will take a quick look at the command-line debugging option in Hive.

Getting ready

Hive code could be debugged by assigning a port to Hive and adding socket details to the Hive JVM. To add the debugging configuration to Hive, execute the following properties on an OS terminal or add them to the bash_profile of the user:

export HIVE_DEBUG_PORT=8000
export HIVE_DEBUG="-Xdebug -Xrunjdwp:transport=dt_socket,address=${HIVE_DEBUG_PORT},server=y,suspend=y"

How to do it...

Once a debug port is attached to Hive and Hive server suspension is enabled at startup, the following steps will help you debug Hive queries:

1. After defining the previously mentioned properties, run the Hive CLI in debug mode:

hive --debug

2. If you have written your own test class and want to execute the unit test cases written in that class, then you need to execute the following command, specifying the class name you want to execute:

mvn test -Dtest=ClassName

Running Hive

Let's see how to run Hive from the command-line.

Getting ready

Once you have the binaries of Hive, either compiled or downloaded, you need to configure a metastore for Hive where it keeps information about different entities. Once that is configured, start the Hive metastore and HiveServer2 to access the entities from different clients.

How to do it...

Follow these steps to start different components of Hive on a node:
1. Run the Hive CLI:

$HIVE_HOME/bin/hive

2. Run HiveServer2 and Beeline:

$HIVE_HOME/bin/hiveserver2
$HIVE_HOME/bin/beeline -u jdbc:hive2://$HiveServer2_HOST:$HiveServer2_PORT

3. Run HCatalog and start up the HCatalog server:

$HIVE_HOME/hcatalog/sbin/hcat_server.sh

4. Run the HCatalog CLI:

$HIVE_HOME/hcatalog/bin/hcat

5. Run WebHCat:

$HIVE_HOME/hcatalog/sbin/webhcat_server.sh

Changing configurations at runtime

Let's see how we can change various configuration settings at runtime.

How to do it...

Follow these steps to change any of the Hive configuration properties at runtime for a particular session or query:

1. Configuration for Hive and the underlying MapReduce could be changed at runtime through Beeline or the CLI. The general syntax to set a property is as follows:

SET key=value;

2. The configuration set is only applicable for that session. If you want to set it permanently, then you need to set it in hive-site.xml. The examples are as follows:

beeline> SET mapred.job.tracker=example.host.com:50030;
hive> SET hive.exec.mode.local.auto=false;

Chapter 2. Services in Hive

In the previous chapter, you learned how we could install Hive with different metastore configurations. We also went through Hive clients and Hive services in brief. In this chapter, we will cover the following recipes in detail:

* Introducing HiveServer2
* Understanding HiveServer2 properties
* Configuring HiveServer2 high availability
* Using HiveServer2 clients
* Introducing the Hive metastore service
* Configuring high availability of metastore service
* Introducing Hue

Introducing HiveServer2

HiveServer2 is an enhancement of the HiveServer provided in earlier versions of Hive. The major limitations of HiveServer related to concurrency and authentication are resolved in HiveServer2. HiveServer2 is based on Thrift RPC. It supports multiple types of clients, including JDBC and ODBC.

How to do it...

This recipe assumes that you have installed Hive on your machine, as explained in Chapter 1, Developing Hive. Before starting HiveServer2, you need to add the following property to hive-site.xml:
<property>
  <name>hive.server2.thrift.port</name>
  <value>10000</value>
  <description>TCP port number to listen on, default 10000</description>
</property>
Starting HiveServer2 is easy. All you need to do is run the following command on the terminal of your machine:

# hive --service hiveserver2 &

How it works...

Let's look into the series of actions that starts with HiveServer2:

* A Java service is started on the default port 10000
* Minimum worker threads are initialized with 5
* Maximum worker threads are set to 500
* The background operation thread pool size is initialized with 100
* The background operation thread wait queue size is initialized with 100
* The background operation thread keep-alive time is set to 10 seconds

See also

For more information about HiveServer2 configuration, refer to the next recipe, Understanding HiveServer2 properties.

Understanding HiveServer2 properties

By default, HiveServer2 is started with default configurations. The configurations are mainly related to the port and host on which the server is going to start and the number of threads that could be configured for client and background operations.

How to do it...

You can change the default properties for HiveServer2 by overriding the values in hive-site.xml in the conf folder of the Hive package:

hive.server2.thrift.port (default: 10000): HiveServer2 thrift interface port
hive.server2.thrift.bind.host (default: localhost): HiveServer2 bind host
hive.server2.thrift.min.worker.threads (default: 5): Minimum thrift worker threads
hive.server2.thrift.max.worker.threads (default: 500): Maximum thrift worker threads
hive.server2.authentication (default: NONE): NONE/LDAP/KERBEROS/PAM/NOSASL
hive.server2.authentication.kerberos.keytab (default: ""): A keytab file for the Kerberos principal
hive.server2.authentication.kerberos.principal (default: ""): The Kerberos principal
hive.server2.enable.doAs (default: true): Execute Hive operations as the user making the call
hive.server2.authentication.ldap.url (default: ""): LDAP connection URLs
hive.server2.authentication.ldap.baseDN (default: ""): The LDAP base DN
hive.server2.authentication.ldap.Domain (default: ""): The LDAP domain
hive.server2.thrift.http.port (default: 10001): Port number in HTTP mode

How it works...

When you override the configurations in hive-site.xml and restart HiveServer2, it reads the updated properties. For example, you can define HiveServer2 to start on a port other than the default 10000 by defining the following property:
<property>
  <name>hive.server2.thrift.port</name>
  <value>11111</value>
</property>
When you restart HiveServer2, it starts listening on the new port 11111.

See also

For more details about HiveServer2 configuration, you can refer to the Hive online documentation on the Apache Hive wiki (cwiki.apache.org).

Configuring HiveServer2 high availability

HiveServer2, for a cluster of thousands of nodes, could be a single point of failure if it is not configured with high availability in mind. If the HiveServer2 service goes down, none of the clients would be able to access the metastore or submit Hive queries to the cluster. To overcome this limitation, high availability of HiveServer2 is configured. It needs a ZooKeeper quorum running on a set of nodes. ZooKeeper is an open source centralized service for providing coordination between distributed applications. It is also used to store common configuration and metadata to provide distributed synchronization. Hive uses ZooKeeper to store configuration information to provide high availability of HiveServer2.

Getting ready

For configuring high availability of HiveServer2, you will need a ZooKeeper quorum running.

Tip

The installation of ZooKeeper is not in the scope of this book. You can refer to the following links for the installation of ZooKeeper:

* For ZooKeeper's installation in the standalone mode: http://www.protechskills.com/big-data/hadoop-ecosystem/zookeeper/zookeeper-standalone-installation
* For ZooKeeper's installation in the distributed mode: http://www.protechskills.com/big-data/hadoop-ecosystem/zookeeper/zookeeper-clustered-mode-installation

How to do it...

For enabling HiveServer2 high availability with ZooKeeper, you need to set the following properties in hive-site.xml:
<property>
  <name>hive.zookeeper.quorum</name>
  <description>Comma separated list of ZooKeeper quorum</description>
</property>
<property>
  <name>hive.zookeeper.session.timeout</name>
  <description>ZooKeeper client's session timeout in milliseconds</description>
</property>
<property>
  <name>hive.server2.support.dynamic.service.discovery</name>
  <value>true</value>
</property>
<property>
  <name>hive.server2.zookeeper.namespace</name>
  <value>hiveserver2</value>
</property>
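With dynamic service discovery enabled, clients no longer connect to a fixed HiveServer2 host; they ask ZooKeeper which instances are registered. The following Beeline connection string is a sketch of that pattern, assuming a three-node ZooKeeper ensemble on the placeholder hosts zk_host1, zk_host2, and zk_host3 and the hiveserver2 namespace configured above:

beeline -u "jdbc:hive2://zk_host1:2181,zk_host2:2181,zk_host3:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2"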
How it works...

If more than one HiveServer2 instance is registered with ZooKeeper and all instances fail except one, ZooKeeper passes the link to the instance that is running, so that the client can connect successfully with the running HiveServer2. ZooKeeper doesn't control the autostart of services of failed instances, so if any HiveServer2 instance goes down, it must be restarted manually.

See also

To read more about HiveServer2 high availability, you can refer to the Hortonworks documentation at http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_hadoop-ha/content/ha-hs2-service-

Using HiveServer2 clients

Once we have started HiveServer2, we can connect to the server with different clients as per our requirements and run Hive Query Language (HiveQL). The different clients include Beeline, JDBC, ODBC, and so on. We will go through each client in detail.

Getting ready

This recipe requires Hive installed as described in the Installing Hive recipe of Chapter 1, Developing Hive. For connecting with HiveServer2 using a client, you must run HiveServer2, as described in the Introducing HiveServer2 recipe in this chapter.

How to do it...

There are multiple ways of connecting with HiveServer2, as described in the following sections.

Beeline

Beeline is a shell client that could be executed from the terminal by running the following command:

beeline

Once you enter the Beeline shell, you can make a connection to the HiveServer2 service as a user using the following command:

!connect jdbc:hive2://localhost:10000 scott tiger org.apache.hive.jdbc.HiveDriver

If the connection is successful, then further SQL queries could be executed in the same way as on the Hive shell.

The following is the set of common commands you can execute from Beeline:

reset: This changes all settings to their default values
set <key>=<value>: This sets a value for a particular key
set: This displays the list of all overridden settings
set -v: This displays all Hive and Hadoop configurations
add FILE[S] <filepath>*, add JAR[S] <filepath>*, add ARCHIVE[S] <filepath>*: This adds files or jars to the distributed cache of Hadoop
list FILE[S], list JAR[S], list ARCHIVE[S]: This lists files or jars available in the distributed cache
delete FILE[S] <filepath>*, delete JAR[S] <filepath>*, delete ARCHIVE[S] <filepath>*: This deletes files or jars from the distributed cache
dfs <command>: This runs HDFS commands from Beeline
<query string>: This runs Hive queries

Beeline command options

While running the beeline command, there are different options available that you can use directly with the beeline CLI:

-u <database url>: The JDBC URL, for example, jdbc:mysql://localhost:3306/mydb
-n <username>: The username
-p <password>: The user password
-d <driver class>: The driver class
-e <query>: The query, given in double quotes
-f <file>: The script file to be executed
--showHeader=[true/false]: Whether to show column names in the result
--delimiterForDSV=DELIMITER: The delimiter for the query output stream; the default is '|'

These are the commonly used options. For more options, type beeline --help on your terminal. The following are examples of the beeline command options:

1. Running Hive queries:

beeline -u 'jdbc:hive2://localhost:10000/default' -n root -p xxx -d org.apache.hive.jdbc.HiveDriver -e "select * from sales;"

Here, "default" is the database name; also, replace localhost with the IP of your HiveServer2 node.

2. Running Hive scripts:

beeline -u 'jdbc:hive2://localhost:10000/default' -n root -p xxx -d org.apache.hive.jdbc.HiveDriver -f /opt/hivescript

JDBC

A JDBC client allows connection to HiveServer2 from Java code. The JDBC connection could be made in Remote, Embedded, or HTTP mode. The following are the configurations for the modes:

* The connection URL for Remote or Embedded mode:
  * For a remote server, the URL format is jdbc:hive2://<host>:<port>/<db> (the default port for the HiveServer2 service is 10000)
  * For an embedded server, the URL format is jdbc:hive2:// (no host or port)
* The connection URL when HiveServer2 is running in HTTP mode. The JDBC connection URL is:

jdbc:hive2://<host>:<port>/<db>?hive.server2.transport.mode=http;hive.server2.thrift.http.path=<http_endpoint>

The following is a description of the JDBC connection URL:

* <http_endpoint> is the corresponding HTTP endpoint configured in hive-site.xml. The default value is cliservice.
* The default port for HTTP transport mode is 10001.

Once the connection mode is set, the JDBC connection in Java code could be made with the following steps:

1. Load the JDBC driver class:

Class.forName("org.apache.hive.jdbc.HiveDriver");

2. Connect to the database by creating a Connection object with the JDBC driver:

Connection conn = DriverManager.getConnection("jdbc:hive2://<host>:<port>", "<username>", "<password>");

Here, the default port is 10000 and the password is ignored if HiveServer2 is running in nonsecure mode.

3. Execute your query as follows:

Statement stmt = conn.createStatement();
ResultSet rset = stmt.executeQuery("SELECT fname FROM sales");

4. Process the result as returned in the ResultSet.

JDBC client sample code using Eclipse

The following are the steps to create and execute the Hive JDBC client in Eclipse:

1. Create a Maven project in Eclipse by going to File | New | Project.
2. Search for the Maven project by typing maven in the search box, click on Maven Project, and then click on the Next button to continue.
3. Provide the Group Id and Artifact Id for your project, then select 0.0.1-SNAPSHOT as the Version.
4. Add the following dependencies in pom.xml:
<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-jdbc</artifactId>
  <version>1.2.1</version>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <version>2.6.0</version>
</dependency>

5. Create the class HiveClient by following these steps:

1. Right-click on src/main/java and then navigate to New | Class to create a new class named HiveClient in your Maven project.
2. In the Name field, type HiveClient and click on Finish.
3. Add the following code to the class HiveClient:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class HiveClient {
    private static String driverName = "org.apache.hive.jdbc.HiveDriver";

    public static void main(String[] args) {
        try {
            Class.forName(driverName);
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
            System.exit(1);
        }
        // replace "root" here with the name of the user the queries should run as
        Connection conn;
        try {
            conn = DriverManager.getConnection("jdbc:hive2://192.168.56.101:10000/default", "root", "");
            Statement stmt = conn.createStatement();
            String table_name = "testtable";
            stmt.execute("drop table if exists " + table_name);
            stmt.execute("create table " + table_name + " (id int, fname string, age int)");

            // 1. show tables
            String sqlQuery = "show tables";
            System.out.println("Running query: " + sqlQuery);
            ResultSet rst = stmt.executeQuery(sqlQuery);
            if (rst.next()) {
                System.out.println(rst.getString(1));
            }

            // 2. describe table
            sqlQuery = "describe " + table_name;
            System.out.println("Executing query: " + sqlQuery);
            rst = stmt.executeQuery(sqlQuery);
            while (rst.next()) {
                System.out.println(rst.getString(1) + "\t" + rst.getString(2));
            }

            // 3. load data into table
            /** filepath is local to the Hive server
                NOTE: /opt/sample_10000.txt is a '\t' separated file with ID and First Name values. */
            String filepath = "/opt/sample_10000.txt";
            sqlQuery = "load data local inpath '" + filepath + "' into table " + table_name;
            System.out.println("Executing query: " + sqlQuery);
            stmt.execute(sqlQuery);

            // 4. select * query
            sqlQuery = "select * from " + table_name;
            System.out.println("Executing query: " + sqlQuery);
            rst = stmt.executeQuery(sqlQuery);
            while (rst.next()) {
                System.out.println(String.valueOf(rst.getInt(1)) + "\t" + rst.getString(2));
            }

            // 5. regular Hive query
            sqlQuery = "select count(*) from " + table_name;
            System.out.println("Running: " + sqlQuery);
            rst = stmt.executeQuery(sqlQuery);
            while (rst.next()) {
                System.out.println(rst.getString(1));
            }
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }
}

Running the JDBC sample code from the command line
To execute the client on a Hadoop/Hive cluster, you need to run the following command:
java -cp $CLASSPATH HiveClient
Here, $CLASSPATH is the path to the Hadoop and Hive home directories.
JDBC datatypes
The following table lists the data types implemented for the HiveServer2 JDBC driver:

Hive Type | Java Type | Specification
TINYINT | byte | A signed or unsigned 1-byte integer
SMALLINT | short | A signed 2-byte integer
INT | int | A signed 4-byte integer
BIGINT | long | A signed 8-byte integer
FLOAT | double | A single-precision number
DOUBLE | double | A double-precision number
DECIMAL | java.math.BigDecimal | A fixed-precision decimal value
BOOLEAN | boolean | A single bit (0 or 1)
STRING | String | A character string
TIMESTAMP | java.sql.Timestamp | A date and time value
BINARY | String | Binary data
ARRAY | String (JSON encoded) | Values of one data type
MAP | String (JSON encoded) | Key-value pairs
STRUCT | String (JSON encoded) | Structured values

Other clients
Languages such as Python or Ruby can also connect to HiveServer2 using the client APIs:
* Python: A Python client driver is available on GitHub at https://github.com/BradRuderman/pyhs2.
* Ruby: A Ruby client driver is available on GitHub at https://github.com/forward3d/rbhive.

Introducing the Hive metastore service
In Hive, the data is stored in HDFS, and the table, database, schema, and other HQL definitions are stored in a metastore. The metastore could be any RDBMS database, such as MySQL or Oracle. Hive creates a database and a set of tables in the metastore to store the HiveQL definitions. There are three modes of configuring a metastore:
* Embedded
* Local
* Remote
The detailed description and configuration steps of the different modes are available in Chapter 1, Developing Hive.

How to do it...
The Hive metastore can be made available as a service. All you need to do is run the following command on the terminal of your machine:
hive --service metastore

How it works...
In the case of a remote metastore configuration, all clients connect to the metastore service to query the underlying datastore (MySQL, Oracle, and so on). The communication is done through the Thrift protocol. At the client's side, a user needs to add the following configurations to make the client connect to a metastore service:

Property | Description
hive.metastore.uris | Used to specify the URI of the metastore server
hive.metastore.warehouse.dir | Used to specify the data location in HDFS. The default value is /user/hive/warehouse

If MySQL is used as the metastore, then the user needs to add the MySQL connector JAR to the lib folder of Hive. Also, if you want to change the metastore port, you need to start the metastore with the following command:
hive --service metastore -p <port_number>
If you want to run the metastore as a background service, append & at the end of the command:
hive --service metastore -p <port_number> &
You can verify whether the metastore service is running or not using the jps command; it runs as the RunJar service.

Configuring high availability of metastore service
The Hive metastore service is a single point of communication between the different clients and the metastore data. If the metastore service is down or unavailable, then clients cannot run any HiveQL, as the metastore data is not accessible.

How to do it...
The High Availability solution is designed to provide failover control of the Hive metastore service. To configure the metastore in High Availability mode, you need to concurrently start the metastore service on multiple machines. Every client will read the hive.metastore.uris property from the configuration file. The property can have a comma-separated list of machines on which metastore services are running:
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://$Hive_Metastore_Server_Host_Machine_FQDN</value>
  <description>A comma separated list of metastore uris on which metastore service is running.</description>
</property>
Here, $Hive_Metastore_Server_Host_Machine_FQDN is the comma-separated list of fully qualified domain names of the machines on which the Hive metastores are running.

Introducing Hue
Hadoop User Experience (Hue) is an open source web interface for analyzing data with Hadoop and its ecosystem components. Hue brings together the most common Apache Hadoop components into a single web interface. Its main goal is to allow users to use Hadoop without worrying about the underlying complexity or using a command-line interface.

Getting ready
The following are the major features of Hue:
* The file browser for HDFS
* The job browser for MapReduce or YARN
* Query editors for Apache Hive
* Query editors for Apache Pig
* The Apache Sqoop2 editor
* The Apache ZooKeeper browser
* The Apache HBase browser
We will focus on the query editors for Apache Hive.

How to do it...
To run Hue, there are various steps that need to be followed. The installation of Hue might seem a little complex, but once Hue is set up, it eases running Hive queries through the web interface without using terminal screens.

Prepare dependencies
If you are installing Hue on Ubuntu, you need to install the following libraries:
sudo apt-get install -y ant
sudo apt-get install -y gcc g++
sudo apt-get install -y libkrb5-dev libmysqlclient-dev
sudo apt-get install -y libssl-dev libsasl2-dev libsasl2-modules-gssapi-mit
sudo apt-get install -y libsqlite3-dev
sudo apt-get install -y libtidy-0.99-0 libxml2-dev libxslt-dev
sudo apt-get install -y maven
sudo apt-get install -y libldap2-dev
sudo apt-get install -y python-dev python-simplejson python-setuptools python-ldap
sudo apt-get install -y libgmp3-dev

If you are installing Hue on CentOS, you need to install and configure the following libraries:
sudo yum install -y ant
sudo yum install -y gcc g++
sudo yum install python-devel.x86_64
sudo yum groupinstall "Development Tools"
sudo yum install krb5-devel
sudo yum install libxslt-devel libxml2-devel
sudo yum install mysql-devel.x86_64
sudo yum install ncurses-devel zlib-devel texinfo gtk+-devel gtk2-devel qt-devel tcl-devel tk-devel kernel-headers kernel-devel
sudo yum install gmp-devel.x86_64
sudo yum install sqlite-devel.x86_64
sudo yum install cyrus-sasl.x86_64
sudo yum install postfix system-switch-mail cyrus-imapd cyrus-plain cyrus-md5 cyrus-utils
sudo yum install libevent libevent-devel
sudo yum install memcached-devel.x86_64
sudo yum install postfix
sudo yum install cyrus-sasl
sudo yum install cyrus-imapd
sudo yum install openldap-devel

Note
Installing the latest version of Maven is necessary.

Downloading and installing Hue
1. Download the latest version of Hue (which at the time of writing was 3.9). Run the following command on the machine on which Hue is to be installed:
wget https://dl.dropboxusercontent.com/u/730827/hue/releases/3.9.0/hue-3.9.0.tgz
2. Extract the Hue package using the following command:
tar -xzvf hue-3.9.0.tgz
3. Install Hue using the following command:
sudo make install
By default, Hue installs to /usr/local/hue in your node's local filesystem. Running the previous command produces build logs that end with lines similar to the following:
807 static files copied to '/usr/local/hue/build/static', 907 post-processed.
make[1]: Leaving directory '/usr/local/hue/apps'
--- Setting up Desktop database
make -C /usr/local/hue/desktop syncdb
make[1]: Entering directory '/usr/local/hue/desktop'
make[1]: Nothing to be done for 'syncdb'.
make[1]: Leaving directory '/usr/local/hue/desktop'

Note
Check carefully at the end of the logs that there is no error message.

The default ownership of Hue files and folders is set to the root user. Let's change the Hue permissions so that it can run without root permissions:
sudo chown -R hadoop:hadoop /usr/local/hue
Here, hadoop is the username and the group name too.

Configuring Hive with Hue
Hue contains a configuration file named hue.ini located at /usr/local/hue/desktop/conf/hue.ini. To point Hue at HiveServer2, edit the [beeswax] section:
[beeswax]
# Host where HiveServer2 is running at:
hive_server_host=localhost

Note
Replace localhost with the hostname to point to Hive running on another machine.

Starting Hue
We start the Hue server using the supervisor command in Hue's bin directory:
cd /usr/local/hue/build/env/bin
./supervisor

Note
Use the -d switch to start the Hue supervisor in daemon mode (as a background process):
./supervisor -d

Accessing Hive with Hue
To access Hive via Hue, the URL format will be http://<HOST_NAME>:8888. Here, HOST_NAME is the IP address or hostname of the machine running Hue; for example, http://192.168.56.111:8888. This would prompt for a username and a password. You can give any username and password of your choice; remember them for future reference. For example, you can use the username admin and the password admin the first time. After a successful login to the Hue web interface, it describes the configuration of the various supported components.

For accessing Hive through the Hue web interface, HiveServer2 must be running. To access the Hive editor in Hue, click on Hive under the Query Editors tab.

The left-hand side panel shows the list of databases and all tables of the selected database in the drop-down. In the right-hand side panel, you can execute any Hive query and retrieve the result. There is one table named sales in the default database. Now let's execute a query in Hue to retrieve the result. For demonstration purposes, let's retrieve the first and last names of 10 users from the sales table:
SELECT fname, lname FROM sales LIMIT 10;
After clicking on the Execute button, you get the names of 10 users from the sales table; the result is shown on the same page. Similarly, you can execute any Hive query through this web interface without using terminal screens. You can also check the history of recently executed queries through Hue: in the right-hand panel, there is a Recent queries tab where you can see all commands that have recently been executed, and you can check the results of a particular query by clicking on See results.

Chapter 3. Understanding the Hive Data Model
In this chapter, we will cover the following recipes:
* Using numeric data types
* Using string data types
* Using Date/Time data types
* Using miscellaneous data types
* Using complex data types
* Using operators
* Partitioning
* Partitioning a managed table
* Partitioning an external table
* Bucketing

Introduction
In previous chapters, you learned about the installation of different Hive components, such as the Hive metastore and HiveServer2, and about working with the different services in Hive. In this chapter, we will cover the following topics:
* Using data types
* Using operators
* Partitioning
* Bucketing

Introducing data types
Hive supports various data types that are primarily divided into two parts:
* Primitive data types
* Complex data types
Hive supports many primitive data types that are similar to those found in relational databases, such as INT, SMALLINT, TINYINT, BIGINT, BOOLEAN, FLOAT, and DOUBLE. In addition to primitive data types, Hive also supports a few complex data types, such as ARRAY, MAP, STRUCT, and UNION.

Primitive data types
Hive supports a large number of primitive data types, which are divided into the following four categories:
* Numeric data types
* String data types
* Date/Time data types
* Miscellaneous data types
All these primitive data types are similar to RDBMS primitive data types.
Complex data types
The following are the three complex data types supported by Hive:
* STRUCT
* MAP
* ARRAY

Using numeric data types
Hive supports a set of data types that can be used for table columns, expression values, and function arguments and return values. In the following table, the primitive numeric data types are listed with their sizes and examples:

Data Type | Size | Example
TINYINT | 1-byte signed integer | 50
SMALLINT | 2-byte signed integer | 20,000
INT | 4-byte signed integer | 1,000
BIGINT | 8-byte signed integer | 50,000
FLOAT | 4-byte single-precision floating point | 1400.50
DOUBLE | 8-byte double-precision floating point | 20,000.50
DECIMAL | 17-byte precision up to 38 digits | DECIMAL(20, 2)

By default, all integral literals are treated as INT values until they cross the range of INT values. If an integral literal crosses the range of the INT value, then it is treated as a BIGINT value. There is a postfix mechanism that is used to specify an integral literal as TINYINT, SMALLINT, or BIGINT. To specify an integral literal as TINYINT, the postfix Y is used; for example, you can specify 50 as 50Y. To specify an integral literal as SMALLINT, the postfix S is used; for example, you can specify 50 as 50S. To specify an integral literal as BIGINT, the postfix L is used; for example, you can specify 50 as 50L.

The DECIMAL data type is defined using fixed precision and scale values. It is very useful for financial and other arithmetic use cases where FLOAT and DOUBLE don't meet all requirements. The DECIMAL data type is defined using a DECIMAL(precision, scale) syntax. The default value of scale is 0, which means no fractional digits, and the default value of precision is 10, which means 10 digits. Therefore, the default DECIMAL with no precision or scale values is equivalent to DECIMAL(10, 0). The precision value must be between 1 and 38. For example, to represent integer values up to 9999 and floating point values up to 99.99, both require a precision of 4. The maximum integral value for a decimal is represented by DECIMAL(38, 0), that is, the digit 9 repeated 38 times.

How to do it...
The following are a few examples of using primitive numeric data types in Hive. The following statement creates a table named customer with id as the BIGINT data type and age as the TINYINT data type:
CREATE TABLE customer (id BIGINT, age TINYINT);
The following statement creates a table named customer_order with id as the BIGINT data type and price as the DECIMAL data type:
CREATE TABLE customer_order (id BIGINT, price DECIMAL(10,2));
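To make the literal postfixes and DECIMAL defaults discussed above concrete, the following is a minimal sketch; the column and table names (price_example, list_price, unit_price) are illustrative rather than part of the original recipe, and, as with the other examples in this chapter, the SELECT assumes a table named test with at least one row:

-- Y, S, and L mark TINYINT, SMALLINT, and BIGINT literals respectively
SELECT 50Y, 50S, 50L FROM test;

-- list_price uses the default DECIMAL(10,0); unit_price can hold values up to 99.99
CREATE TABLE price_example (list_price DECIMAL, unit_price DECIMAL(4,2));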
Using string data types
Hive supports three string data types: STRING, VARCHAR, and CHAR:
* STRING: This is a sequence of characters that can be expressed using single quotes (') as well as double quotes (").
* VARCHAR: This is a variable-length character type. It is defined using a length specifier, which specifies the maximum number of characters allowed in the character string. Its syntax is VARCHAR(max_length). The value of the varchar data type gets truncated during processing, if necessary, to fit within the specified length. When converting a string value to a varchar value, if the string value exceeds the length specifier, then the string is silently truncated. The maximum length of the varchar type is 65535.
* CHAR: This is a fixed-length character type. It is defined in the same way as the varchar type. If the value is shorter than the specified length, then the value is padded with trailing spaces to achieve the specified length. The maximum length of the char type is 255.

Note
The varchar and char data types cannot be used in a nongeneric User-Defined Function (UDF) or User-Defined Aggregate (UDA) function.

How to do it...
The following is an example of using string data types in Hive:
CREATE TABLE customer (id BIGINT, name STRING, sex CHAR(6), role VARCHAR(64));
The preceding statement creates a table, customer, with id as the BIGINT data type, name as the STRING data type, sex as the CHAR data type, and role as the VARCHAR data type.

How it works...
If you declare a field of the CHAR(12) data type, then it will always take 12 bytes irrespective of the size of the data you store. For example, whether you store 1 character or 12 characters in a CHAR(12) field, it takes 12 bytes in both cases. Also, because the field is declared as the CHAR(12) data type, we can store a maximum of 12 characters in this column. On the other hand, VARCHAR is a variable-length data type. It takes storage equal to the actual size of the field. For example, if you declare a field of the VARCHAR(12) data type, it takes a number of bytes equal to the number of characters stored in the column: if you store only one character, it takes only 1 byte, and if you store 10 characters, it takes 10 bytes. Also, if a field is declared as the VARCHAR(n) data type, then a maximum of n characters can be stored in this column.
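The padding and truncation behavior described above can be seen with a simple cast. This is a minimal sketch with illustrative literal values, and, as with the other examples, it assumes a table named test with at least one row:

-- CHAR(12) pads 'Hive' with trailing spaces to a length of 12,
-- while VARCHAR(4) silently truncates 'Hive Cookbook' to 'Hive'
SELECT CAST('Hive' AS CHAR(12)), CAST('Hive Cookbook' AS VARCHAR(4)) FROM test;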
Using Date/Time data types
Hive supports two data types for Date/Time-related fields: TIMESTAMP and DATE. The TIMESTAMP data type is used to represent a particular time with date and time values. It supports variable-length encoding of the traditional UNIX timestamp with optional nanosecond precision. It supports different conversions: a timestamp value provided as an integer numeric type is interpreted as a UNIX timestamp in seconds; a timestamp value provided as a floating point numeric type is interpreted as a UNIX timestamp in seconds with decimal precision; and a timestamp value provided as a string is interpreted in the java.sql.Timestamp format YYYY-MM-DD HH:MM:SS.fffffffff. If the timestamp value is in a format other than yyyy-mm-dd hh:mm:ss[.f...], then a UDF can be used to convert it to the timestamp format. The DATE type is used to represent only the date part of a timestamp, that is, YYYY-MM-DD. This type does not represent the time-of-day component. The allowed range of DATE values is 0000-01-01 to 9999-12-31. The DATE type can be cast to and from the DATE, TIMESTAMP, and STRING types.

How to do it...
The following is an example of using the DATE and TIMESTAMP data types in Hive:
CREATE TABLE timestamp_example (id INT, created_at DATE, updated_at TIMESTAMP);

Using miscellaneous data types
Hive supports two miscellaneous data types: BOOLEAN and BINARY. BOOLEAN accepts true or false values. BINARY is a sequence of bytes. It is similar to the VARBINARY data type found in many relational databases. If a field is declared as the binary type, then it is stored within the record, not separately like BLOBs. The binary data type is used when a record has hundreds of columns and the user is interested in only a few of them and does not care about the exact type information of the other columns. In such cases, the user can define the type of those columns as binary, so Hive will not try to interpret them. It is used to include arbitrary types in a record, and Hive does not attempt to parse them as numbers, strings, and so on.

How to do it...
The following is an example of using the BOOLEAN data type in Hive:
CREATE TABLE example (id INT, status BOOLEAN, description STRING);
The preceding statement creates a table, example, with status as the BOOLEAN data type.

Using complex data types
In addition to primitive data types, Hive also supports a few complex data types: STRUCT, MAP, and ARRAY. Complex data types are also known as collection data types. Most relational databases don't support such data types. Complex data types can be built from primitive data types:
* STRUCT: The struct data type in Hive is analogous to the STRUCT in the C programming language. It is a record type that holds a set of named fields that can be of any primitive data type. Fields in the STRUCT type are accessed using the DOT (.) notation. Syntax: STRUCT<field_name : data_type, ...>. For example, if a column address is of the type STRUCT {city STRING; state STRING}, then the city field can be referenced using address.city.
* MAP: The map data type contains key-value pairs. In a map, elements are accessed using the keys. For example, if a column name is of type MAP with 'firstname' -> 'john' and 'lastname' -> 'roy', then the last name can be accessed using name['lastname'].
* ARRAY: This is an ordered sequence of similar elements. It maintains an index in order to access the elements; for example, an array day containing a list of elements ['Sunday', 'Monday', 'Tuesday', 'Wednesday']. In this, the first element Sunday can be accessed using day[0], and similarly, the third element can be accessed using day[2].
* UNIONTYPE: This data type enables you to store different data types in the same memory location. It is an efficient way of using the same memory location for multiple purposes. It is similar to unions in the C programming language. You can define a union type with many data types, but at any time only one data type can be held by it. Syntax: UNIONTYPE<data_type, data_type, ...>
How to do it...
The following are some different examples of how to use complex data types:

1. To create a customer table with name as the struct data type, the following command can be used:
CREATE TABLE customer (id INT, name STRUCT<firstname: STRING, lastname: STRING>);
Here, the column name is of the type STRUCT with two fields, firstname and lastname; the firstname field can then be referenced using name.firstname, and similarly, the lastname field can be referenced using name.lastname.

2. Let's create a table test with the usage of different data types, such as int, double, array, and struct, as a uniontype:
CREATE TABLE test (column1 UNIONTYPE<INT, DOUBLE, ARRAY<STRING>, STRUCT<age: INT, country: STRING>>);

3. When we retrieve the data, it returns the result with the index of the data type and the value. The first part gives the information about which part of the union is being used. In the following example, index 0 means the first data type from the definition, which is an int, index 1 means the second data type, which is double, and so on:
SELECT column1 FROM test;
{1:6.0}                            // For the second data type, DOUBLE
{0:5}                              // For the first data type, INT
{2:["sunday","monday"]}            // For the third data type, ARRAY
{3:{"age":28,"country":"INDIA"}}   // For the fourth data type, STRUCT
{2:["tuesday","wednesday"]}        // For the third data type, ARRAY
{3:{"age":21,"country":"US"}}      // For the fourth data type, STRUCT
{0:...}                            // For the first data type, INT
{1:...}                            // For the second data type, DOUBLE

Using operators
Hive supports various built-in operators. There are four types of operators in Hive:
* Relational operators
* Arithmetic operators
* Logical operators
* Complex operators

Using relational operators
Relational operators are used to compare two operands. The output of the comparison is TRUE or FALSE, depending on the comparison of the operands. The following table describes the relational operators available in Hive:

Operator | Operand types | Description
A = B | All primitive data types | Returns TRUE if the primitives are equal, otherwise FALSE.
A != B | All primitive data types | Returns TRUE if primitive A is not equal to B, otherwise FALSE. It returns NULL if A or B is NULL.
A <=> B | All primitive data types | Returns the same result as the EQUAL (=) operator for non-null primitives. It returns TRUE if both A and B are NULL, and FALSE if one of them is NULL.
A <> B | All primitive data types | Returns the same result as the NOT EQUAL (!=) operator. It returns NULL if either primitive A or B is NULL.
A < B | All primitive data types | Returns TRUE if the expression A is less than B, otherwise FALSE. It returns NULL if A or B is NULL.
A <= B | All primitive data types | Returns TRUE if the expression A is less than or equal to the expression B, otherwise FALSE. It returns NULL if A or B is NULL.
A > B | All primitive data types | Returns TRUE if the expression A is greater than the expression B, otherwise FALSE. It returns NULL if A or B is NULL.
A >= B | All primitive data types | Returns TRUE if the expression A is greater than or equal to the expression B, otherwise FALSE. It returns NULL if A or B is NULL.
A BETWEEN B AND C | All primitive data types | Returns TRUE if the value of A is within B and C, otherwise FALSE. It returns NULL if A, B, or C is NULL.
A NOT BETWEEN B AND C | All primitive data types | Returns TRUE if A does not lie between B and C, otherwise FALSE. It returns NULL if A, B, or C is NULL.
A IS NULL | All data types | Returns TRUE if the expression A evaluates to NULL, otherwise FALSE.
A IS NOT NULL | All data types | Returns FALSE if the expression A evaluates to NULL, otherwise TRUE.
A LIKE B | Strings | Returns TRUE if the string A matches the SQL simple pattern B, otherwise FALSE. It returns NULL if A or B is NULL.
A NOT LIKE B | Strings | Returns TRUE if string A does not match the SQL simple pattern B, otherwise FALSE. It returns NULL if A or B is NULL.
A RLIKE B | Strings | Returns TRUE if any substring (possibly empty) of A matches the specified Java regular expression B, otherwise FALSE. It returns NULL if A or B is NULL. For example, 'hivefunction' RLIKE 'hive' returns TRUE.
A REGEXP B | Strings | The same as RLIKE.

How to do it...
Let's assume that the customer table is composed of fields named id, name, gender, and age, as shown in the following table. Generate a query to retrieve the customer details whose age is 41:

Id | Name | Gender | Age
1 | Kate | Male | 35
2 | John | Male | 41
3 | Mike | Male | 50
4 | Dave | Male | 32

The following statement will get the job done:
hive> SELECT * FROM customer WHERE age = 41 AND gender = 'Male';
On the successful execution of the query, you will get the following response:
Id | Name | Gender | Age
2 | John | Male | 41

In the following statement, we are trying to get the details of those customers whose age is between 30 and 40:
hive> SELECT * FROM customer WHERE age BETWEEN 30 AND 40;
On the successful execution of the query, you will get the following response:
Id | Name | Gender | Age
1 | Kate | Male | 35
4 | Dave | Male | 32

The following statement will return the details of those customers whose names start with K:
hive> SELECT * FROM customer WHERE name LIKE 'K%';
On the successful execution of the query, you will get the following response:
Id | Name | Gender | Age
1 | Kate | Male | 35

Using arithmetic operators
Arithmetic operators are used to perform arithmetic operations on operands, such as addition, subtraction, multiplication, division, and so on. All these operators return numbers. If any operand is NULL while performing an arithmetic operation, then the result will also be NULL. The following table describes the arithmetic operators available in Hive:

Operator | Operand types | Description
A + B | Numeric data types | Gives the sum (addition) of A and B
A - B | Numeric data types | Gives the difference resulting from subtracting B from A
A * B | Numeric data types | Gives the multiplication of A and B
A / B | Numeric data types | Gives the division of A by B
A % B | Numeric data types | Gives the remainder resulting from the division of A by B
A & B | Numeric data types | Gives the result of a bitwise AND operation of the operands A and B
A | B | Numeric data types | Gives the result of a bitwise OR operation of the operands A and B
A ^ B | Numeric data types | Gives the result of a bitwise XOR operation of the operands A and B
~A | Numeric data types | Gives the result of a bitwise NOT operation of the operand A

How to do it...
The following are a few simple examples of using arithmetic operators in Hive:
1. To add two numbers, execute the following command:
hive> SELECT 10+10 ADD FROM test;
After executing this query, you will get the result 20. Similarly, you can give two or more column names to get the sum of their values.
2. To multiply two numbers, use the following command:
hive> SELECT 10*10 MULTIPLE FROM test;
After executing this query, you will get the result 100. Similarly, you can give two or more column names to multiply their values.

Note
These are just examples. To get output from these commands, the table must have at least one row. Valid use cases for arithmetic operators are in WHERE clauses or when doing more complex operations on columns.

Using logical operators
Logical operators are used for logical operations, such as AND, OR, and so on. All these operators return TRUE or FALSE. If any operand is NULL while performing a logical operation, then the result will also be NULL. The following table describes the logical operators available in Hive:

Operator | Operand types | Description
A AND B | Boolean | Returns TRUE if both A and B are TRUE, otherwise FALSE.
A && B | Boolean | Same as A AND B.
A OR B | Boolean | Returns TRUE if either A or B or both are TRUE, otherwise FALSE.
A || B | Boolean | Same as A OR B.
NOT A | Boolean | Returns TRUE if A is FALSE. It returns NULL if A is NULL, otherwise FALSE.
!A | Boolean | Same as NOT A.
A IN (value1, value2, ...) | Boolean | Returns TRUE if the value of A is equal to any of the given values.
A IN (subquery) | Boolean | Returns TRUE if A is equal to any of the values returned by the subquery.
A NOT IN (value1, value2, ...) | Boolean | Returns TRUE if the value of A is not equal to any of the given values.
A NOT IN (subquery) | Boolean | Returns TRUE if the value of A is not equal to any of the values returned by the subquery.
EXISTS (subquery) | Boolean | Returns TRUE if the subquery returns at least one record.
NOT EXISTS (subquery) | Boolean | Returns TRUE if the subquery returns no records.

How to do it...
Let's assume that the customer table is composed of fields named id, name, gender, and age. The following are some examples of using logical operators:
* Select all customers of age 21, 41, and 60:
hive> SELECT * FROM customer WHERE age IN (21, 41, 60);
On the successful execution of the query, you will get the following response:
Id | Name | Gender | Age
2 | John | Male | 41
* Select all the male customers of age more than 40:
hive> SELECT * FROM customer WHERE gender = 'Male' AND age > 40;
On the successful execution of the query, you will get the following response:
Id | Name | Gender | Age
2 | John | Male | 41
3 | Mike | Male | 50

Using complex operators
Complex operators are used to access the elements of complex types. The following table describes the complex operators available in Hive:

Operator | Operand types | Description
A[n] | A is an ARRAY object, and n is the index of an element in the array | Returns the element at index n of the array.
M[key] | M is a MAP object, that is, a set of key-value pairs | Returns the value corresponding to the specified key in the map.
S.x | S is a STRUCT object | Returns the x field of S.

How to do it...
Let's first create a table, person, with different complex types:
CREATE TABLE person (
  id INT,
  phones ARRAY<STRING>,
  otherDetails MAP<STRING, STRING>,
  address STRUCT<city: STRING, state: STRING>
);
Now, to access the different values of a complex type attribute, the different complex operators can be used. To access an alternative phone number of a user, execute the following command:
hive> SELECT phones[1] FROM person;
Let's assume that the person has some other details, such as hometown='HG' and preference='homepage'; then we can access each element using the particular key of the otherDetails field:
hive> SELECT otherDetails['hometown'] FROM person;
To access the city of a person from the address attribute, execute the following command:
hive> SELECT address.city FROM person;

Partitioning
Partitioning in Hive is used to increase query performance. Hive is a very good tool to perform queries on large datasets, especially datasets that require a scan of an entire table. Generally, users are aware of their data domain, and in most cases they want to search for a particular type of data. For such cases, a simple query takes a long time to return the result because it requires a scan of the entire dataset. The concept of partitioning can be used to reduce the cost of querying the data. Partitions are like horizontal slices of data that allow large sets of data to be handled as more manageable chunks. Table partitioning means dividing the table data into parts based on the unique values of particular columns (for example, city and country) and segregating the input data records into different files or directories.
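To make the idea concrete before the detailed recipes, the following is a minimal sketch of a partitioned table; the table and column names (sales_part, country, state) are illustrative and not part of the original recipes. Each distinct combination of the partition column values gets its own directory under the table's location, so a query that filters on those columns only reads the matching directories:

CREATE TABLE sales_part (
  id INT,
  fname STRING,
  amount DOUBLE
)
PARTITIONED BY (country STRING, state STRING);

-- Only the directory for country=US/state=CA is scanned for this query
SELECT fname, amount FROM sales_part WHERE country = 'US' AND state = 'CA';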