BDA Unit 3

The document provides an overview of Hadoop YARN architecture, its advantages, and components such as Resource Manager and Node Manager. It also introduces Hive, detailing its features, architecture, and installation process, emphasizing its role in managing structured data within Hadoop. Additionally, it covers the use of Pig for data analysis, including its functionalities and error handling.
Understanding Hadoop YARN Architecture: Introduction to YARN, Advantages of YARN, YARN Architecture, Working of YARN. Exploring Hive: Introducing Hive, Getting Started with Hive, Hive Services, Data Types in Hive, Built-In Functions in Hive, Hive DDL, Data Manipulation in Hive, Data Retrieval Queries, Using JOINS in Hive. Analyzing Data with Pig: Introducing Pig, Running Pig, Getting Started with Pig Latin, Working with Operators in Pig, Working with Functions in Pig, Debugging Pig, Error Handling in Pig.

3.1 UNDERSTANDING HADOOP YARN ARCHITECTURE

3.1.1 Introduction to YARN

Q1. What is YARN? Write about it.

Ans :

Apache YARN ("Yet Another Resource Negotiator") is the resource management layer of Hadoop, introduced in Hadoop 2.x. YARN allows different data processing engines - graph processing, interactive processing, stream processing as well as batch processing - to run and process data stored in HDFS (Hadoop Distributed File System). Apart from resource management, YARN is also used for job scheduling. It extends the power of Hadoop to other evolving technologies, so they can take advantage of HDFS and the economical Hadoop cluster.

Apache YARN is also considered the data operating system for Hadoop 2.x. The YARN-based architecture of Hadoop 2.x provides a general-purpose data processing platform that is not limited to MapReduce. It enables Hadoop to run purpose-built data processing systems other than MapReduce and allows several different frameworks to run on the same hardware where Hadoop is deployed.

Components of YARN

> Client : Submits MapReduce jobs.
> Resource Manager : Manages the use of resources across the cluster.
> Node Manager : Launches and monitors the compute containers on machines in the cluster.
> MapReduce Application Master : Coordinates the tasks running in a MapReduce job. The application master and the MapReduce tasks run in containers that are scheduled by the Resource Manager and managed by the Node Managers.

JobTracker and TaskTracker were used in previous versions of Hadoop and were responsible for handling resources and tracking progress. Hadoop 2.0 introduces the Resource Manager and Node Manager to overcome the shortfalls of JobTracker and TaskTracker.

3.1.2 Advantages of YARN

Q2. What are the advantages of YARN?

Ans :

YARN has many advantages over MapReduce (MRv1).

1) Scalability - By delegating the work of handling the tasks running on slaves to the Application Master, the load on the Resource Manager (RM) decreases. The RM can now handle more requests than the JobTracker could, which makes it easier to add more nodes.

2) Multi-framework support - Unlike MRv1, which is strongly coupled to MapReduce, YARN supports many kinds of workloads running on the cluster, such as MR2, Tez, Storm and Spark.

3) Optimized resource allocation - There is no fixed number of slots separately allocated for mappers and reducers in YARN, as was the case in MRv1, so the available capacity of the nodes can be used by any task that needs resources.

4) Recovery of the Resource Manager - When the Resource Manager fails, the jobs running on the cluster need not be restarted after the Resource Manager recovers.

5) Efficient resource utilization - There are no more fixed map-reduce slots. YARN provides a central resource manager, and multiple applications can now run in Hadoop, all sharing a common pool of resources.
6) Support for non-MapReduce applications - YARN decouples MapReduce's resource management and scheduling capabilities from the data processing component, enabling Hadoop to support more varied processing approaches and a broader array of applications. For example, Hadoop clusters can now run interactive querying and streaming data applications simultaneously with MapReduce batch jobs. This also streamlines MapReduce to do what it does best: process data.

3.1.3 YARN Architecture

Q3. Explain the architecture of Apache Hadoop YARN.

Ans :

Figure: Apache Hadoop YARN architecture (Resource Manager, Node Managers, Application Masters and containers).

i. Resource Manager (RM)

It is the master daemon of YARN. It manages the global assignment of resources (CPU and memory) among all the applications and arbitrates system resources between competing applications. The Resource Manager has two main components:

> Scheduler
> Application Manager

a. Scheduler

The Scheduler is responsible for allocating resources to the running applications. It is a pure scheduler: it performs no monitoring or tracking of the application, and it gives no guarantee about restarting failed tasks, whether they fail due to application failure or hardware failure.

b. Application Manager

It manages the running Application Masters in the cluster, i.e., it is responsible for starting Application Masters and for monitoring and restarting them on different nodes in case of failures.

ii. Node Manager (NM)

It is the slave daemon of YARN. The NM is responsible for monitoring the containers' resource usage, reporting it to the Resource Manager, and managing the user processes on that machine. The Node Manager also tracks the health of the node on which it is running. The design allows long-running auxiliary services to be plugged into the NM; these are application-specific services, specified as part of the configuration and loaded by the NM during startup. For MapReduce applications on YARN, shuffle is a typical auxiliary service loaded by the NMs.

iii. Application Master (AM)

One Application Master runs per application. It negotiates resources from the Resource Manager, works with the Node Managers, and manages the application life cycle. The AM acquires containers from the RM's Scheduler before contacting the corresponding NMs to start the application's individual tasks.

iv. Container

A container is started by a Node Manager. It consists of resources such as memory and CPU cores. For running a map or reduce task, the Application Master asks the Resource Manager for resources with which a container can be run.

3.1.4 Working of YARN

Q4. Explain how YARN works in running a job.

Ans :

Steps involved in running a job using YARN:

1. The user submits the job to the Job Client present on the client node and gets an application id from the Resource Manager.

2. The job, which consists of jar files, class files and other required files, is copied to the HDFS file system under a directory named after the application id, so that the job can be copied to the nodes where it will run.

3. The job is submitted to the Resource Manager.

4. The Resource Manager contacts a Node Manager to start a container and run the Application Master in it.

5. The Application Master checks the splits (usually blocks of a datanode in HDFS) on which the job has to run and creates one task per split; only ids are given to the tasks in this phase. It checks whether all the tasks can be run sequentially on the same JVM on which the Application Master is running; if so, it does not launch any new containers. This type of job is called an uber job.

6. If the job is not an uber job, the Application Master asks the Resource Manager to allocate resources. The Resource Manager knows, via the Node Managers, the HDFS blocks and their bandwidth, so it allocates resources considering data locality, so that tasks can run on the same machines on which the data blocks are present.

7. The Application Master gets the resource information from the Resource Manager and launches the containers through the Node Managers.

8. In the container, the task is executed by a Java application whose main class is YarnChild. Before running the task, it copies all the job resources; in most cases the job, usually in jar form, is copied to the machine on which the data is present.

9. The tasks report progress and status updates to the Application Master, and in case of a task failure the Node Manager notifies the Application Master.
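The job flow above can be watched from the command line. The following is a minimal sketch, assuming a working Hadoop 2.x/YARN installation; the example jar name, input/output paths and application id are placeholders, not values from this unit.

# Submit a MapReduce job; YARN starts an Application Master for it
hadoop jar hadoop-mapreduce-examples.jar wordcount /input /output

# List the applications currently known to the Resource Manager
yarn application -list

# Show the status of one application (use the id printed by the previous command)
yarn application -status application_1700000000000_0001

# List the Node Managers that can host containers
yarn node -list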
3.2 EXPLORING HIVE

3.2.1 Introducing Hive

Q5. What is Hive? Write its features and characteristics.

Ans :

Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.

Initially Hive was developed by Facebook; later the Apache Software Foundation took it up and developed it further as an open source project under the name Apache Hive. It is used by different companies. For example, Amazon uses it in Amazon Elastic MapReduce.

Hive is not

> A relational database
> A design for OnLine Transaction Processing (OLTP)
> A language for real-time queries and row-level updates

Features of Hive

> It stores schema in a database and processed data in HDFS.
> It is designed for OLAP.
> It provides an SQL-type language for querying called HiveQL or HQL.
> It is familiar, fast, scalable, and extensible.

Important characteristics of Hive

> In Hive, tables and databases are created first, and then data is loaded into these tables.
> Hive is a data warehouse designed for managing and querying only structured data that is stored in tables.
> While dealing with structured data, MapReduce does not have optimization and usability features like UDFs, but the Hive framework does. Query optimization refers to an effective way of executing a query in terms of performance.
> Hive's SQL-inspired language separates the user from the complexity of MapReduce programming. It reuses familiar concepts from the relational database world, such as tables, rows, columns and schema, for ease of learning.
> Hadoop's programming works on flat files, so Hive can use directory structures to "partition" data to improve performance on certain queries.
> A new and important component of Hive is the Metastore, used for storing schema information. This Metastore typically resides in a relational database.
> We can interact with Hive using methods like a Web GUI and the Java Database Connectivity (JDBC) interface.
> Most interactions tend to take place over a command line interface (CLI). Hive provides a CLI to write Hive queries using the Hive Query Language (HQL).
> Generally, HQL syntax is similar to the SQL syntax that most data analysts are familiar with. The sample query below displays all the records present in the mentioned table.
  Sample query : SELECT * FROM <table_name>;
> Hive supports four file formats: TEXTFILE, SEQUENCEFILE, ORC and RCFILE (Record Columnar File).
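To make the last two points concrete, here is a small hedged HiveQL sketch; the sales table and its columns are invented purely for illustration and are not part of this unit's running examples.

-- Schema is declared first, then data is loaded into the table
CREATE TABLE sales (item STRING, qty INT, price DOUBLE)
STORED AS ORC;   -- any of TEXTFILE, SEQUENCEFILE, RCFILE or ORC may be used here

-- HQL reads like ordinary SQL
SELECT item, SUM(qty * price) AS revenue
FROM sales
GROUP BY item;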
Q6. Describe the components of Hive architecture.

Ans :

Figure: Hive architecture - Hive clients (Thrift, JDBC and ODBC applications), Hive services (CLI, Hive server, Driver, Metastore) and the underlying Hadoop file system and job client.

Hive consists of mainly three core parts:

1. Hive Clients
2. Hive Services
3. Hive Storage and Computing

1. Hive Clients

Hive provides different drivers for communication with different types of applications. For Thrift-based applications it provides a Thrift client for communication; for Java-related applications it provides JDBC drivers; and for other types of applications it provides ODBC drivers. These clients and drivers in turn communicate with the Hive server in the Hive services.

2. Hive Services

Client interactions with Hive are performed through Hive Services. If a client wants to perform any query-related operation in Hive, it has to communicate through Hive Services. The CLI is the command line interface and acts as the Hive service for DDL (Data Definition Language) operations. All drivers communicate with the Hive server and with the main driver in the Hive services, as shown in the architecture diagram. The driver present in the Hive services represents the main driver; it communicates with all types of JDBC, ODBC and other client-specific applications, and it processes their requests and passes them on to the metastore and file systems for further processing.

3. Hive Storage and Computing

Hive services such as the Metastore, the file system and the Job Client in turn communicate with Hive storage and perform the following actions:

> Metadata information of tables created in Hive is stored in the Hive "meta storage database".
> Query results and data loaded into the tables are stored in the Hadoop cluster on HDFS.

3.2.2 Getting Started with Hive

Q7. Write the step-by-step installation process of Hive.

Ans :

Step 1 : Setting up the environment for Hive

Add the below lines to the ~/.bashrc file:

export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
export CLASSPATH=$CLASSPATH:/usr/local/hadoop/lib/*:.
export CLASSPATH=$CLASSPATH:/usr/local/hive/lib/*:.

Now run the .bashrc file to reflect those changes: $ source ~/.bashrc
Step 2 : Configuring Hive

To configure Hive, hive-env.sh is edited. This file is present in $HIVE_HOME/conf.

$ cd $HIVE_HOME/conf
$ cp hive-env.sh.template hive-env.sh

Add the below line to hive-env.sh:

export HADOOP_HOME=/usr/local/hadoop

Step 3 : Derby Database

Hive uses an external database server to configure the Metastore. Download and install Apache Derby by following the steps given below.

Downloading Apache Derby

The following command is used to download Apache Derby. It takes some time to download.

$ cd ~
$ wget http://archive.apache.org/dist/db/derby/db-derby-10.4.2.0/db-derby-10.4.2.0-bin.tar.gz

Setting up the environment

Add the below lines to the ~/.bashrc file:

export DERBY_HOME=/usr/local/derby
export PATH=$PATH:$DERBY_HOME/bin
export CLASSPATH=$CLASSPATH:$DERBY_HOME/lib/derby.jar:$DERBY_HOME/lib/derbytools.jar

To reflect the changes, type:

$ source ~/.bashrc

Create a directory to store Metastore data

Create a directory named data in the $DERBY_HOME directory to store Metastore data:

$ mkdir $DERBY_HOME/data
Step 4 : Configuring the Metastore of Hive

Edit hive-site.xml and append the following lines between the <configuration> and </configuration> tags:

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby://localhost:1527/metastore_db;create=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>

Create a file named jpox.properties and add the following lines to it:

org.jpox.PersistenceManagerFactoryClass = org.jpox.PersistenceManagerFactoryImpl
org.jpox.autoCreateSchema = false
org.jpox.validateTables = false
org.jpox.validateColumns = false
org.jpox.validateConstraints = false
org.jpox.storeManagerType = rdbms
org.jpox.autoCreateSchema = true
org.jpox.autoStartMechanismMode = checked
org.jpox.transactionIsolation = read_committed

3.2.3 Hive Services

Q8. Write about the various services used in Hive.

Ans :

Hive services are as follows.

> CLI

The Hive CLI (Command Line Interface), which is nothing but the Hive shell, is the default service in Hive and the most common way of interacting with Hive. We can run both batch and interactive shell commands via the CLI service. We can get the list of commands and options allowed on the Hive CLI with the command $ hive --service cli --help from the terminal.

> Hive Server

HiveServer is an optional service that allows a remote client to submit requests to Hive, using a variety of programming languages, and retrieve results. HiveServer is built on Apache Thrift; it is therefore sometimes called the Thrift server, although this can lead to confusion because a newer service named HiveServer2 is also built on Thrift. Since the introduction of HiveServer2, HiveServer has also been called HiveServer1.

> Hive Web Interface

The Hive Web Interface is an alternative to using the Hive command line interface, and using it is a great way to get started with Hive. The Hive Web Interface, abbreviated as HWI, is a simple graphical user interface (GUI).

> JAR

The hive jar service is somewhat equivalent to hadoop jar: it is a convenient way to run Java applications that include both Hadoop and Hive classes on the classpath.

> Metastore

The Metastore is the Hive internal database which stores all the table definitions. By default Hive uses the Derby database as its metastore. The metastore is divided into two pieces: the service and the backing store for the data. By default the metastore service runs in the same JVM as the Hive service and contains an embedded Derby database instance backed by the local disk. This is called the embedded metastore configuration.

> Hive Client

There are many different mechanisms to get in contact with the application when Hive is run as a server. The following is one of the clients to the Hive server.

> Thrift Client

Apache Thrift is a software framework for scalable cross-language services development, which combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Perl, C#, JavaScript, Node.js and other languages. Thrift can be used when developing a web service written in one language that needs to access a service written in another language.
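As a concrete illustration of the HiveServer/Thrift services above, the hedged sketch below connects to a HiveServer2 instance with beeline, the JDBC command-line client shipped with newer Hive releases; the host, port, database and user name are assumed defaults and placeholders, not values from this unit.

# Connect to HiveServer2 over JDBC (default port 10000)
beeline -u jdbc:hive2://localhost:10000/default -n hiveuser

# Once connected, ordinary HiveQL can be submitted remotely, for example:
# 0: jdbc:hive2://localhost:10000/default> SHOW DATABASES;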
3.2.4 Data Types in Hive

Q9. Explain Hive data types with examples.

Ans :

Hadoop Hive data types are mainly divided into two groups:

> Primitive data types
> Complex data types

1. Primitive Data Types

Primitive data types are further divided into four types:

> Numeric data types
> String data types
> Date/Time data types
> Miscellaneous data types

These data types are similar to the data types of Java and SQL.

Numeric data types

TINYINT  - 1-byte signed integer; range -128 to 127; example: 100Y
SMALLINT - 2-byte signed integer; range -32,768 to 32,767; example: 100S
INT      - 4-byte signed integer; range -2,147,483,648 to 2,147,483,647; examples: 100, 1000, 50000
BIGINT   - 8-byte signed integer; range approximately -9.2*10^18 to 9.2*10^18; example: 100L
FLOAT    - 4-byte single-precision floating point; example: 1500.00
DOUBLE   - 8-byte double-precision floating point; example: 750000.00
DECIMAL  - fixed-point value with precision up to 38 digits; example: DECIMAL(5,2)

By default, integral values in Hive are taken as the INT data type unless the value crosses the range of INT values (see the table above). If a small integral value such as 100 is to be treated as TINYINT, SMALLINT or BIGINT, we need to suffix the value with Y, S or L respectively.

Examples: 100Y - TINYINT, 100S - SMALLINT, 100L - BIGINT

String data types

In Hive, string data types are mainly divided into three types; VARCHAR and CHAR were added in later Hive releases.

STRING  - a sequence of characters; either single quotes (') or double quotes (") can be used to enclose the value. Example: 'Welcome to Hadooptutorial.info'
VARCHAR - the maximum length is specified in braces, similar to SQL's VARCHAR; the maximum length allowed is 65,535. Example: 'Welcome to Hadooptutorial.info tutorials'
CHAR    - similar to SQL's CHAR, with fixed length; values shorter than the specified length are padded with spaces. Example: 'Hadooptutorial.info'

Date/Time data types

Date/Time data types are mainly divided into two types, both stored in the UNIX timestamp style for date/time-related fields in Hive.

DATE      - represented in the format YYYY-MM-DD. Example: DATE '2014-12-07'. The allowed range is 0000-01-01 to 9999-12-31.
TIMESTAMP - uses the format yyyy-mm-dd hh:mm:ss[.f...].

Casts between date, string and timestamp values behave as follows:

cast(date as date)        - the same date value.
cast(date as string)      - the date is formatted as a string in the form 'YYYY-MM-DD'.
cast(date as timestamp)   - midnight of the year/month/day of the date value is returned as a timestamp.
cast(string as date)      - if the string is in the form 'YYYY-MM-DD', the corresponding date value is returned; if the string does not match this format, NULL is returned.
cast(timestamp as date)   - the year/month/day of the timestamp is returned as a date value.

Miscellaneous types

Hive supports two more primitive data types:

BOOLEAN - similar to Java's boolean; it stores true or false values only.
BINARY  - an array of bytes, similar to VARBINARY in many RDBMSs. BINARY columns are stored within the record, not separately like BLOBs.

2. Complex Data Types

Hive also provides complex types such as ARRAY, MAP, STRUCT and UNIONTYPE, which hold collections of other types.
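A hedged sketch that exercises several of the types above in a single table definition; the table and column names are invented for illustration only.

CREATE TABLE employee_demo (
  emp_id     BIGINT,
  name       STRING,
  grade      CHAR(2),
  dept       VARCHAR(40),
  salary     DECIMAL(10,2),
  is_active  BOOLEAN,
  joined_on  DATE,
  last_login TIMESTAMP,
  skills     ARRAY<STRING>,        -- complex type: ordered collection
  attributes MAP<STRING,STRING>    -- complex type: key/value pairs
);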
3.2.5 Built-In Functions in Hive

Q10. Describe some built-in functions of Hive.

Ans :

upper(string A), ucase(string A)   - return the string resulting from converting all characters of A to upper case.
lower(string A), lcase(string A)   - return the string resulting from converting all characters of A to lower case.
trim(string A)                     - returns the string resulting from trimming spaces from both ends of A.
ltrim(string A)                    - returns the string resulting from trimming spaces from the beginning (left-hand side) of A.
rtrim(string A)                    - returns the string resulting from trimming spaces from the end (right-hand side) of A.
regexp_replace(string A, string B, string C) - returns the string resulting from replacing all substrings in A that match the Java regular expression B with C.
size(Map<K,V>)                     - returns the number of elements in the map type.
size(Array<T>)                     - returns the number of elements in the array type.
cast(<expr> as <type>)             - converts the result of the expression expr to the given type; e.g., cast('1' as BIGINT) converts the string '1' to its integral representation. NULL is returned if the conversion does not succeed.
from_unixtime(int unixtime)        - converts the number of seconds from the Unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone, in the format "1970-01-01 00:00:00".
to_date(string timestamp)          - returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = "1970-01-01".
year(string date)                  - returns the year part of a date or timestamp string: year("1970-01-01 00:00:00") = 1970, year("1970-01-01") = 1970.
month(string date)                 - returns the month part of a date or timestamp string: month("1970-11-01 00:00:00") = 11, month("1970-11-01") = 11.
day(string date)                   - returns the day part of a date or timestamp string: day("1970-11-01 00:00:00") = 1, day("1970-11-01") = 1.
get_json_object(string json_string, string path) - extracts a JSON object from a JSON string based on the specified JSON path and returns the JSON string of the extracted object; returns NULL if the input JSON string is invalid.

Q11. Give an example to demonstrate some built-in functions.

Ans :

The following queries demonstrate some built-in functions.

round() function
hive> SELECT round(2.6) FROM temp;
On successful execution of the query, you get to see the following response:
3.0

floor() function
hive> SELECT floor(2.6) FROM temp;
On successful execution of the query, you get to see the following response:
2.0

ceil() function
hive> SELECT ceil(2.6) FROM temp;
On successful execution of the query, you get to see the following response:
3.0

Q12. Write some aggregate functions of Hive.

Ans :

Aggregate Functions

Hive supports built-in aggregate functions such as count(*), count(expr), sum(col), avg(col), min(col) and max(col). The usage of these functions is the same as in SQL: each computes a single summary value over a group of rows and is typically combined with GROUP BY.
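A hedged sketch of these aggregate functions in use; it assumes the students table (with fee and city columns) that is created later in Q15 of this unit.

SELECT city,
       count(*) AS num_students,   -- number of rows per group
       avg(fee) AS avg_fee,
       min(fee) AS min_fee,
       max(fee) AS max_fee,
       sum(fee) AS total_fee
FROM students
GROUP BY city;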
3.2.6 Hive DDL Commands

Q13. Explain the DDL commands on databases in Hive.

Ans :

> Create Database Command in Hive

CREATE DATABASE [IF NOT EXISTS] database_name
  [COMMENT database_comment]
  [LOCATION hdfs_path]
  [WITH DBPROPERTIES (property_name=property_value, ...)];

In the above syntax for the create database command, the values mentioned in square brackets [] are optional.

Usage of the Create Database command in Hive:

hive> create database if not exists firstDB comment "This is my first demo" location '/user/hive/warehouse/newdb' with DBPROPERTIES ('createdby'='abhay','createdfor'='dezyre');
OK
Time taken: 0.092 seconds

> Drop Database Command in Hive

This command is used for deleting an already created database in Hive, and the syntax is as follows:

DROP (DATABASE) [IF EXISTS] database_name [RESTRICT | CASCADE];

Usage of the Drop Database command in Hive:

hive> drop database if exists firstDB CASCADE;
OK
Time taken: 0.099 seconds

In Hadoop Hive, the mode is set as RESTRICT by default, and users cannot delete a database unless it is empty. For deleting a database in Hive along with its existing tables, users must change the mode from RESTRICT to CASCADE.

> Describe Database Command in Hive

This command is used to check any associated metadata for a database.

> Alter Database Command in Hive

Whenever developers need to change the metadata of a database, the alter database DDL command can be used as follows:

ALTER (DATABASE) database_name SET DBPROPERTIES (property_name=property_value, ...);

The Alter command can also be used to modify the OWNER property and specify the role for the owner:

ALTER (DATABASE) database_name SET OWNER [USER | ROLE] user_or_role;

> Show Database Command in Hive

Programmers can view the list of existing databases in the current schema:

show databases;

> Use Database Command in Hive

This command is used to select a specific database for the session, on which subsequent Hive queries will be executed:

use database_name;
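The describe, alter, show and use commands can be seen together in the following hedged sketch; it reuses the firstDB database from the create example above, and the property value is illustrative only.

DESCRIBE DATABASE EXTENDED firstDB;                      -- shows comment, location and dbproperties
ALTER DATABASE firstDB SET DBPROPERTIES ('edited-by'='abhay');
SHOW DATABASES;                                          -- lists all databases in the schema
USE firstDB;                                             -- makes firstDB the current database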
DDL Commands on Tables in Hive

Q14. Explain the DDL commands on tables in Hive.

Ans :

> Create Table Command in Hive

The Hive create table command is used to create a table in the existing database that is in use for a particular session:

CREATE TABLE [IF NOT EXISTS] [db_name.]table_name
  (col_name data_type [COMMENT col_comment], ...)
  [COMMENT table_comment]
  [ROW FORMAT row_format]
  [STORED AS file_format]
  [LOCATION hdfs_path];

In this way a table named Students can be created in the database college with fields like ID, Name, fee, city, etc. Comments are mentioned for each column so that anybody referring to the table gets an overview of what the columns mean. The LOCATION keyword is used for specifying where the table should be stored on HDFS.

> DROP Table Command in Hive

Drops the table and all the data associated with it from the Hive metastore:

DROP TABLE [IF EXISTS] table_name [PURGE];

The DROP TABLE command removes the metadata and data for a particular table. The data is usually moved to the .Trash/Current directory if Trash is configured. If the PURGE option is specified, the table data does not go to the trash directory, and there is no scope to retrieve the data in case of an erroneous DROP command execution.

> TRUNCATE Table Command in Hive

This command is used to truncate all the rows present in a table, i.e., it deletes all the data from the Hive metastore, and the data cannot be restored:

TRUNCATE TABLE [db_name.]table_name;

> ALTER Table Command in Hive

Using the ALTER TABLE command, the structure and metadata of a table can be modified even after the table has been created. To change the name of an existing table:

ALTER TABLE [db_name.]old_table_name RENAME TO [db_name.]new_table_name;

Syntax to alter table properties:

ALTER TABLE [db_name.]table_name SET TBLPROPERTIES ('property_key'='property_new_value');

> DESCRIBE Table Command in Hive

Gives the information of a particular table, and the syntax is as follows:

DESCRIBE [EXTENDED | FORMATTED] [db_name.]table_name[.col_name [.field_name]];

Usage of the Describe Table command:

hive> describe college.college_students;

> SHOW Table Command in Hive

Gives the list of existing tables in the current database schema:

show tables;

3.2.7 Data Manipulation in Hive

Q15. Explain Hive DML commands.

Ans :

Suppose the input is a '|'-delimited file where each row can be inserted as a table record. First, let's create a table students based on the contents of the file:

hive> CREATE TABLE IF NOT EXISTS college.students (
    > ID BIGINT COMMENT 'unique id for each student',
    > name STRING COMMENT 'student name',
    > age INT COMMENT 'student age between 16-26',
    > fee DOUBLE COMMENT 'student college fee',
    > city STRING COMMENT 'cities to which students belong',
    > state STRING COMMENT 'student home address state',
    > zip BIGINT COMMENT 'student address zip code'
    > )
    > COMMENT 'This table holds the demography info for each student'
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY '|'
    > LINES TERMINATED BY '\n'
    > STORED AS TEXTFILE
    > LOCATION '/user/hive/warehouse/college.db/students';
OK
Time taken: 0.112 seconds

Points to note about the clauses used above:

> The ROW FORMAT DELIMITED clause must appear before any of the other clauses, with the exception of the STORED AS ... clause.
> The clause FIELDS TERMINATED BY '|' means the '|' character will be used as the field separator by Hive.
> The clause LINES TERMINATED BY '\n' means that the line delimiter will be a new line.
> The clauses LINES TERMINATED BY '\n' and STORED AS ... do not require the ROW FORMAT DELIMITED keywords.

Let's load the file into the students table:

hive> LOAD DATA LOCAL INPATH '<local path to the students file>' OVERWRITE INTO TABLE college.students;

If the keyword LOCAL is specified, Hive loads the file from the local file system. If LOCAL is not specified, the following rules apply:

> Hive assumes the path is an HDFS path and tries to search for the file in HDFS.
> If the path is not absolute, Hive tries to locate the file under /user/<username> in HDFS.

Using the OVERWRITE keyword while importing means the existing data will be deleted and the new data ingested; otherwise the new data would just be appended. With OVERWRITE, the contents of the target table are deleted and replaced by the files referred to by the file path; otherwise the files referred to by the file path are added to the table.
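Besides LOAD DATA, rows can also be written with INSERT statements. The following hedged sketch assumes the college.students table defined above; the literal row and the backup table name are invented, and INSERT ... VALUES requires Hive 0.14 or later.

-- Append a single row to the table
INSERT INTO TABLE college.students
VALUES (601, 'Asha', 22, 18500.0, 'Pune', 'Maharashtra', 411001);

-- Rewrite a table from the result of a query instead of from a file
INSERT OVERWRITE TABLE college.students_backup
SELECT * FROM college.students;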
Let's check whether the data has been inserted into the table:

hive> select * from students;
...
596 Stephen 25 16573.0 Gaya BR 874761
597 Colby 25 19929.0 New Bombay Maharastra 868698
598 Drake 21 49260.0 Nagaon Assam 157775
599 Tanek 18 12535.0 Gurgaon Haryana 201260
600 Hedda 23 43896.0 Ajmer RJ 697025
Time taken: 0.132 seconds, Fetched: 601 row(s)

3.2.8 Data Retrieval Queries

Now, let's try to retrieve only 5 records using the LIMIT option:

hive> select * from students limit 5;
OK
NULL name  NULL NULL    city          state           NULL
1    Kenisl 22  25874.0 Kulti-Barakar WB              769853
2    Miya   25  35367.0 Jalgaon       Maharastra      451333
3    Raven  20  49103.0 Rewa          Madhya Pradesh  710179
4    Cala   19  27120.0 Pilibhit      UP              392423
Time taken: 0.144 seconds, Fetched: 5 row(s)

Other clauses useful in retrieval queries:

> CLUSTER BY, DISTRIBUTE BY and SORT BY specify the sort order and algorithm.
> LIMIT specifies how many records to retrieve.

WHERE Clauses

A WHERE clause is used to filter the result set by using predicate operators and logical operators. Functions can also be used to compute the condition.

* List of Predicate Operators
* List of Logical Operators
* List of Functions

Here's an example query that uses a WHERE clause:

SELECT name FROM products WHERE name = 'stone of jordan';

GROUP BY Clauses

A GROUP BY clause is frequently used with aggregate functions, to group the result set by columns and apply aggregate functions over each group. Functions can also be used to compute the grouping key.

* List of Aggregate Functions
* List of Functions

Here's an example query that groups and counts by category:

SELECT category, count(1) FROM products GROUP BY category;

HAVING Clauses

A HAVING clause lets you filter the groups produced by GROUP BY, by applying predicate operators to each group.

* List of Predicate Operators

Here's an example query that groups and counts by category, and then retrieves only the groups whose counts are greater than 10:

SELECT category, count(1) AS cnt FROM products GROUP BY category HAVING cnt > 10;

3.2.9 Using Joins in Hive

Q16. Write about the use of JOINS in Hive.

Ans :

JOIN is a clause that is used for combining specific fields from two tables by using values common to each. It is used to combine records from two or more tables in the database. It is more or less similar to SQL JOIN.

Types of Joins

1. INNER JOIN - selects records that have matching values in both tables.
2. LEFT JOIN (LEFT OUTER JOIN) - returns all the values from the left table, plus the matched values from the right table, or NULL in case of no matching join predicate.
3. RIGHT JOIN (RIGHT OUTER JOIN) - returns all the values from the right table, plus the matched values from the left table, or NULL in case of no matching join predicate.
4. FULL JOIN (FULL OUTER JOIN) - selects all records that match either the left or the right table records.
5. LEFT SEMI JOIN - only returns the records from the left-hand table. Hive does not support IN subqueries of this form, so you cannot write:
   SELECT * FROM TABLE_A WHERE TABLE_A.ID IN (SELECT ID FROM TABLE_B);
   A LEFT SEMI JOIN achieves the same result.

Let's take two tables, Employee and EmployeeDepartment, that are going to be joined:

Employee
EmpID  EmpName  Address
1      Rose     US
2      Fred     US
3      Jess     In
4      Frey     Th

EmployeeDepartment
EmpID  Department
1      ...
2      ...
3      Eng
4      Admin
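The worked join queries for these two tables did not survive in this copy, so the following is a hedged HiveQL sketch of an inner join and a left outer join on them, assuming the tables exist in Hive with the column names shown above.

-- Inner join: only employees that have a matching department row
SELECT e.EmpID, e.EmpName, d.Department
FROM Employee e
JOIN EmployeeDepartment d ON e.EmpID = d.EmpID;

-- Left outer join: all employees, with NULL department when there is no match
SELECT e.EmpID, e.EmpName, d.Department
FROM Employee e
LEFT OUTER JOIN EmployeeDepartment d ON e.EmpID = d.EmpID;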
3.3 ANALYZING DATA WITH PIG

3.3.1 Introducing Pig

Q17. What is Apache Pig? List its features.

Ans :

Apache Pig is an abstraction over MapReduce. It is a tool/platform used to analyze large data sets, representing them as data flows. Pig is generally used with Hadoop; we can perform all the data manipulation operations in Hadoop using Apache Pig.

To write data analysis programs, Pig provides a high-level language known as Pig Latin. This language provides various operators using which programmers can develop their own functions for reading, writing, and processing data.

To analyze data using Apache Pig, programmers need to write scripts using the Pig Latin language. All these scripts are internally converted to Map and Reduce tasks. Apache Pig has a component known as Pig Engine that accepts the Pig Latin scripts as input and converts those scripts into MapReduce jobs.

Features of Pig

Apache Pig comes with the following features:

> Rich set of operators - It provides many operators to perform operations like join, sort, filter, etc.
> Ease of programming - Pig Latin is similar to SQL, and it is easy to write a Pig script if you are good at SQL.
> Optimization opportunities - The tasks in Apache Pig optimize their execution automatically, so the programmers need to focus only on the semantics of the language.
> Extensibility - Using the existing operators, users can develop their own functions to read, process and write data.
> UDFs - Pig provides the facility to create User Defined Functions in other programming languages such as Java, and to invoke or embed them in Pig scripts.
> Handles all kinds of data - Apache Pig analyzes all kinds of data, both structured and unstructured. It stores the results in HDFS.

Q18. Explain the architecture of Apache Pig.

Ans :

For writing a Pig script, we need the Pig Latin language, and to execute it, we need an execution environment.

Figure: Apache Pig architecture - Pig Latin scripts enter through the Grunt shell or Pig Server, then pass through the Parser, Optimizer and Compiler before the resulting MapReduce jobs run on the execution engine.

Pig Latin Scripts

Initially, as illustrated in the figure, we submit Pig scripts to the Apache Pig execution environment; they can be written in Pig Latin using built-in operators. There are three ways to execute a Pig script:

> Grunt Shell : This is Pig's interactive shell, provided to execute all Pig scripts.
> Script File : Write all the Pig commands in a script file and execute the Pig script file. This is executed by the Pig Server.
> Embedded Script : If some functionality is unavailable in the built-in operators, we can programmatically create User Defined Functions in other languages like Java, Python or Ruby to bring in that functionality, embed them in the Pig Latin script file, and then execute that script file.

Parser

After passing through the Grunt shell or Pig Server, Pig scripts are passed to the Parser. The Parser does type checking and checks the syntax of the script. The Parser outputs a DAG (directed acyclic graph). The DAG represents the Pig Latin statements and logical operators: the logical operators are represented as the nodes, and the data flows are represented as the edges.

Optimizer

The DAG is then submitted to the Optimizer. The Optimizer performs optimization activities such as split, merge, transform and reorder of operators. This provides the automatic optimization feature of Apache Pig. The Optimizer basically aims to reduce the amount of data in the pipeline at any instant of time while processing the extracted data, and for that it performs functions like:

PushUpFilter : If there are multiple conditions in a filter and the filter can be split, Pig splits the conditions and pushes up each condition separately, so that data is discarded as early as possible.

The optimized logical plan is then compiled into a series of MapReduce jobs.

Execution Engine

Finally, as shown in the figure, these MapReduce jobs are submitted for execution to the execution engine. The MapReduce jobs are executed and give the required result.
The results can be displayed on the screen using the DUMP statement and can be stored in HDFS using the STORE statement.

Q19. How to install Apache Pig? Explain with steps.

Ans :

Pig Installation

The prerequisites for Apache Pig installation are Java and Hadoop installed on the system. Download the latest version of Pig from http://www.apache.org/dyn/closer.cgi/pig.

Unzip the downloaded file. The Pig script is located in the bin directory (/pig-n.n.n/bin/pig). Add /pig-n.n.n/bin to your path using export (bash, sh, ksh) or setenv (tcsh, csh). For example:

$ export PATH=/<my-path-to-pig>/pig-n.n.n/bin:$PATH

To get the list of Pig commands, write:

$ pig -help

To start the Grunt shell, write the below command:

$ pig

3.3.2 Running Pig

Q20. Write about the various execution modes of Pig.

Ans :

Pig Run Modes

Pig executes in two modes: Local mode and MapReduce mode.

Local Mode

It executes in a single JVM and is used for development, experimenting and prototyping. Local mode works on the local file system. Command for the local-mode Grunt shell:

$ pig -x local

MapReduce Mode

The MapReduce mode is also known as Hadoop mode. In this mode, Pig renders Pig Latin into MapReduce jobs and executes them on the cluster. It can be executed against a semi-distributed or fully distributed Hadoop installation. Command for MapReduce mode:

$ pig
or
$ pig -x mapreduce

Apache Pig Execution Mechanisms

Apache Pig scripts can be executed in three ways, namely interactive mode, batch mode, and embedded mode.

> Interactive Mode (Grunt shell) - You can run Apache Pig in interactive mode using the Grunt shell. In this shell, you can enter the Pig Latin statements and get the output (using the Dump operator).
> Batch Mode (Script) - You can run Apache Pig in batch mode by writing the Pig Latin script in a single file with the .pig extension.
> Embedded Mode (UDF) - Apache Pig provides the provision of defining our own functions (User Defined Functions) in programming languages such as Java, and using them in our script.

Invoking the Grunt Shell

You can invoke the Grunt shell in a desired mode (local/MapReduce) using the -x option as shown below.

Local mode     : $ pig -x local
MapReduce mode : $ pig -x mapreduce

In both cases Pig logs where error messages will be written and reports the file system it connects to (file:/// in local mode, an HDFS URI in MapReduce mode), and then presents the grunt> prompt.

Executing a script file: suppose there is a file named Sample_script.pig containing:

student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING PigStorage(',') AS (id:int, name:chararray, city:chararray);
DUMP student;

Now, you can execute the script in the above file as shown below.

Local mode     : $ pig -x local Sample_script.pig
MapReduce mode : $ pig -x mapreduce Sample_script.pig

3.3.3 Getting Started with Pig Latin

Q21. Explain the Pig application flow.

Ans :

At its core, Pig Latin is a dataflow language, where we define a data stream and a series of transformations that are applied to the data as it flows through our application. This is in contrast to a control flow language (like C or Java), where we write a series of instructions.
In a control flow language, we use constructs like loops and conditional logic (like an if statement). We won't find loops and if statements in Pig Latin. Here is the shape of a simple Pig script:

A = LOAD 'mindstick_file.txt';
B = GROUP ...;
C = FILTER ...;
DUMP B;
STORE C INTO 'Result';

Load : We first load (LOAD) the data we want to manipulate. As in a typical MapReduce job, that data is stored in HDFS. For a Pig program to access the data, we first tell Pig what file or files to use. For that task we use the LOAD 'data_file' command. Here, 'mindstick_file.txt' can specify either an HDFS data file or an HDFS directory. If a directory is specified, every file located in that directory is loaded into the program. If the data is stored in a file format that isn't natively accessible to Pig, we can optionally add the USING function to the LOAD statement to specify a user-defined function that can read in (and interpret) the data.

Transform : We run the data through a set of transformations that, way under the hood and far removed from anything we have to concern ourselves with, are translated into a set of Map and Reduce tasks. The transformation logic is where all the data manipulation happens. Here we can FILTER out rows that aren't of interest, JOIN two sets of data files, GROUP data to build aggregations, ORDER results, and do much more.

Dump : Finally, we dump (DUMP) the results to the screen or store (STORE) the results in a file somewhere.

Q22. What are statements in Pig Latin?

Ans :

While processing data using Pig Latin, statements are the basic constructs.

> These statements work with relations. They include expressions and schemas.
> Every statement ends with a semicolon (;).
> We perform various operations using operators provided by Pig Latin, through statements.
> Except for LOAD and STORE, while performing all other operations, Pig Latin statements take a relation as input and produce another relation as output.
> As soon as you enter a Load statement in the Grunt shell, its semantic checking is carried out. To see the contents of the schema, you need to use the Dump operator. Only after performing the dump operation is the MapReduce job for loading the data into the file system carried out.

Example

Given below is a Pig Latin statement which loads data into Apache Pig:

grunt> Student_data = LOAD 'student_data.txt' USING PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);

Q23. Explain the data types used in Pig Latin.

Ans :

Simple types

1.  int        - represents a signed 32-bit integer. Example: 8
2.  long       - represents a signed 64-bit integer. Example: 5L
3.  float      - represents a signed 32-bit floating point. Example: 5.5F
4.  double     - represents a 64-bit floating point. Example: 10.5
5.  chararray  - represents a character array (string) in Unicode UTF-8 format. Example: 'hadoop'
6.  bytearray  - represents a byte array (blob).
7.  boolean    - represents a Boolean value. Example: true / false
8.  datetime   - represents a date-time. Example: 1970-01-01T00:00:00.000+00:00
9.  biginteger - represents a Java BigInteger. Example: 60708090709
10. bigdecimal - represents a Java BigDecimal. Example: 185.98376256272893883

Complex types

11. tuple - an ordered set of fields. Example: (raja, 30)
12. bag   - a collection of tuples. Example: {(raju,30),(Mohhammad,48)}
13. map   - a set of key-value pairs. Example: ['name'#'Raju', 'age'#30]

3.3.4 Working with Operators in Pig

Q24. Explain various operators in Pig Latin.

Ans :

Pig Latin - Arithmetic Operators

The following table describes the arithmetic operators of Pig Latin.
Suppose a = 10 and b = 20.

+   Addition - adds the values on either side of the operator; a + b gives 30.
-   Subtraction - subtracts the right-hand operand from the left-hand operand; a - b gives -10.
*   Multiplication - multiplies the values on either side of the operator; a * b gives 200.
/   Division - divides the left-hand operand by the right-hand operand; b / a gives 2.
%   Modulus - divides the left-hand operand by the right-hand operand and returns the remainder; b % a gives 0.
? : Bincond - evaluates a Boolean expression; it has three operands: variable x = (expression) ? value1 if true : value2 if false. Example: b = (a == 1) ? 20 : 30; if a = 1 the value of b is 20, if a != 1 the value of b is 30.
CASE WHEN THEN ELSE END  Case - the case operator is equivalent to a nested bincond operator. Example: CASE f2 % 2 WHEN 0 THEN 'even' WHEN 1 THEN 'odd' END

Pig Latin - Comparison Operators

The following table describes the comparison operators of Pig Latin.

==       Equal - checks whether the values of the two operands are equal; if yes, the condition becomes true. (a == b) is not true.
!=       Not equal - checks whether the values of the two operands are unequal; if they are not equal, the condition becomes true. (a != b) is true.
>        Greater than - checks whether the value of the left operand is greater than the value of the right operand. (a > b) is not true.
<        Less than - checks whether the value of the left operand is less than the value of the right operand. (a < b) is true.
>=       Greater than or equal to - checks whether the value of the left operand is greater than or equal to the value of the right operand. (a >= b) is not true.
<=       Less than or equal to - checks whether the value of the left operand is less than or equal to the value of the right operand. (a <= b) is true.
matches  Pattern matching - checks whether the string on the left-hand side matches the regular expression constant on the right-hand side. f1 matches '.*tutorial.*'

Pig Latin - Type Construction Operators

()  Tuple constructor operator - used to construct a tuple.
{}  Bag constructor operator - used to construct a bag.
[]  Map constructor operator - used to construct a map.

Pig Latin - Relational Operators

Loading and Storing
LOAD     To load data from a file system (local/HDFS) into a relation.
STORE    To save a relation to a file system (local/HDFS).

Filtering
FILTER            To remove unwanted rows from a relation.
DISTINCT          To remove duplicate rows from a relation.
FOREACH, GENERATE To generate data transformations based on columns of data.
STREAM            To transform a relation using an external program.

Grouping and Joining
JOIN     To join two or more relations.
COGROUP  To group the data in two or more relations.
GROUP    To group the data in a single relation.
CROSS    To create the cross product of two or more relations.

Sorting
ORDER    To arrange a relation in sorted order based on one or more fields (ascending or descending).
LIMIT    To get a limited number of tuples from a relation.

Combining and Splitting
UNION    To combine two or more relations into a single relation.
SPLIT    To split a single relation into two or more relations.

Diagnostic Operators
DUMP        To print the contents of a relation on the console.
DESCRIBE    To describe the schema of a relation.
EXPLAIN     To view the logical, physical, or MapReduce execution plans used to compute a relation.
ILLUSTRATE  To view the step-by-step execution of a series of statements.

COGROUP

COGROUP is the same as GROUP. For readability, programmers usually use GROUP when only one relation is involved and COGROUP when multiple relations are involved. See GROUP for more information.
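The sorting operators are not demonstrated later in this unit, so here is a hedged Pig Latin sketch of ORDER and LIMIT; the file name and schema are carried over from the Q22 example.

students = LOAD 'student_data.txt' USING PigStorage(',')
           AS (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);

-- ORDER: sort the relation by a field, descending
sorted = ORDER students BY id DESC;

-- LIMIT: keep only the first 3 tuples of the sorted relation
top3 = LIMIT sorted 3;

DUMP top3;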
CROSS

Computes the cross product of two or more relations.

Syntax

alias = CROSS alias, alias [, alias ...] [PARALLEL n];

Example

Suppose we have relations A and B.

A = LOAD 'data1' AS (a1:int, a2:int, a3:int);

DUMP A;
(1,2,3)
(4,2,1)

B = LOAD 'data2' AS (b1:int, b2:int);

DUMP B;
(2,4)
(8,9)
(1,3)

In this example the cross product of relations A and B is computed.

X = CROSS A, B;

DUMP X;
(1,2,3,2,4)
(1,2,3,8,9)
(1,2,3,1,3)
(4,2,1,2,4)
(4,2,1,8,9)
(4,2,1,1,3)

DISTINCT

Removes duplicate tuples in a relation.

Syntax

alias = DISTINCT alias [PARALLEL n];

Example

Suppose we have relation A:

A = LOAD 'data' AS (a1:int, a2:int, a3:int);
X = DISTINCT A;

X contains the tuples of A with any duplicates removed.

FILTER

Selects tuples from a relation based on some condition.

Syntax

alias = FILTER alias BY expression;

Examples

Suppose we have relation A.

A = LOAD 'data' AS (f1:int, f2:int, f3:int);

DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)

In this example the condition states that if the third field equals 3, then include the tuple in relation X.

X = FILTER A BY f3 == 3;

DUMP X;
(1,2,3)
(4,3,3)
(8,4,3)

In this example the condition states that if the first field equals 8, or if the sum of fields f2 and f3 is not greater than the first field, then include the tuple in relation X.

X = FILTER A BY (f1 == 8) OR (NOT (f2 + f3 > f1));

DUMP X;
(4,2,1)
(8,3,4)
(7,2,5)
(8,4,3)

FOREACH

Generates data transformations based on columns of data.

Syntax

alias = FOREACH { gen_blk | nested_gen_blk } [AS schema];

Examples

Suppose we have relations A, B, and C (see the GROUP operator for information about the field names in relation C).

A = LOAD 'data1' AS (a1:int, a2:int, a3:int);

DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)

B = LOAD 'data2' AS (b1:int, b2:int);

DUMP B;
(2,4)
(8,9)
(1,3)
(2,7)
(2,9)
(4,6)
(4,9)

In this example all the fields of relation A are projected to form relation X:

X = FOREACH A GENERATE *;

DUMP X;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)

In this example two fields from relation A are projected to form relation X:

X = FOREACH A GENERATE a1, a2;

DUMP X;
(1,2)
(4,2)
(8,3)
(4,3)
(7,2)
(8,4)

Example: Nested Projection

If one of the fields in the input relation is a tuple, bag or map, we can perform a projection on that field (using the dereference operator):

X = FOREACH C GENERATE group, B.b2;

DUMP X;
(1,{(3)})
(4,{(6),(9)})
(8,{(9)})

In this example multiple nested columns are retained:

X = FOREACH C GENERATE group, A.(a1, a2);

DUMP X;
(1,{(1,2)})
(4,{(4,2),(4,3)})
(8,{(8,3),(8,4)})

GROUP

Groups the data in one or multiple relations. GROUP is the same as COGROUP. For readability, programmers usually use GROUP when only one relation is involved and COGROUP when multiple relations are involved.

Syntax

alias = GROUP alias { ALL | BY expression } [, alias ALL | BY expression ...] [USING 'collected'] [PARALLEL n];

Usage

The GROUP operator groups together tuples that have the same group key (key field). The key field will be a tuple if the group key has more than one field; otherwise it will be of the same type as the group key. The result of a GROUP operation is a relation that includes one tuple per group. This tuple contains two fields:

> The first field is named "group" (do not confuse this with the GROUP operator) and is of the same type as the group key.
> The second field takes the name of the original relation and is of type bag.
To tackle this, developers run pig scripts on sample data but there is possibility that the sample data selected, might not execute your pig script properly. For instance, if the script has 2 join operator there should be at least a few records in the sample data that have thesame key, dtherwise Rahul Publications =} po ny results, To : sion wil not return 2 y results. 4p opera gues, ustrte used strate 1 pase ind Ot he data and whenever itcomes coin oF fitter that remove data, “me records passthrough and es My making modifications to the ist, do not oy meet the condition, ilustrate mo nob game © it the “obs put ofeach Sage but doesnot - ws a7 erot Handling in PIG si . gaplain error handling in gst hes (oe ple rom fae ors ke 908 at onl at affects the entire processing but can succeed on retry. An example of such a failure is the inability to open a lookup file because the file could not be found. This could 2 temporary environmental issue that can $0 away on retry. A UDF can signal this to Pig q throwing an IOException as with the case the ABS function below. a sed that affects the entire processing and nah a Sowa on rey, An example of i is the inability to open a lookup use of file permission problems. Pig BIG DATA ANALYTICS currently does not have a way to handle this case, Hadoop does not have a way to handle this case either. Pig provides a helper class. WrappedIOException. Proposal The proposal is to add a ONERROR keyword (“on error”, following the existing naming conventions : = < PIG statement...> ONERROR [SPLIT INTO ... Usage lems Rahul Publications
