BDA Unit 5 Notes
Unit 5 : Hadoop Related Tools

Syllabus
HBase - data model and implementations - HBase clients - HBase examples - praxis. Pig - Grunt - Pig data model - Pig Latin - developing and testing Pig Latin scripts. Hive - data types and file formats - HiveQL data definition - HiveQL data manipulation - HiveQL queries.

Contents
5.1 HBase
5.2 Data Model and Implementations
5.3 HBase Clients
5.4 Praxis
5.5 Pig
5.6 Hive
5.7 HiveQL Data Definition
5.8 HiveQL Data Manipulation
5.9 HiveQL Queries
5.10 Two Marks Questions with Answers

5.1 HBase

- HBase is an open source, non-relational, distributed database modeled after Google's BigTable and built on top of Hadoop. It is column oriented and horizontally scalable.
- It is a part of the Hadoop ecosystem that provides random, real-time read/write access to data in the Hadoop file system. It runs on top of Hadoop and HDFS, providing BigTable-like capabilities for Hadoop.
- HBase supports massively parallelized processing via MapReduce, with HBase usable as both source and sink. HBase supports an easy-to-use Java API for programmatic access. It also supports Thrift and REST for non-Java front-ends.
- HBase is a column oriented distributed database in the Hadoop environment. It can store massive amounts of data, from terabytes to petabytes. HBase is scalable, distributed big data storage on top of the Hadoop ecosystem.
- The HBase physical architecture consists of servers in a master-slave relationship. Typically, the HBase cluster has one master node, called HMaster, and multiple region servers, called HRegionServer. Fig. 5.1.1 shows the HBase architecture.

Fig. 5.1.1 HBase architecture : clients and Zookeeper at the top, several region servers each hosting a number of regions, with HDFS as the underlying storage.

- Zookeeper is a centralized monitoring server which maintains configuration information and provides distributed synchronization. If a client wants to communicate with region servers, the client has to approach Zookeeper first.
- HMaster is the master server of HBase and it coordinates the HBase cluster. HMaster is responsible for the administrative operations of the cluster.
- HRegion servers : They perform the following functions in communication with HMaster and Zookeeper :
  1. Hosting and managing regions.
  2. Splitting regions automatically.
  3. Handling read and write requests.
  4. Communicating with clients directly.
- HRegions : For each column family, an HRegion maintains a store. The main components of HRegions are the MemStore and HFiles.
- The data model in HBase is designed to accommodate semi-structured data that could vary in field size, data type and columns.
- HBase is a column-oriented, non-relational database. This means that data is stored in individual columns and indexed by a unique row key. This architecture allows rapid retrieval of individual rows and columns and efficient scans over individual columns within a table.
- Both data and requests are distributed across all servers in an HBase cluster, allowing users to query results on petabytes of data within milliseconds.
- HBase is most effectively used to store non-relational data, accessed via the HBase API.
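As a quick illustration of this random read/write access pattern, the following HBase shell sketch creates a small table and reads rows back; the table name 'employee' and the 'personal' column family are invented for the example.

  # start the shell with : hbase shell
  create 'employee', 'personal'                       # table with one column family
  put 'employee', 'row1', 'personal:name', 'Asha'     # write a cell
  put 'employee', 'row1', 'personal:city', 'Pune'
  get 'employee', 'row1'                              # random read of a single row
  scan 'employee'                                     # scan over the whole table
  disable 'employee'
  drop 'employee'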
Features and Applications of HBase

Features of HBase :
1. HBase is linearly scalable.
2. It has automatic failure support.
3. It provides consistent reads and writes.
4. It integrates with Hadoop, both as a source and a destination.
5. It has an easy Java API for clients.
6. It provides data replication across clusters.

Where to use HBase :
- Apache HBase is used to have random, real-time read/write access to Big Data.
- It hosts very large tables on top of clusters of commodity hardware.
- Apache HBase is a non-relational database modeled after Google's Bigtable. Bigtable acts upon the Google File System; likewise, Apache HBase works on top of Hadoop and HDFS.

Applications of HBase :
1. It is used whenever there is a need for write-heavy applications.
2. HBase is used whenever we need to provide fast random access to available data.
3. Companies such as Facebook, Twitter, Yahoo and Adobe use HBase internally.

Difference between HDFS and HBase :
1. HDFS is a distributed file system suitable for storing large files. HBase is a database built on top of HDFS.
2. HDFS does not support fast individual record lookups. HBase provides fast lookups for larger tables.
3. HDFS provides high latency batch processing. HBase provides low latency access to single rows from billions of records (random access).
4. HDFS provides only sequential access to data. HBase internally uses hash tables, provides random access and stores the data in indexed HDFS files for faster lookups.
5. HDFS is suited for high latency operations. HBase is suited for low latency operations.
6. In HDFS, data is primarily accessed through MapReduce jobs. HBase data is accessed through shell commands, a client API in Java, REST, Avro or Thrift.
7. HDFS does not have the concept of random read and write operations. HBase supports random reads and writes of single rows.

Difference between HBase and Relational Database :
1. HBase is schema-less. A relational database is based on a fixed schema.
2. HBase is a column-oriented datastore. A relational database is a row-oriented datastore.
3. HBase is designed to store denormalized data. A relational database is designed to store normalized data.
4. HBase contains wide and sparsely populated tables. A relational database contains thin tables.
5. HBase supports automatic partitioning. A relational database has no built-in support for partitioning.
6. HBase is good for semi-structured as well as structured data. A relational database is good for structured data.
7. HBase is not transactional. An RDBMS is transactional.

Limitations of HBase :
- It takes a very long time to recover if the HMaster goes down, and a long time to activate another node if the first node goes down.
- In HBase, cross-data operations and join operations are very difficult to perform.
- HBase needs a new format when we want to migrate data from RDBMS external sources to HBase servers.
- It is very challenging in HBase to support the querying process.
- It takes enormous time to develop security features to grant access to users.
- HBase allows only one default sort per table and it does not support large binary files.
- HBase is expensive in terms of hardware requirements and memory block allocations.
5.2 Data Model and Implementations

- The Apache HBase data model is designed to accommodate structured or semi-structured data that could vary in field size, data type and columns. HBase stores data in tables, which have rows and columns, but the table schema is very different from traditional relational database tables.
- A database consists of multiple tables. Each table consists of multiple rows, sorted by row key. Each row contains a row key and one or more column families. Each column family is defined when the table is created and can contain multiple columns (family:column).
- A cell is uniquely identified by (table, row, family:column). A cell contains an uninterpreted array of bytes and a timestamp.
- The HBase data model has the following logical components :
  1. Tables
  2. Rows
  3. Column families / columns
  4. Versions / timestamps
  5. Cells
- Tables : HBase tables are logical collections of rows stored in separate partitions called regions. Every region is served by exactly one region server.
- The syntax to create a table in the HBase shell is shown below.
  create '<table name>', '<column family>'
  Example : create 'CustomerContactInformation', 'CustomerName', 'ContactInfo'
- Tables are automatically partitioned horizontally by HBase into regions. Each region comprises a subset of a table's rows. A region is denoted by the table it belongs to. Fig. 5.2.1 shows regions within tables.

Fig. 5.2.1 Region with table : several region servers, each hosting regions that belong to different tables (for example Table A Region 1, Table A Region 2, Table G Region 1070, and so on).

- There is one region server per node and many regions in a region server. At any time, a given region is pinned to a particular region server. Tables are split into regions and scattered across region servers. A table must have at least one region.
- Rows : A row is one instance of data in a table and is identified by a rowkey. Rowkeys are unique in a table and are always treated as a byte[].
- Column families : Data in a row are grouped together as column families. Each column family has one or more columns, and the columns in a family are stored together in a low level storage file known as an HFile. Column families form the basic unit of physical storage to which certain HBase features, like compression, are applied.
- Columns : A column family is made of one or more columns. A column is identified by a column qualifier that consists of the column family name concatenated with the column name using a colon, for example columnfamily:columnname. There can be multiple columns within a column family, and rows within a table can have a varied number of columns.
- Cell : A cell stores data and is essentially a unique combination of rowkey, column family and column (column qualifier). The data stored in a cell is called its value and its data type is always treated as byte[].
- Version : The data stored in a cell is versioned, and versions of data are identified by the timestamp. The number of versions of data retained in a column family is configurable; this value is 3 by default.
- Time-to-Live : TTL is a built-in feature of HBase that ages out data based on its timestamp. It comes in handy in use cases where data needs to be held only for a certain duration of time. If, during a major compaction, a record's timestamp is older than the specified TTL, the record does not get put into the HFile being generated by the major compaction; that is, the older records are removed as a part of the normal upkeep of the table. If TTL is not used and an aging requirement is still needed, a much more I/O intensive operation would need to be done.
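The versions and TTL behaviour described above are set per column family. A minimal HBase shell sketch (table and family names are invented for illustration) :

  # family 'd' keeps 3 versions of each cell and expires data after 7 days (604800 s)
  create 'metrics', {NAME => 'd', VERSIONS => 3, TTL => 604800}

  put 'metrics', 'sensor1', 'd:temp', '21.5'
  put 'metrics', 'sensor1', 'd:temp', '22.0'

  # ask for multiple versions of the same cell
  get 'metrics', 'sensor1', {COLUMN => 'd:temp', VERSIONS => 3}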
5.3 HBase Clients

- There are a number of client options for interacting with an HBase cluster.

1. Java
- HBase is written in Java, so the primary client interface is the Java API.
- Example : Creating a table and inserting data into an HBase table are shown in the following program.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class ExampleClient {
  public static void main(String[] args) throws IOException {
    Configuration config = HBaseConfiguration.create();
    // Create a table named "test" with a single column family named "data"
    HBaseAdmin admin = new HBaseAdmin(config);
    HTableDescriptor htd = new HTableDescriptor("test");
    HColumnDescriptor hcd = new HColumnDescriptor("data");
    htd.addFamily(hcd);
    admin.createTable(htd);
    byte[] tablename = htd.getName();

    // Run some operations -- a put
    HTable table = new HTable(config, tablename);
    byte[] row1 = Bytes.toBytes("row1");
    Put p1 = new Put(row1);
    byte[] databytes = Bytes.toBytes("data");
    p1.add(databytes, Bytes.toBytes("FN"), Bytes.toBytes("value1"));
    table.put(p1);
  }
}

- To create a table, we need to first create an instance of HBaseAdmin and then ask it to create the table named test with a single column family named data.
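For completeness, a minimal sketch of reading the data back with the same (pre-1.0) client API; it assumes the table handle from the program above, and the Get, Result, Scan and ResultScanner classes from org.apache.hadoop.hbase.client :

    // Random read of a single row
    Get g = new Get(Bytes.toBytes("row1"));
    Result result = table.get(g);
    byte[] value = result.getValue(Bytes.toBytes("data"), Bytes.toBytes("FN"));
    System.out.println("FN = " + Bytes.toString(value));

    // Scan the whole table
    Scan scan = new Scan();
    ResultScanner scanner = table.getScanner(scan);
    for (Result r : scanner) {
      System.out.println(r);
    }
    scanner.close();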
2. MapReduce
- HBase classes and utilities in the org.apache.hadoop.hbase.mapreduce package facilitate using HBase as a source and/or sink in MapReduce jobs. The TableInputFormat class makes splits on region boundaries, so maps are handed a single region to work on. TableOutputFormat will write the results of a MapReduce job into HBase.
- Example : A MapReduce application to count the number of rows in an HBase table.

public class RowCounter {
  /** Name of this 'program'. */
  static final String NAME = "rowcounter";

  static class RowCounterMapper
      extends TableMapper<ImmutableBytesWritable, Result> {
    /** Counter enumeration to count the actual rows. */
    public static enum Counters { ROWS }

    @Override
    public void map(ImmutableBytesWritable row, Result values, Context context)
        throws IOException {
      for (KeyValue value : values.list()) {
        if (value.getValue().length > 0) {
          context.getCounter(Counters.ROWS).increment(1);
          break;
        }
      }
    }
  }

  public static Job createSubmittableJob(Configuration conf, String[] args)
      throws IOException {
    String tableName = args[0];
    Job job = new Job(conf, NAME + "_" + tableName);
    job.setJarByClass(RowCounter.class);
    // Columns are space delimited
    StringBuilder sb = new StringBuilder();
    final int columnoffset = 1;
    for (int i = columnoffset; i < args.length; i++) {
      if (i > columnoffset) {
        sb.append(" ");
      }
      sb.append(args[i]);
    }
    Scan scan = new Scan();
    scan.setFilter(new FirstKeyOnlyFilter());
    if (sb.length() > 0) {
      for (String columnName : sb.toString().split(" ")) {
        String[] fields = columnName.split(":");
        if (fields.length == 1) {
          scan.addFamily(Bytes.toBytes(fields[0]));
        } else {
          scan.addColumn(Bytes.toBytes(fields[0]), Bytes.toBytes(fields[1]));
        }
      }
    }
    // First argument is the table name.
    job.setOutputFormatClass(NullOutputFormat.class);
    TableMapReduceUtil.initTableMapperJob(tableName, scan,
        RowCounterMapper.class, ImmutableBytesWritable.class, Result.class, job);
    job.setNumReduceTasks(0);
    return job;
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length < 1) {
      System.err.println("ERROR: Wrong number of parameters: " + args.length);
      System.err.println("Usage: RowCounter <tablename> [<column1> <column2>...]");
      System.exit(-1);
    }
    Job job = createSubmittableJob(conf, otherArgs);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

3. Avro, REST and Thrift
- HBase ships with Avro, REST and Thrift interfaces. These are useful when the interacting application is written in a language other than Java. In all cases, a Java server hosts an instance of the HBase client, brokering Avro, REST and Thrift requests in and out of the HBase cluster. This extra work of proxying requests and responses means these interfaces are slower than using the Java client directly.
- REST : To put up a Stargate instance (the HBase REST gateway), start it using the following command :
  % hbase-daemon.sh start rest
  This starts a server instance, by default on port 8080, backgrounds it and catches any emissions by the server in logfiles under the HBase logs directory. Clients can ask for the response to be formatted as JSON, Google's protobufs, or XML, depending on how the client HTTP Accept header is set. To stop the REST server, type :
  % hbase-daemon.sh stop rest
- Thrift : Similarly, a Thrift service is started with hbase-daemon.sh start thrift and stopped with hbase-daemon.sh stop thrift. It runs by default on port 9090, is backgrounded, and writes its logs under the HBase logs directory. Thrift clients are built from classes generated with the Thrift compiler.
- Avro : The Avro server is started and stopped in the same manner as we start and stop the Thrift or REST services. The Avro server by default uses port 9090.
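As a rough illustration of the REST gateway, any HTTP-capable client can then issue requests like the following; the port follows the default mentioned above, and the exact resource paths can differ between HBase versions, so treat this as a sketch :

  # version information from the REST gateway (default port 8080)
  curl http://localhost:8080/version

  # read row1 of table 'test', column data:FN, formatted as JSON
  curl -H "Accept: application/json" http://localhost:8080/test/row1/data:FN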
5.4 Praxis

- When an HBase cluster is running under load, the following issues should be considered.
- Versions : A particular HBase version will run only on a Hadoop release with a matching minor version. For example, HBase 0.20.3 would run on Hadoop 0.20.2, but HBase 0.19.5 would not run on Hadoop 0.20.0.
- HDFS : In MapReduce, HDFS files are opened, their content streamed through a map task and then closed. In HBase, data files are opened on cluster startup and kept open. Because of this, HBase tends to see issues not normally encountered by MapReduce clients.
- Running out of file descriptors : Because of the files kept open on a loaded cluster, it does not take long before we run into system- and Hadoop-imposed limits. Each open file consumes at least one descriptor on the remote datanode. The default limit on the number of file descriptors per process is 1024; the limit in effect for the HBase process is noted in the region server's log.
- Running out of datanode threads : The Hadoop datanode has an upper bound of 256 on the number of threads it can run at any one time.
- Sync : We must run HBase on an HDFS that has a working sync; otherwise, there can be loss of data. This means running HBase on Hadoop 0.21.x, which adds a working sync/append to Hadoop 0.20.
- UI : HBase runs a web server on the master to present a view of the state of the running cluster. By default, it listens on port 60010. The master UI displays a list of basic attributes such as software versions, cluster load, request rates, lists of cluster tables and participating region servers.
- Schema design : HBase tables are like those in an RDBMS, except that cells are versioned, rows are sorted and columns can be added on the fly by the client, as long as the column family they belong to pre-exists.
- Joins : There is no native database join facility in HBase, but wide tables can make joins unnecessary, so that there is no need to pull in other tables : a wide row can sometimes be made to hold all data that pertains to a particular primary key.

5.5 Pig

- Pig is an open-source, high-level data flow system : a high-level platform for creating MapReduce programs used with Hadoop. It translates scripts into efficient sequences of one or more MapReduce jobs.
- Pig offers a high-level language to write data analysis programs, which we call Pig Latin. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
- Pig makes use of both the Hadoop Distributed File System and MapReduce.

Features of Pig on Hadoop :
1. Inbuilt operators : Apache Pig provides a very good set of operators for performing several data operations like sort, join, filter, etc.
2. Ease of programming.
3. Automatic optimization : The tasks in Apache Pig are automatically optimized.
4. Handles all kinds of data : Apache Pig can analyze both structured and unstructured data and store the results in HDFS.

- Fig. 5.5.1 shows the Pig architecture.
- Pig has two execution modes :
- Local mode : To run Pig in local mode, we need access to a single machine; all files are installed and run using the local host and file system. Specify local mode using the -x flag (pig -x local).
- MapReduce mode : To run Pig in MapReduce mode, we need access to a Hadoop cluster and an HDFS installation. MapReduce mode is the default mode, so we do not need to specify it with the -x flag.

Fig. 5.5.1 Pig architecture : Pig Latin scripts are submitted to Apache Pig, whose execution engine runs them as MapReduce jobs on Hadoop over HDFS.

- The Pig Hadoop framework has four main components :
1. Parser : When a Pig Latin script is sent to Pig, it is first handled by the parser. The parser is responsible for checking the syntax of the script, along with other miscellaneous checks. The parser gives an output in the form of a Directed Acyclic Graph (DAG) that contains the Pig Latin statements, together with other logical operators represented as nodes.
2. Optimizer : After the output from the parser is retrieved, the logical plan for the DAG is passed to a logical optimizer. The optimizer is responsible for carrying out the logical optimizations.
3. Compiler : The role of the compiler comes in when the output from the optimizer is received. The compiler compiles the logical plan sent by the optimizer; the logical plan is then converted into a series of MapReduce tasks or jobs.
4. Execution engine : After the logical plan is converted to MapReduce jobs, these jobs are sent to Hadoop in a properly sorted order and are executed on Hadoop to yield the desired result.
- Pig can run in two types of environments : the local environment in a single JVM, or the distributed environment on a Hadoop cluster.
- Pig has a variety of scalar data types and standard data processing options.
- Pig supports map data, a map being a set of key-value pairs. Most Pig operators take a relation as an input and give a relation as the output. Pig allows normal arithmetic operations and relational operations too.
- Pig's language layer currently consists of a textual language called Pig Latin. Pig Latin is a data flow language. This means it allows users to describe how data from one or more inputs should be read, processed and then stored to one or more outputs in parallel.
- These data flows can be simple linear flows, or complex workflows that include points where multiple inputs are joined and where data is split into multiple streams to be processed by different operators. To be mathematically precise, a Pig Latin script describes a directed acyclic graph (DAG), where the edges are data flows and the nodes are operators that process the data.
- The first step in a Pig program is to LOAD the data we want to manipulate from HDFS. Then we run the data through a set of transformations. Finally, we DUMP the data to the screen or STORE the results in a file somewhere.

Advantages of Pig :
1. Fast execution that works with MapReduce, Spark and Tez.
2. The ability to process almost any amount of data, regardless of size.
3. Strong documentation that helps new users learn Pig Latin.
4. Local and remote interoperability that lets professionals work from anywhere with a reliable connection.

Disadvantages of Pig :
1. Slow start-up and clean-up of MapReduce jobs.
2. Not suitable for interactive OLAP analytics.
3. Complex applications may require many user defined functions.

Pig Data Model

- With Pig, when the data is loaded the data model is specified. Any data that we load from the disk into Pig will have a specific schema and structure. Pig's data model is rich enough to manage most of what is thrown its way, such as table-like structures and nested hierarchical data structures.
- However, Pig data types can be divided into two groups in general terms : scalar types and complex types.
- Scalar types contain a single value, while complex types contain other values, such as tuples, bags and maps. In its data model, Pig Latin has these four types :
- Atom : An atom is any single value, for example a string such as 'Hadoop' or a number. The atomic values of Pig are the scalar types that appear in most programming languages : int, long, float, double, chararray and bytearray.
- Tuple : A tuple is a record formed by an ordered set of fields. Each field can be of any type, for example 'Hadoop' or 6. Think of a tuple as a row in a table.
- Bag : A bag is an unordered collection of tuples. The bag's schema is flexible : each tuple in the collection can contain an arbitrary number of fields, and each field can be of any type.
- Map : A map is a set of key-value pairs. The value can be of any type and the key needs to be unique. The key of a map must be a chararray and the value may be of any type.
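To make the model concrete, here is a small illustrative Pig Latin fragment; the file name and fields are invented for the example. Grouping a relation produces one tuple per group containing a bag of the original tuples, which DESCRIBE can show :

  -- load tuples with a declared schema
  emps = LOAD 'employees.txt' USING PigStorage(',')
         AS (name:chararray, dept:chararray, salary:float);

  -- GROUP yields one tuple per department holding a bag of employee tuples
  by_dept = GROUP emps BY dept;
  DESCRIBE by_dept;
  -- by_dept: {group: chararray, emps: {(name: chararray, dept: chararray, salary: float)}}

  DUMP by_dept;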
Pig Latin

- Pig Latin is the data flow language used by Apache Pig to analyze data in Hadoop. It is a textual language that abstracts the programming from the Java MapReduce idiom into a higher-level notation.
- Pig Latin statements are used to process the data. A statement is an operator that accepts a relation as an input and generates another relation as an output.
  a) A statement can span multiple lines.
  b) Each statement must end with a semicolon.
  c) A statement may include expressions and schemas.
  d) By default, statements are processed using multi-query execution.
- Pig Latin statements work with relations. A relation can be defined as follows :
  a) A relation is a bag (more specifically, an outer bag).
  b) A bag is a collection of tuples.
  c) A tuple is an ordered set of fields.
  d) A field is a piece of data.

Pig Latin datatypes :
1. int : represents a signed 32-bit integer. Example : 13
2. long : represents a signed 64-bit integer. Example : 13L
3. float : represents a signed 32-bit floating point number. Example : 130.5F
4. double : represents a 64-bit floating point number. Example : 13.5
5. chararray : represents a character array (string) in Unicode UTF-8 format. Example : 'Big Data'
6. bytearray : represents a byte array.
7. boolean : represents a Boolean value. Example : true/false.

Developing and Testing Pig Latin Scripts

- Pig provides several tools and diagnostic operators to help us develop applications. Scripts in Pig can be executed in interactive or batch mode. To use Pig in interactive mode, we invoke it in local or MapReduce mode and then enter commands one after the other. In batch mode, we save the commands in a .pig file and specify the path to the file when invoking Pig.
- At an overly simplified level, a Pig script consists of three steps, as sketched below. In the first step we load data from HDFS. In the second step we perform transformations on the data. In the final step we store the transformed data. Transformations are the heart of Pig scripts.
- Pig has a schema concept that is used when loading data to specify what the script should expect. We first specify the columns and, optionally, their data types. Any columns in the data that are not included in the schema are truncated; when there are fewer columns than those specified in the schema, the missing ones are filled with nulls.
- To load sample data sets, we first move them to HDFS and then load them into Pig from there.
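A minimal sketch of such a three-step script; the file paths, field names and filter condition are invented for illustration :

  -- high_earners.pig
  -- step 1 : load data from HDFS, declaring a schema
  emps = LOAD '/data/employees.txt' USING PigStorage(',')
         AS (name:chararray, dept:chararray, salary:float);

  -- step 2 : transform (filter and project)
  high  = FILTER emps BY salary > 50000.0F;
  names = FOREACH high GENERATE name, dept;

  -- step 3 : store the transformed data back to HDFS
  STORE names INTO '/output/high_earners' USING PigStorage(',');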
- Pig programs can be packaged in three different ways :
1. Script : This is nothing more than a file containing Pig Latin commands, identified by the .pig suffix. Ending a Pig program with the .pig extension is a convention but is not required. The commands are interpreted by the Pig Latin compiler and executed in the order determined by the Pig optimizer.
2. Grunt : Grunt acts as a command interpreter where we can enter Pig Latin interactively at the Grunt command line and immediately see the response. This method is useful for prototyping during the early development stage and for what-if scenarios.
3. Embedded : Pig Latin statements can run within Java, JavaScript and Python programs.
- Pig scripts, Grunt shell Pig commands and embedded Pig programs may be executed in either local mode or MapReduce mode. The Grunt shell provides an interactive shell to submit Pig commands and run Pig scripts. To start the shell in interactive mode, we submit the command pig at the shell.
- To tell the compiler whether a script or the Grunt shell should execute locally or in Hadoop mode, we specify it with the -x flag of the pig command. The following is an example of running a Pig script in local mode :
  pig -x local mindstick.pig
- Here is how we would run the Pig script in Hadoop mode, which is the default if we do not specify the flag :
  pig -x mapreduce mindstick.pig
- By default, when we run the pig command without any parameters, it starts the Grunt shell in Hadoop mode. If we want to start the Grunt shell in local mode, we just add the -x local flag to the command.

5.6 Hive

- Apache Hive is an open source data warehouse software for reading, writing and managing large data set files that are stored directly in either the Apache Hadoop Distributed File System (HDFS) or other data storage systems such as Apache HBase.
- Data analysts often use Hive to analyze data, query large amounts of unstructured data and generate data summaries.

Features of Hive :
1. It stores schema in a database and processes data in HDFS.
2. It is designed for OLAP.
3. It provides an SQL-type language for querying, called HiveQL or HQL.
4. It is familiar, fast, scalable and extensible.

- Hive supports a variety of storage formats : TEXTFILE for plain text, SEQUENCEFILE for binary key-value pairs and RCFILE, which stores the columns of a table in a record columnar format.
- A Hive table structure consists of rows and columns. The rows typically correspond to some record, transaction or particular entity, and the values of the corresponding columns represent the various attributes or characteristics of each row.
- Hadoop and its ecosystem are used to apply some structure to unstructured data. Therefore, if a table structure is an appropriate way to view the restructured data, Hive may be a good tool to use.
- Following are some Hive use cases :
1. Exploratory or ad-hoc analysis of HDFS data : data can be queried, transformed and exported to analytical tools.
2. Extracts or data feeds to reporting systems, dashboards or data repositories such as HBase.
3. Combining external structured data with data already residing in HDFS.

Advantages of Hive :
1. Simple querying for anyone already familiar with SQL.
2. The ability to connect with a variety of relational databases, including Postgres and MySQL.
3. It simplifies working with large amounts of data.

Disadvantages of Hive :
1. Updating data is complicated.
2. No real-time access to data.
3. High latency.
Program example : Write code in Java for a simple word count application that counts the number of occurrences of each word in a given input set, using the Hadoop MapReduce framework on a local standalone set-up.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
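A brief usage sketch for the program above. The commands follow the standard Hadoop MapReduce tutorial workflow; the input/output paths are placeholders, and HADOOP_CLASSPATH typically has to point at the JDK's tools.jar for the first command to work :

  export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
  hadoop com.sun.tools.javac.Main WordCount.java      # compile against the Hadoop classpath
  jar cf wc.jar WordCount*.class                      # package the classes
  hadoop jar wc.jar WordCount /user/input /user/output
  hadoop fs -cat /user/output/part-r-00000            # inspect the word counts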
Hive Architecture

- Fig. 5.6.1 shows the Hive architecture : user interfaces at the top, the HiveQL process engine and the metastore, the execution engine, and HDFS or HBase as the data storage layer.
- User interface : Hive is data warehouse infrastructure software that creates interaction between the user and HDFS. The user interfaces that Hive supports are the Hive Web UI, the Hive command line and Hive HDInsight.
- Metastore : Hive chooses respective database servers to store the schema or metadata of tables, databases, columns in a table, their data types and the HDFS mapping.
- HiveQL process engine : HiveQL is similar to SQL for querying on the schema information in the metastore. It is one of the replacements of the traditional approach for MapReduce programs. Instead of writing a MapReduce program in Java, we can write a query for the MapReduce job and process it.
- Execution engine : The conjunction of the HiveQL process engine and MapReduce is the Hive execution engine. The execution engine processes the query and generates the same results as MapReduce; it uses the flavor of MapReduce.
- HDFS or HBase : The Hadoop Distributed File System or HBase is the data storage technique used to store the data.

Working of Hive :
- Fig. 5.6.2 shows the working of Hive : the Hive interfaces and driver interact with the compiler and metastore, and the execution engine runs MapReduce jobs through the JobTracker and TaskTrackers over HDFS data nodes.
1. Execute query : The Hive interface, such as the command line or Web UI, sends a query to the driver to execute.
2. Get plan : The driver takes the help of the query compiler, which parses the query to check the syntax and the query plan, i.e. the requirements of the query.
3. Get metadata : The compiler sends a metadata request to the metastore.
4. Send metadata : The metastore sends the metadata as a response to the compiler.
5. Send plan : The compiler checks the requirements and resends the plan to the driver. Up to here, the parsing and compiling of the query is complete.
6. Execute plan : The driver sends the execution plan to the execution engine.
7. Execute job : Internally, the execution of the job is a MapReduce job. The execution engine sends the job to the JobTracker, which is in the Name node, and it assigns this job to the TaskTracker, which is in the Data node. Here, the query executes as a MapReduce job.
7.1 Metadata ops : Meanwhile, during execution, the execution engine can execute metadata operations with the metastore.
8. Fetch result : The execution engine receives the results from the data nodes.
9. Send results : The execution engine sends those resultant values to the driver.
10. Send results : The driver sends the results to the Hive interfaces.
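The "get plan" and compile steps above can be observed with Hive's EXPLAIN statement, which prints the plan the compiler produced instead of running the job. A small illustrative sketch (the employees table is hypothetical) :

  -- show the plan Hive's compiler builds for a query, without executing it
  EXPLAIN
  SELECT dept, COUNT(*) AS emp_count
  FROM employees
  GROUP BY dept;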
Sequence files are in the binary format which can be split and the main use of these files is to club two or more smaller files and make them as a one sequence file. In Hive we can create a sequence file by specifying STORED AS SEQUENCEFILE in the end of a CREATE TABLE statement. RCFILE stands of Record Columnar File which is another type of binary file format which offers high compression rate on the top of the rows. RCFILE is used when we want to perform operations on multiple rows at a time. RCFILEs are flat files consisting of binary key/value pairs. © Facebook uses RCFILE as its default file format for storing of data in their data warehouse as they perform different types of analytics using Hive. ORCFILE : ORC stands for Optimized Row Columnar which means it can store data in an optimized way than the other file formats. ORC reduces the size of the original data up to 75 %. An ORC file contains rows data in groups called as Stripes along with a file footer. ORC format improves the performance when Hive processing the data. HiveQL Data Defin HiveQL is the Hive query language. Hive offers no support for row level inserts, updates and deletes. Hive doesn't support transactions. DDL statements are used to define or change Hive databases and database objects TECHNICAL PUBLICATIONS® - an up-thrust for knowledge a2 Hadoop Related Tools pig ontt anon os of Hive DDL commands are : CREATE, SHOW, DESCRIBE, USE, DROP, . TER ‘and TRUNCATE. Hive ppL commands . ooo 7 5 d Use with | DDL comman | Database, table “CREATE Databases, tables, Table properties, Partitions, Functions, Index Database, Table, view Database od soled Database, Table ALTER hee | SRUNCATE _ | | i | | | Hive database : In Hive, the database is considered as a catalog or namespace of tables, It is also common to use databases to organize production tables into groups. If we do not specify a database, the default database is used. logical Let's create a new database by using the following command : hivoS GREATE DATABASE Rollcall; Make sure the database we are creating doesn't exist on: Hive warehouse, if exists it throws Database Rollcall already exists error. At any time, we can see the databases that already exist as follows = hive> SHOW DATABASES; default Rollcall hive> CREATE DATABASE student; hive> SHOW DATABASES; default Rollcall student + Hive will create a directory for each database. Tables in that database will be sored in subdirectories of the database directory. The exception is tables in the lefault database, which doesn't have its own directory. * Drop Database Statement : Syntax: sas ; deere DATABASE StatementDROP (DATABASE |SCHEMA) [IF EXISTS) ame [RESTRICT | CASCADE]; TECHNICAL PUBLICATIONS® - an up-thrust for knowledge Sig Data Analytics 5-246 Hadoop Related Tool Example : hive> DROP DATABASE IF EXISTS userid; * ALTER DATABASE : The ALTER DATABASE statement in Hive is used to change the metadata associated with the database in Hive. Syntax for changing Database Properties : ALTER (DATABASE |SCHEMA) database_name SET DBPROPERTIES (property_name=property value, ...); [EQ] Hivea pata Manipulation * Data manipulation language is a subset of SQL statements that modify the data stored in tables. Hive has no row - level insert, update and delete operations, the only way to put data into an table is to use one of the "bulk" load operations. Inserting data into tables from queries : * The INSERT statement perform loading data into a table from a query. 
INSERT OVERWRITE TABLE students PARTITION (branch = 'CSE', classe = 'OR') SELECT * FROM college_students se WHERE se.bra = 'CSE' AND se.cla = OR! + With OVERWRITE, any previous contents of the partition are replaced. If we drop the keyword OVERWRITE or replace it with INTO, Hive appends the data rather than replaces it. This feature is only available in Hive v0.8.0 or later. * We can mix INSERT OVERWRITE clauses and INSERT INTO clauses, as well. Dynamic partition inserts : * Hive also supports a dynamic partition feature, where it can infer the partitions to create based on query parameters. Hive determines the values of the partition keys, from the last two columns in the SELECT clause. * The static partition keys must come before the dynamic partition keys. Dynamic partitioning is not enabled by default. When it is enabled, it works in “strict” mode by default, where it expects at least some columns to be static. This helps protect against a badly designed query that generates a gigantic number of partitions. ¢ Hive Data Manipulation Language (DML) Commands 2) LOAD - The LOAD statement transfers data files into the locations that correspond to Hive tables. b) SELECT - The SELECT statement in Hive functions similarly to the SELECT statement in SQL. It is primarily for retrieving data from the database, ©) INSERT - The INSERT clause loads the data into a Hive table, Users can also perform an insert to both the Hive table and/or partition, TECHNICAL PUBLICATIONS® - an up-thrust for knowledge 5-25 sig 088 Analytics Hadcop Reloted Tools i _ The DELETE clause deletes all the data in th ; ELETE The ‘ ie table. Specifi a) oF targeted and deleted ifthe WHERE clause is specified. emcee 2 UPDATE - The UPDATE command in Hive updates the data in the table. I the gery includes the WHERE clause, then it updates the column of the rows that meet the condition in the WHERE clause. ) EXPORT - The Hive EXPORT command moves the table or pation data together with the metadata to a designated output location in the HDFS. ) IMPORT - The Hive IMPORT statement imports the data from a pattcuarized Jocation to a new or currently existing table. Ea HiveQL Queries « The Hive Query Language (HiveQL) is a query language for Hive to process and analyze structured data in a Metastore, Hive Query Language is used for processing and analyzing structured data. It separates users from the complexity of Map Reduce programming. SELECT ... FROM Clauses : «SELECT is the projection operator in SQL. The FROM clause identifies from which table, view or nested query we select records. For a given record, SELECT specifies the columns to keep, as well as the outputs of function calls on one or more columns. + Here's the syntax of Hive's SELECT statement. SELECT [ALL | DISTINCT] select_expr, select_expr, ... FROM table, reference (WHERE where_condition] (GROUP BY col list] {HAVING having_condition} ICLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]] (LIMIT number} * SELECT is the projection operator in HiveQL. The points are : a) SELECT scans the table specified by the FROM clause b) WHERE gives the condition of what to filter ©) GROUP BY gives a list of columns which specify how to aggregate the records 4) CLUSTER BY, DISTRIBUTE BY, SORT BY specify the sort order and algorithm ©) LIMIT specifies how many # of records to retrieve. 
Hive Data Manipulation Language (DML) commands :
a) LOAD : The LOAD statement transfers data files into the locations that correspond to Hive tables.
b) SELECT : The SELECT statement in Hive functions similarly to the SELECT statement in SQL. It is primarily used for retrieving data from the database.
c) INSERT : The INSERT clause loads data into a Hive table. Users can also perform an insert into both a Hive table and/or a partition.
d) DELETE : The DELETE clause deletes all the data in the table. Specific rows are targeted and deleted if a WHERE clause is specified.
e) UPDATE : The UPDATE command in Hive updates the data in the table. If the query includes a WHERE clause, then it updates the columns of the rows that meet the condition in the WHERE clause.
f) EXPORT : The Hive EXPORT command moves the table or partition data together with the metadata to a designated output location in HDFS.
g) IMPORT : The Hive IMPORT statement imports the data from a particular location into a new or currently existing table.

5.9 HiveQL Queries

- The Hive Query Language (HiveQL) is a query language for Hive to process and analyze structured data in a metastore. Hive Query Language is used for processing and analyzing structured data. It separates users from the complexity of MapReduce programming.

SELECT ... FROM clauses :
- SELECT is the projection operator in SQL. The FROM clause identifies from which table, view or nested query we select records. For a given record, SELECT specifies the columns to keep, as well as the outputs of function calls on one or more columns.
- Here is the syntax of Hive's SELECT statement :
  SELECT [ALL | DISTINCT] select_expr, select_expr, ...
  FROM table_reference
  [WHERE where_condition]
  [GROUP BY col_list]
  [HAVING having_condition]
  [CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
  [LIMIT number]
- SELECT is the projection operator in HiveQL. The main points are :
  a) SELECT scans the table specified by the FROM clause.
  b) WHERE gives the condition of what to filter.
  c) GROUP BY gives a list of columns which specify how to aggregate the records.
  d) CLUSTER BY, DISTRIBUTE BY and SORT BY specify the sort order and algorithm.
  e) LIMIT specifies how many records to retrieve.

Computing with columns :
- When we select columns, we can manipulate column values using either arithmetic operators or function calls. Math, date and string functions are commonly used.
- Here is an example query that uses both operators and functions :
  SELECT upper(name), sales_cost FROM products;
- WHERE clauses : A WHERE clause is used to filter the result set by using predicate operators and logical operators. Functions can also be used to compute the condition.
- GROUP BY clauses : A GROUP BY clause is frequently used with aggregate functions, to group the result set by columns and apply aggregate functions over each group. Functions can also be used to compute the grouping key.
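A small worked query combining these clauses on the products table used above; the category column and the thresholds are assumed for illustration :

  -- total sales cost per category, keeping only large categories
  SELECT   category, SUM(sales_cost) AS total_cost
  FROM     products
  WHERE    sales_cost > 0
  GROUP BY category
  HAVING   SUM(sales_cost) > 10000
  ORDER BY total_cost DESC
  LIMIT    10;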
: Pig Latin is a scripting language similar to Perl used to search large data ses. Tt | is composed of a sequence of transformations and operations that are applied to the | | ‘Rput data to create data. ‘ | Th es | Gis Pig engine is the environment in which Pig Latin prosers are executed. It ( Sanslates Pig Latin operators into MapReduce jobs. TECHNICAL PUBLICATIONS® - an up-tmust for Iowedge Bg Dats Ansitics 5-28 Hadoop Related Tools | Q.41 What Is pig storage 7 Ans. : Pig has a builtin load function called pig wish to import data from a file system into the Pig, torage. In addition, whenever we can use Pig storage. | Q.12_ What are the features of Hive ? Ans. : # It stores schema in a database dnd processed data into HDFS. * It is designed for OLAP. | | | | * It provides SQL type language for querying called HiveQL or HQL. L * It is familiar, fast, scalable and extensible. aaa TECHNICAL PUBLICATIONS® - an up-thrust for knowledge ai a2 as a4 Qs a6 Q7z as Qo Q.10 Q.44 a) b) Q.12 a) time : Thr SOLVED MODEL QUESTION PAPER [As Per New Syllabus} Big Data Analytics Semester - V (AI&DS) Vertical - 1 (Verticals for AIDS 1) (AI&DS) Vertical - 1 (Data Science) (CSEAT/CS&BS) Vertical - VI (Diversified Courses) (EEE) rs] [Maximum Marks : 100 eels ‘Answer ALL Questions PART A - (10 x 2 = 20 Marks) What is Hadoop ? (Refer Two Marks Q.15 of Chapter - 1) What is data science ? (Refer Two Marks Q.1 of Chapter - 1) Explain Cassandra data center, (Refer Two Marks Q.10 of Chapter - 2) What is the difference bettveen Sharding and replication ? (Refer Two Marks Q.4 of Chapter - 2) Why is a block in HDFS so large ? (Refer Two Marks Q.5 of Chapter - 3) What is MapFile ? (Refer Two Marks Q.11 of Chapter - 3) Define MapReduce. (Refer Two Marks Q.1 of Chapter - 4) Explain First.In First Out (FIFO) scheduling. (Refer Two Marks Q.7 of Chapter - 4) What is pig storage ? (Refer Two Marks Q.11 of Chapter - 5) What is Zookeeper ? (Refer Two Marks Q.5 of Chapter - 5) PART B - (5 x 13 = 65 Marks) i) What is unstructured data? Compare structured and unstructured data. (Refer section 1.3) (6) ii) Explain application of big data. (Refer section 1.6) a OR i) What is web analytics? Why web analytics is important ? (Refer section 1.5) [6] ii) Draw and explan Hadoop ecosystem. (Refer section 1.8.1) i] i) Briefly discuss schemaless database. (Refer section 2.3) a] il) What is CAP theorem? Explain, (Refer section 2.1.3) ta (M1) Se Data Anadis Mo? b) Q.t3 a) b) Q44 a) b) Q.15 2) b) Q.16 a) b) ‘Solved Moxie! Quostion Paper aad senting with replioation. nad 2.5.6) ta iD Discuss read te Quonims. (Refer section 2.6.3) (7) What is > Explain fisture of Hadoop streaming. (Refer section 3.2) le) ii) Explai Seanism of HFS. (Refer section 3.4.5) rea] oR ii) Avro (Refer section 3.5.6) iii) Date integrity in HDFS (Refer section 3.5.) iv) oop loos (Refer section 3.5.2) (13) 3) Discuss data flow in MapRaduce programming model. (Refer section 4.1.2) [6] iD on YARN. (Refer section 4.4) eal OR i) Discuss Input - Output format of MapReduce. (Refer section 4.9.1) {6 ii) What is capacity scheduler? Compare capacity and fair scheduler. (Refer sections 4.6.3 and 4.6.4) (7) i) What is Hbase? Draw architecture of Hbase. Explain difference between HDFS and Hbase. (Refer section 5.1) (13) OR i) Write skort note on Hbase client. (Refer section 5.3) 13] ii) What is pig? Explain feature of pig. Draw architecture of pig. (Refer section 5.5) (7 PART C - (1x 15 = 15 Marks) i) What is open source technology? 
Explain advantages, disadvantages ~ and application of open source. (Refer section 1.9) m 4) Explain failures in classic map reduce and YARN. (Refer section 4.5) [8] OR Explain with diagram various aggregate data model of NoSQL. (Refer section 2.2) (15) gQa00 TECHNICAL PUBLICATIONS® - an up-thrust for knowledge
