Bigdata Analytics

Uploaded by

samjaiwin2210
Copyright
© All Rights Reserved
BIGDATA ANALYTICS

USING MACHINE LEARNING

By
S.Manasha
II MSc Computer Science
HIVE
 Hive is a data warehouse and ETL tool that provides an SQL-like interface
between the user and the Hadoop Distributed File System (HDFS). It is built
on top of Hadoop.
 It facilitates reading, writing and managing large datasets that are stored in
distributed storage and queried using Structured Query Language (SQL) syntax. It
is not built for Online Transaction Processing (OLTP) workloads.
 It is designed to enhance scalability, extensibility, performance, fault-tolerance
and loose-coupling with its input formats.
HIVE DATA MODELLING

 Tables - Tables in Hive are created the same way as in an RDBMS
 Partitions - Here, tables are organized into partitions for grouping similar
types of data based on the partition key
 Buckets - Data present in partitions can be further divided into buckets for
efficient querying
HIVE INTERNAL TABLES VS EXTERNAL TABLES

Internal:
 Data is stored in the Hive data warehouse. The data warehouse is located at
/hive/warehouse/ on the default storage for the cluster.
 Use internal tables when one of the following conditions applies.
 Data is temporary.
 You want Hive to manage the lifecycle of the table and data.
External:

 Data is stored outside the data warehouse. The data can be stored on any
storage accessible by the cluster.
 Use external tables when one of the following conditions applies:
 The data is also used outside of Hive. For example, the data files are updated
by another process (one that doesn't lock the files).
 Data needs to remain in the underlying location, even after dropping the table.
 You need a custom location, such as a non-default storage account.
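
The difference between the two table types can be sketched as follows. This is a minimal illustration only: the table names, columns and the path '/data/shared/events' are hypothetical and not part of the original slides.

```sql
-- Internal (managed) table: Hive owns both the metadata and the data files.
CREATE TABLE IF NOT EXISTS staging_events (
  event_id INT,
  payload  STRING
);

-- External table: Hive tracks only the metadata; the files stay at LOCATION.
-- '/data/shared/events' is a hypothetical path.
CREATE EXTERNAL TABLE IF NOT EXISTS shared_events (
  event_id INT,
  payload  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/shared/events';

-- Dropping the internal table also removes its data from the warehouse.
DROP TABLE staging_events;

-- Dropping the external table removes only the table definition;
-- the files under LOCATION remain available to other tools.
DROP TABLE shared_events;
```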
PARTITIONS
 A table may be partitioned in multiple dimensions.
 For example, in addition to partitioning logs by date, we might also subpartition each date
partition by country to permit efficient queries by location.
 Partitions are defined at table creation time using the PARTITIONED BY clause, which takes
a list of column definitions.
 Dividing a large dataset into partitions means that a query filtering on the partition columns
searches only the relevant partitions instead of the whole table.

hive> CREATE TABLE party_table (loaded INT, logerror STRING)
PARTITIONED BY (logdt STRING, country STRING);
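
As a rough sketch of how a partitioned table is loaded and queried, assume a hypothetical table web_logs; the table name, columns, file path and partition values below are placeholders chosen for illustration.

```sql
-- Hypothetical table, partitioned first by date and then by country.
CREATE TABLE IF NOT EXISTS web_logs (
  ts   BIGINT,
  line STRING
)
PARTITIONED BY (dt STRING, country STRING);

-- Each partition is named explicitly when loading; the input path is a placeholder.
LOAD DATA LOCAL INPATH '/tmp/logs/2024-01-01-us.txt'
INTO TABLE web_logs
PARTITION (dt = '2024-01-01', country = 'US');

-- Filtering on the partition columns lets Hive read only the matching partition.
SELECT ts, line
FROM web_logs
WHERE dt = '2024-01-01' AND country = 'US';

-- List the partitions that currently exist for the table.
SHOW PARTITIONS web_logs;
```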
BUCKETS
 Bucketing a table enables more efficient queries.
 Bucketing also makes sampling more efficient.

hive> CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS;
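
A possible way to fill a bucketed table and sample from it is sketched below. The tables bucketed_users_demo and users are hypothetical. On Hive versions before 2.0, the setting hive.enforce.bucketing=true may also be needed before the insert.

```sql
-- Hypothetical bucketed table: rows are assigned to one of 4 buckets by id.
CREATE TABLE IF NOT EXISTS bucketed_users_demo (
  id   INT,
  name STRING
)
CLUSTERED BY (id) INTO 4 BUCKETS;

-- Populate it from an existing, unbucketed table (assumed to be called users).
INSERT OVERWRITE TABLE bucketed_users_demo
SELECT id, name FROM users;

-- Sample about a quarter of the rows by reading only the first bucket.
SELECT *
FROM bucketed_users_demo
TABLESAMPLE (BUCKET 1 OUT OF 4 ON id);
```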
HIVE DATA TYPES
Primitive Data Types:

 Numeric Data types - Data types like integral, float, decimal
 String Data type - Data types like char, string
 Date/Time Data type - Data types like timestamp, date, interval
 Miscellaneous Data type - Data types like Boolean and binary

Complex Data Types:

 Arrays - A collection of elements of the same type. The syntax is: array<data_type>
 Maps - A collection of key-value pairs. The syntax is: map<primitive_type, data_type>
 Structs - A collection of named fields, each with its own data type and an optional comment.
Syntax: struct<col_name : data_type [COMMENT col_comment], ...>
 Unions - A collection of heterogeneous data types. Syntax: uniontype<data_type, data_type, ...>
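
The complex types can be illustrated with a small sketch. The table employees_demo, its columns, the delimiters and the map key 'home' are all hypothetical.

```sql
-- Hypothetical table using an array, a map and a struct column.
CREATE TABLE IF NOT EXISTS employees_demo (
  name    STRING,
  skills  ARRAY<STRING>,
  phones  MAP<STRING, STRING>,
  address STRUCT<city:STRING, zip:INT>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '|'
  COLLECTION ITEMS TERMINATED BY ','
  MAP KEYS TERMINATED BY ':';

-- Arrays are indexed from 0, maps are accessed by key,
-- and struct fields are accessed with dot notation.
SELECT
  name,
  skills[0]      AS first_skill,
  phones['home'] AS home_phone,
  address.city   AS city
FROM employees_demo;
```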
MODES OF HIVE

 Local Mode - Used when Hadoop has a single data node and the amount of data is
small. Processing is very fast for small datasets that reside on the local
machine.
 MapReduce Mode - Used when the data in Hadoop is spread across multiple data
nodes. Processing large datasets can be more efficient using this mode.
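
One way to switch between the two modes from a Hive session is sketched below, assuming the standard configuration property hive.exec.mode.local.auto; the size threshold shown is only an example value.

```sql
-- Let Hive decide automatically whether a query is small enough to run locally.
SET hive.exec.mode.local.auto=true;

-- Upper bound on total input size (in bytes) for automatic local mode;
-- 134217728 bytes is 128 MB, used here as an example value.
SET hive.exec.mode.local.auto.inputbytes.max=134217728;

-- Print the current value to confirm the setting.
SET hive.exec.mode.local.auto;
```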
ADVANTAGES OF HIVE
 Scalability
 Familiar SQL-like interface
 Supports partitioning and bucketing
 User-defined functions
HIVE QL
 HiveQL is the Hive Query Language
 DDL and DML are the two parts of HiveQL
 Data Definition Language (DDL) is used for creating, altering and dropping
databases, tables, views, functions and indexes.
 Data Manipulation Language (DML) is used to put data into Hive tables, to
extract data to the file system, and to explore and manipulate data
with queries, grouping, filtering, joining etc.
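
A short sketch of DDL and DML statements working together follows. The table sales_demo, the view big_sales_demo and the paths are hypothetical examples.

```sql
-- DDL: create, alter and define a view over a hypothetical table.
CREATE TABLE IF NOT EXISTS sales_demo (
  region STRING,
  amount DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

ALTER TABLE sales_demo ADD COLUMNS (sold_on STRING);

CREATE VIEW IF NOT EXISTS big_sales_demo AS
SELECT region, amount FROM sales_demo WHERE amount > 1000;

-- DML: put data into the table (the file path is a placeholder).
LOAD DATA LOCAL INPATH '/tmp/sales.csv' INTO TABLE sales_demo;

-- DML: explore the data with filtering and grouping.
SELECT region, SUM(amount) AS total
FROM sales_demo
WHERE amount > 0
GROUP BY region;

-- DML: extract query results to a directory on the file system.
INSERT OVERWRITE DIRECTORY '/tmp/sales_by_region'
SELECT region, SUM(amount) FROM sales_demo GROUP BY region;

-- DDL: drop the view and the table again.
DROP VIEW IF EXISTS big_sales_demo;
DROP TABLE IF EXISTS sales_demo;
```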
COMMANDS
 CREATE DATABASE db_name -- To create a database in Hive.
 USE db_name -- To use the database in Hive.
 DROP DATABASE db_name -- To delete the database in Hive.
 SHOW DATABASES -- To see the list of databases.
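
Put together, a typical session might look like the sketch below; analytics_demo is a hypothetical database name.

```sql
-- Create the database only if it does not already exist.
CREATE DATABASE IF NOT EXISTS analytics_demo;

-- List all databases, then switch to the new one.
SHOW DATABASES;
USE analytics_demo;

-- Switch back to the default database before dropping.
USE default;

-- CASCADE also drops any tables still inside the database.
DROP DATABASE IF EXISTS analytics_demo CASCADE;
```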
THANK YOU