Python and Hadoop Basics - Programmin...

The document provides an introduction to Python and Hadoop. It covers Python basics like variables, data types, operators and control flow statements. It also covers Hadoop concepts like HDFS, YARN, MapReduce and HBase. Example code is provided for many topics.

Uploaded by Lords Botz

PYTHON AND HADOOP

BASICS

PROGRAMMING FOR BEGINNERS


J KING
PYTHON INTRODUCTION
PYTHON HISTORY AND VERSIONS
FIRST PYTHON PROGRAM
PYTHON VARIABLES
PYTHON DATA TYPES
PYTHON KEYWORDS
PYTHON LITERALS
PYTHON OPERATORS
PYTHON COMMENTS
PYTHON IF-ELSE STATEMENTS
PYTHON LOOPS
PYTHON FOR LOOP
PYTHON WHILE LOOP
PYTHON BREAK STATEMENT
PYTHON CONTINUE STATEMENT
PYTHON STRING
PYTHON TUPLE
HADOOP INTRO
BIG DATA
WHAT IS HADOOP
HADOOP MODULES
STARTING HDFS
FEATURES AND GOALS OF HDFS
YARN
HADOOP MAPREDUCE
MAPREDUCE - DATA FLOW
MAPREDUCE API
MAPREDUCE - WORD COUNT EXAMPLE
MAPREDUCE CHAR COUNT EXAMPLE
HBASE
HBASE INTRO
HBASE READ
HBASE WRITE
RDBMS VS HBASE
HBASE EXAMPLE
HIVE INTRO
HIVE ARCHITECTURE
HIVE DATA TYPES
HIVE - CREATE DATABASE
HIVE - DROP DATABASE
HIVEQL - OPERATORS
HIVEQL - FUNCTIONS
PYTHON
BASICS

PROGRAMMING FOR BEGINNERS


J KING
PYTHON INTRODUCTION
Python is a general-purpose, dynamic, high-level, interpreted programming
language. It supports the object-oriented programming approach for
developing applications. It is simple and easy to learn, and it provides many
high-level data structures.
Python is a powerful and versatile scripting language that is easy to learn,
making it attractive for application development.
With its interpreted nature, Python's simple syntax and dynamic typing make
it an ideal language for scripting and rapid application development.
Python supports multiple styles of programming, including object-oriented,
imperative, functional, and procedural styles.
Python is not restricted to a single domain, such as web programming. This
is why it is known as a multipurpose programming language: it can be used
for web development, enterprise applications, 3D CAD, and more.
We don't have to declare a variable's data type because Python is dynamically
typed, so we can simply write a = 10 to assign an integer value to a variable.
Python Features
Python makes development and debugging fast because there is no
compilation step in the Python development cycle, and the edit-test-debug
cycle is very quick.
Compared with other programming languages, Python is easy to learn. Its
syntax is straightforward and close to the English language. Semicolons and
curly brackets are not used; indentation defines a block of code. It is the
recommended language for programming beginners.
Python can perform complex tasks using a few lines of code. As a simple
example, for the hello world program you simply type print("Hello World").
It takes one line in Python, while Java or C would take multiple lines.
Python History and Versions
Guido van Rossum began implementing Python at CWI in the Netherlands
in December 1989.
He published the code (labeled version 0.9.0) to alt.sources in February
1991.
Python 1.0 was released in 1994, with new features such as lambda, map,
filter, and reduce.
Python 2.0 added new features such as list comprehensions and a garbage
collection system.
Python 3.0 (also called "Py3k") was released on December 3, 2008. It was
designed to correct fundamental design flaws in the language.
Python Version List

Python Version    Release Date
Python 1.0 January 1994

Python 1.5 December 31, 1997

Python 1.6 September 5, 2000

Python 2.0 October 16, 2000

Python 2.1 April 17, 2001

Python 2.2 December 21, 2001

Python 2.3 July 29, 2003

Python 2.4 November 30, 2004

Python 2.5 September 19, 2006


Python 2.6 October 1, 2008

Python 2.7 July 3, 2010

Python 3.0 December 3, 2008

Python 3.1 June 27, 2009

Python 3.2 February 20, 2011

Python 3.3 September 29, 2012

Python 3.4 March 16, 2014

Python 3.5 September 13, 2015

Python 3.6 December 23, 2016

Python 3.7 June 27, 2018

Python 3.8 October 14, 2019

To download the latest Python release, visit:

https://www.python.org/downloads/
First Python Program
In this section, we're going to discuss Python's basic syntax and run a
simple program to print Hello World on the console.
Python provides us with two modes of running a program:
Using the interactive interpreter prompt
Using a script file

Interactive interpreter prompt

Python gives us the ability to execute Python statements one by one at the
interactive prompt. This is useful when we care about the output of every
line of our Python program.
Open the terminal (or command prompt) and type python (or python3, if you
have both Python 2 and Python 3 installed on your system).
It will open a prompt where we can execute Python statements and check
their effect on the console.

Using a script file

The interpreter prompt is good for running individual code statements, but
code typed at the terminal cannot be reused later.
Our code must be written into a file that can be executed later. To do this,
open a text editor, create a file named first.py (Python uses the .py
extension), and write the following code into it.
print("hello world")  # here, we have used the print() function to print the message on the console
Run the following command on the terminal
$ python3 first.py
Python Variables
A variable is a name used to refer to a memory location. It is also known as
an identifier and is used to hold a value.
In Python, we don't need to specify the variable's type, because Python
infers types and is smart enough to determine the type of a variable.
Variable names can be a group of letters and digits, but must start with a
letter or an underscore.
Using lowercase letters for variable names is recommended. Rahul and
rahul are two different variables.

Identifier Naming
Variables are an example of identifiers. An identifier is used to name the
entities used in a program. The rules for naming an identifier are set out below.
The first character of a variable must be a letter or an underscore (_).
All characters except the first can be lowercase letters (a-z), uppercase
letters (A-Z), underscores, or digits (0-9).
The identifier name must not contain any whitespace or special characters
(!, @, #, %, ^, &, *).
The identifier name must not be the same as any keyword defined in the
language.
Identifier names are case-sensitive; for example, myname and MyName are
not identical.
Declaring Variable and Assigning Values
Python does not require us to declare a variable before we use it in a
program. It allows us to create a variable at the moment it is needed.
In Python we do not need to declare variables explicitly. A variable is
declared automatically when we assign a value to it.
The equals operator (=) is used to assign a value to a variable.
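As a small illustration of this dynamic behavior, the same name can be bound first to an integer and then to a string, and type() reports the type of whatever object the name currently refers to:

```python
# A variable is created by the first assignment; no type declaration is needed.
x = 10
print(type(x))   # <class 'int'>

# The same name can be rebound to an object of a different type.
x = "hello"
print(type(x))   # <class 'str'>
```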
Object References
When we declare a variable it's necessary to understand how the Python
interpreter works. The process of treating variables differs slightly from many
other programming languages.
Python is a highly object-oriented programming language; therefore, each
data item belongs to a specific class type. Consider example below.
print("John")

OUTPUT
John
In the preceding print statement, Python creates a string object and displays
it on the console. Let's check its type using Python's built-in type()
function.
type("John")

OUTPUT
<class 'str'>

Object Identity
In Python, each object created is uniquely identified. Python ensures that no
two objects existing at the same time have the same identifier. The built-in
id() function is used to obtain an object's identifier. Consider the example below.
a = 50
b=a
print(id(a))
print(id(b))
# Reassigned variable a
a = 500
print(id(a))
Output:
140734982691168
140734982691168
2822056960944
Because we assigned b = a, both a and b point to the same object, and id()
returned the same number for both. When we reassigned a to 500, a new
object was created and a new identifier was returned.

Variable Names
We've already discussed how to declare a valid variable. Variable names
may contain uppercase and lowercase letters (A to Z, a to z), digits (0-9),
and underscores (_).
Consider the following example of valid variable names.
name = "Devansh"
age = 20
marks = 80.50

print(name)
print(age)
print(marks)
OUTPUT
Devansh
20
80.5
Consider the following example
name = "A"
Name = "B"
naMe = "C"
NAME = "D"
n_a_m_e = "E"
_name = "F"
name_ = "G"
_name_ = "H"
na56me = "I"

print(name, Name, naMe, NAME, n_a_m_e, NAME, n_a_m_e, _name, name_, _name, na56me)
OUTPUT
A B C D E D E F G F I
In the above example we declared several valid variable names such as
name, Name, etc. This is not recommended, however, because it can create
confusion when reading the code. To make code more readable, variable
names should be descriptive.

Multiple Assignment
Python allows us to assign values to multiple variables in a single statement,
which is also known as multiple assignment.
Multiple assignment can be applied in two ways: by assigning a single value
to multiple variables, or by assigning multiple values to multiple variables.
Consider the examples below.
Assigning single value to multiple variables
x=y=z=50
print(x)
print(y)
print(z)
Output
50
50
50
Assigning multiple values to multiple variables:
a, b, c = 5, 10, 15
print(a)
print(b)
print(c)
OUTPUT
5
10
15
Python Data Types
Variables can hold values, with each value having a data-type. Python is a
dynamically typed language; therefore when declaring it, we don't need to
define the type of variable. Implicitly the interpreter binds the value to their
type.
a=5
The variable a holds the value of integer five, and we have not defined its
type. Python interpreter will interpret variables as a type of integer
automatically.
Python allows us to check what type of variable the program uses. Python
gives us the type() function, which returns the type of the passed variable.
Consider the example below to define the values of various types of data and
to check their type.
a=10
b="Hi Python"
c = 10.5
print(type(a))
print(type(b))
print(type(c))
OUTPUT
<class 'int'>
<class 'str'>
<class 'float'>
Standard data types
A variable can hold a variety of values. For instance, the name of a person
has to be stored as a string while its Id has to be stored as an integer.
Python provides different standard data types on each of them which define
the storage method. Below you will find the data types defined in Python.
1. Numbers
2. Sequence Type
3. Boolean
4. Set
5. Dictionary

Numbers
Number objects store numeric values. Integer, float, and complex values
belong to Python's number data types. Python provides the type() function
for finding a variable's data type. Similarly, the isinstance() function is used
to check whether an object belongs to a particular class.
Python creates number objects when a number is assigned to a variable. For
instance:
a=5
print("The type of a", type(a))

b = 40.5
print("The type of b", type(b))

c = 1+3j
print("The type of c", type(c))
print(" c is a complex number", isinstance(1+3j,complex))
OUTPUT
The type of a <class 'int'>
The type of b <class 'float'>
The type of c <class 'complex'>
 c is a complex number True

Sequence Type
String
The sequence of characters represented in quotation marks can be defined as
the string. In Python, we can define a string by using single, double, or triple
quotes.
String handling in Python is a straightforward task, since Python provides
built-in functions and operators to perform string operations.
In string handling, the + operator is used to concatenate two strings; for
example, the operation "hello" + " python" returns "hello python".
The example below illustrates Python string.
EXAMPLE 1
str = "string using double quotes"
print(str)
s = '''A multiline
string'''
print(s)
OUTPUT
string using double quotes
A multiline
string
EXAMPLE 2
str1 = 'hello learnpython'  # string str1
str2 = ' how are you'       # string str2
print(str1[0:2])    # printing the first two characters using the slice operator
print(str1[4])      # printing the character at index 4 of the string
print(str1 * 2)     # printing the string twice
print(str1 + str2)  # printing the concatenation of str1 and str2
OUTPUT
he
o
hello learnpythonhello learnpython
hello learnpython how are you

List
Python lists are similar to arrays in C. However, a list may contain data of
different types. The items stored in a list are separated by commas (,) and
enclosed in square brackets [].
We can use the slice operator [:] to access list data. The concatenation
operator (+) and the repetition operator (*) work with lists in the same way
as with strings.
EXAMPLE
list1 = [1, "hi", "Python", 2]

# Checking the type of the given list
print(type(list1))

# Printing list1
print(list1)

# List slicing
print(list1[3:])
print(list1[0:2])

# List concatenation using the + operator
print(list1 + list1)

# List repetition using the * operator
print(list1 * 3)
OUTPUT
<class 'list'>
[1, 'hi', 'Python', 2]
[2]
[1, 'hi']
[1, 'hi', 'Python', 2, 1, 'hi', 'Python', 2]
[1, 'hi', 'Python', 2, 1, 'hi', 'Python', 2, 1, 'hi', 'Python', 2]

Tuple
A tuple is similar to a list in many ways. Like lists, tuples contain a
collection of items of different data types. Tuple items are separated by
commas (,) and enclosed in parentheses ().
A tuple is a read-only data structure: we cannot change the size or the
values of its items.
EXAMPLE
tup = ("hi", "Python", 2)

# Checking the type of tup
print(type(tup))

# Printing the tuple
print(tup)

# Tuple slicing
print(tup[1:])
print(tup[0:1])

# Tuple concatenation using the + operator
print(tup + tup)

# Tuple repetition using the * operator
print(tup * 3)

# Assigning to an item of tup. It will throw an error.
tup[2] = "hi"
OUTPUT
<class 'tuple'>
('hi', 'Python', 2)
('Python', 2)
('hi',)
('hi', 'Python', 2, 'hi', 'Python', 2)
('hi', 'Python', 2, 'hi', 'Python', 2, 'hi', 'Python', 2)

Traceback (most recent call last):
  File "main.py", line 14, in <module>
    tup[2] = "hi"
TypeError: 'tuple' object does not support item assignment

Dictionary
A dictionary is an unordered collection of key-value pairs. It is like an
associative array or a hash table, where each key stores a specific value. A
key can be any hashable (immutable) type, whereas a value can be an
arbitrary Python object.
Dictionary items are separated by commas (,) and enclosed within curly
braces {}.
EXAMPLE
d = {1:'Jimmy', 2:'Alex', 3:'john', 4:'mike'}

# Printing the dictionary
print(d)

# Accessing values using keys
print("1st name is " + d[1])
print("2nd name is " + d[4])

print(d.keys())
print(d.values())
OUTPUT
{1: 'Jimmy', 2: 'Alex', 3: 'john', 4: 'mike'}
1st name is Jimmy
2nd name is mike
dict_keys([1, 2, 3, 4])
dict_values(['Jimmy', 'Alex', 'john', 'mike'])

Boolean
The Boolean type provides two built-in values, True and False, which are
used to determine whether a given statement is true or false. It is denoted by
the class bool. In a Boolean context, any non-zero value is treated as true,
whereas 0 is treated as false.
EXAMPLE
# Python program to check the boolean type
print(type(True))
print(type(False))
print(false)
OUTPUT
<class 'bool'>
<class 'bool'>
NameError: name 'false' is not defined
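The truthiness rule described above can be checked directly with bool(), which converts any value to True or False:

```python
# Any non-zero number is treated as True; 0 is treated as False.
print(bool(0))      # False
print(bool(42))     # True

# Empty strings and containers are False; non-empty ones are True.
print(bool(""))     # False
print(bool("F"))    # True (a non-empty string, even "F", is truthy)
```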

SET
A Python set is an unordered collection data type. It is iterable, mutable (it
can change after it has been created), and contains only unique elements. In
a set, the order of the elements is undefined; the elements may be returned
in a different order than they were inserted. A set is created either with the
built-in set() function or by placing a comma-separated sequence of
elements inside curly braces. It can contain values of different types.
EXAMPLE
# Creating an empty set
set1 = set()

set2 = {'James', 2, 3, 'Python'}

# Printing the set value
print(set2)

# Adding an element to the set
set2.add(10)
print(set2)

# Removing an element from the set
set2.remove(2)
print(set2)
OUTPUT
{3, 'Python', 'James', 2}
{'Python', 'James', 3, 2, 10}
{'Python', 'James', 3, 10}
Python Keywords
Python keywords are special reserved words that convey a special meaning
to the interpreter. Each keyword has a particular meaning and a particular
operation. Keywords cannot be used as variable names. The list of Python
keywords follows.

True      False     None      and       as
assert    def       class     continue  break
else      finally   elif      del       except
global    for       if        from      import
raise     try       or        return    pass
nonlocal  in        not       is        lambda
while     with      yield

assert
This keyword is used in Python as a debugging tool. It checks the code for
correctness. It raises an AssertionError if the asserted condition is false, and
it can also print an error message.
EXAMPLE
a = 10
b = 0
print('a is dividing by Zero')
assert b != 0, "Divide by 0 error"
print(a / b)
OUTPUT
a is dividing by Zero
Runtime Exception:
Traceback (most recent call last):
File "/home/40545678b342ce3b70beb1224bed345f.py", line 4, in
assert b != 0, "Divide by 0 error"
AssertionError: Divide by 0 error

def
This keyword is used in Python to declare a function.
def my_func(a, b):
    c = a + b
    print(c)
my_func(10, 20)
OUTPUT
30
class
In Python, this keyword is used to define a class. A class is a blueprint for
objects; it is a collection of variables and methods. Consider the class below.
class Myclass:
    # Variables...
    def function_name(self):
        # statements...
continue
It is used to skip the rest of the current iteration of a loop. Consider the
example below.
a = 0
while a < 4:
    a += 1
    if a == 2:
        continue
    print(a)
OUTPUT
1
3
4

break
It is used to terminate the execution of a loop and transfer control to the end
of the loop. Consider the example below.
for i in range(5):
    if i == 3:
        break
    print(i)
print("End of execution")
OUTPUT
0
1
2
End of execution
elif
This keyword is used to check multiple conditions. If the previous condition
is false, the next condition is checked until a true one is found.
marks = int(input("Enter the marks:"))
if marks >= 90:
    print("Excellent")
elif marks < 90 and marks >= 75:
    print("Very Good")
elif marks < 75 and marks >= 60:
    print("Good")
else:
    print("Average")
Python Literals
Python Literals can be defined as data given in a constant or a variable.
Python supports the following literals:

1. String literals:
String literals are formed by enclosing text in quotes. To create a string, we
may use either single or double quotes.
Example:
"Aman" , '12345'
Types of Strings:
Python supports two types of strings:

a) Single-line strings - Strings that end on a single line are known as
single-line strings.
Example:
text1='hello'
b) Multi-line strings - A piece of text written across several lines is called a
multi-line string.
Multi-line strings can be formed in two ways:
1) Adding a backslash at the end of each line.
Example:
text1='hello\
user'
print(text1)
2) Using triple quotation marks:
Example:
str2='''welcome
to
SSSIT'''
print(str2)
OUTPUT
welcome
to
SSSIT

2. Numeric literals:
Numeric literals are immutable.
Example - Numeric Literals
x = 0b10100 #Binary Literals
y = 100 #Decimal Literal
z = 0o215 #Octal Literal
u = 0x12d #Hexadecimal Literal

#Float Literal
float_1 = 100.5
float_2 = 1.5e2

#Complex Literal
a = 5+3.14j

print(x, y, z, u)
print(float_1, float_2)
print(a, a.imag, a.real)
OUTPUT
20 100 141 301
100.5 150.0
(5+3.14j) 3.14 5.0

3. Boolean literals:
A Boolean literal can have one of these two values: True or False.
Example - Boolean Literals
x = (1 == True)
y = (2 == False)
z = (3 == True)
a = True + 10
b = False + 10

print("x is", x)
print("y is", y)
print("z is", z)
print("a:", a)
print("b:", b)
OUTPUT
x is True
y is False
z is False
a: 11
b: 10

4. Special literals
Python has one special literal: None.
None is used to indicate the absence of a value, for example a field that has
not been assigned. It is also the default return value of Python functions.
Example - Special Literals
val1=10
val2=None
print(val1)
print(val2)
OUTPUT
10
None

5. Literal Collections
Python provides the literal collection of four types, such as List literals, Tuple
literals, Dict literals, and Set literals.
LIST
A list contains items of different data types. Lists are mutable, i.e.,
modifiable.
Example - List literals
list=['John',678,20.4,'Peter']
list1=[456,'Andrew']
print(list)
print(list + list1)
output
['John', 678, 20.4, 'Peter']
['John', 678, 20.4, 'Peter', 456, 'Andrew']
Dictionary:
In key-value pair, Python dictionary stores the data.
It's enclosed by curly-braces{} and the commas(,) separates each pair.
Example
dict = {'name': 'Pater', 'Age':18,'Roll_nu':101}
print(dict)
Output:

{'name': 'Pater', 'Age': 18, 'Roll_nu': 101}


Tuple:
A Python tuple is a collection of items of different data types. It is
immutable, which means it cannot be changed after creation.
Example
tup = (10,20,"Dev",[2,3,4])
print(tup)
Output:

(10, 20, 'Dev', [2, 3, 4])


Set:
A Python set is an unordered collection of data.
It is enclosed by curly braces {}.
Example: - Set Literals
set = {'apple','grapes','guava','papaya'}
print(set)
Output:

{'guava', 'apple', 'papaya', 'grapes'}


Python Operators
An operator can be described as a symbol that performs a specific operation
between two operands. Operators are the foundation on which a program's
logic is constructed in a particular programming language. Python offers a
range of operators, described as follows.
Arithmetic operators
Comparison operators
Assignment Operators
Logical Operators
Bitwise Operators
Membership Operators
Identity Operators

Arithmetic Operators

Operator Description

+ (Addition) It is used to add two operands. For example, if a = 20, b = 10 => a + b = 30

- (Subtraction) It is used to subtract the second operand from the first operand. If the first operand is less than the second operand, the result is negative. For example, if a = 20, b = 10 => a - b = 10

/ (Division) It returns the quotient after dividing the first operand by the second operand. For example, if a = 20, b = 10 => a / b = 2.0

* (Multiplication) It is used to multiply one operand by the other. For example, if a = 20, b = 10 => a * b = 200

% (Remainder) It returns the remainder after dividing the first operand by the second operand. For example, if a = 20, b = 10 => a % b = 0

** (Exponent) It is the exponent operator: it raises the first operand to the power of the second operand.

// (Floor division) It gives the floor value of the quotient produced by dividing the two operands.
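The rows of the table above can be verified directly with a short script, using the same sample values a = 20, b = 10:

```python
a, b = 20, 10
print(a + b)    # 30
print(a - b)    # 10
print(a / b)    # 2.0 (true division always returns a float)
print(a * b)    # 200
print(a % b)    # 0
print(a ** 2)   # 400 (a raised to the power 2)
print(a // b)   # 2   (floor division)
```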
Comparison operators
Comparison operators are used to compare the values of two operands and
return a Boolean True or False accordingly. The comparison operators are
listed in the table below.

Operator Description

== If the value of two operands is equal, then the condition becomes true.

!= If the value of two operands is not equal, then the condition becomes true.

<= If the first operand is less than or equal to the second operand, then the
condition becomes true.

>= If the first operand is greater than or equal to the second operand, then the
condition becomes true.

> If the first operand is greater than the second operand, then the condition
becomes true.

< If the first operand is less than the second operand, then the condition
becomes true.
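Each comparison operator from the table can be tried out as follows:

```python
a, b = 10, 20
print(a == b)   # False
print(a != b)   # True
print(a <= b)   # True
print(a >= b)   # False
print(a > b)    # False
print(a < b)    # True
```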

Assignment Operators
The assignment operators are used to assign the value of the right
expression to the left operand. The assignment operators are listed in the
table below.

Operator Description

= It assigns the value of the right expression to the left operand.

+= It increases the value of the left operand by the value of the right operand
and assigns the modified value back to left operand. For example, if a = 10,
b = 20 => a+ = b will be equal to a = a+ b and therefore, a = 30.
-= It decreases the value of the left operand by the value of the right operand
and assigns the modified value back to left operand. For example, if a = 20,
b = 10 => a- = b will be equal to a = a- b and therefore, a = 10.

*= It multiplies the value of the left operand by the value of the right operand
and assigns the modified value back to then the left operand. For example, if
a = 10, b = 20 => a* = b will be equal to a = a* b and therefore, a = 200.

%= It divides the value of the left operand by the value of the right operand and
assigns the reminder back to the left operand. For example, if a = 20, b = 10
=> a % = b will be equal to a = a % b and therefore, a = 0.

**= a**=b will be equal to a=a**b, for example, if a = 4, b =2, a**=b will assign
4**2 = 16 to a.

//= A//=b will be equal to a = a// b, for example, if a = 4, b = 3, a//=b will assign
4//3 = 1 to a.
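The augmented assignment forms described above can be traced step by step:

```python
a = 10
a += 5    # a = a + 5  -> 15
print(a)
a *= 2    # a = a * 2  -> 30
print(a)
a //= 4   # a = a // 4 -> 7
print(a)
a **= 2   # a = a ** 2 -> 49
print(a)
```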

Bitwise Operators
The bitwise operators operate bit by bit on the values of the two operands.
Consider the scenario below.
if a = 7
b = 3
then, binary (a) = 0111
binary (b) = 0011

hence, a & b = 0011
a | b = 0111
a ^ b = 0100
~ a = 1000 (as a 4-bit illustration; in Python, ~7 evaluates to -8 because
integers use two's complement)

Operator Description

& (binary and) If both the bits at the same place in two operands are 1, then 1 is copied to
the result. Otherwise, 0 is copied.

| (binary or) The resulting bit will be 0 if both the bits are zero; otherwise, the resulting
bit will be 1.

^ (binary xor) The resulting bit will be 1 if both the bits are different; otherwise, the
resulting bit will be 0.

~ (negation) It calculates the negation of each bit of the operand, i.e., if the bit is 0, the
resulting bit will be 1 and vice versa.

<< (left shift) The left operand value is moved left by the number of bits present in the
right operand.

>> (right shift) The left operand is moved right by the number of bits present in the right
operand.
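Using the sample values from the scenario above (a = 7, b = 3), the bitwise operators produce:

```python
a, b = 7, 3        # binary 0111 and 0011
print(a & b)   # 3  (0011)
print(a | b)   # 7  (0111)
print(a ^ b)   # 4  (0100)
print(~a)      # -8 (Python integers use two's complement, so ~a == -a - 1)
print(a << 1)  # 14 (shift left by one bit)
print(a >> 1)  # 3  (shift right by one bit)
```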

Logical Operators
The logical operators are used primarily for making a decision in the
expression evaluation. Python supports the logical operators which follow.

Operator Description

and If both the expression are true, then the condition will be true. If a and b are
the two expressions, a → true, b → true => a and b → true.

or If one of the expressions is true, then the condition will be true. If a and b are
the two expressions, a → true, b → false => a or b → true.

not If an expression a is true, then not (a) will be false and vice versa.
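A quick demonstration of the three logical operators:

```python
a, b = True, False
print(a and b)   # False (both operands must be true)
print(a or b)    # True  (at least one operand is true)
print(not a)     # False (negation)
```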

Membership Operators
Python membership operators are used to check for membership of a value
within a Python data structure. The result is True if the value is present in
the data structure and False otherwise.
Operator Description

in It is evaluated to be true if the first operand is found in the second
operand (list, tuple, or dictionary).

not in It is evaluated to be true if the first operand is not found in the second
operand (list, tuple, or dictionary).

Identity Operators
Operator Description

is It is evaluated to be true if the references on both sides point to the same
object.

is not It is evaluated to be true if the references on both sides do not point to
the same object.
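The membership and identity operators can be demonstrated together. Note that two equal lists are still two distinct objects, which is where is differs from ==:

```python
nums = [1, 2, 3]
print(2 in nums)        # True  - membership test
print(5 not in nums)    # True

alias = nums            # alias refers to the very same object
copy = list(nums)       # copy is a new, equal object
print(alias is nums)    # True
print(copy is nums)     # False - equal values, different objects
print(copy == nums)     # True
```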

Operator Precedence
Operator precedence is important to understand, as it tells us which
operator is evaluated first. The operator precedence table in Python is
shown below.

Operator Description

** The exponent operator is given priority over all the other operators used in the expression.

~ + - Negation, unary plus, and unary minus.

* / % // Multiplication, division, modulo (remainder), and floor division.

+ - Binary plus and minus.

>> << Right shift and left shift.

& Binary and.

^ Binary xor.

| Binary or.

<= < > >= Comparison operators (less than, less than or equal to, greater than, greater than or equal to).

== != Equality operators (Python 2 also had <>).

= %= /= //= -= += *= **= Assignment operators.

is, is not Identity operators.

in, not in Membership operators.

not, or, and Logical operators.
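A few expressions make the precedence rules concrete; parentheses can always be used to override the default order:

```python
print(2 + 3 * 4)     # 14: * binds tighter than +
print((2 + 3) * 4)   # 20: parentheses are evaluated first
print(2 ** 3 ** 2)   # 512: ** is right-associative, so 2 ** (3 ** 2)
print(-3 ** 2)       # -9: ** binds tighter than unary minus
```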


Python Comments
Comments are a very important tool for programmers. Commonly,
comments are used to explain the code. When code has a clear explanation,
we can quickly grasp what it does. A good programmer should use
comments, so that if somebody wishes to change the code in the future or
add a new module, it can be done quickly.
We use the hash (#) at the beginning of a statement or line of code to apply
a comment.
EXAMPLE
# This is the print statement
print("Hello Python")
Here we used the hash (#) to write comment about the print statement. It
won't impact our print statement.

Multiline Python Comment


To apply the multiline Python comment, we need to use the hash(#) at the
beginning of each line of code. Find scenario below.
# First line of the comment
# Second line of the comment
# Third line of the comment
EXAMPLE
# Variable a holds value 5
# Variable b holds value 10
# Variable c holds sum of a and b
# Print the result
a=5
b = 10
c = a+b
print("The sum is:", c)
OUTPUT
The sum is: 15
The code above is clear enough that even absolute beginners can see what is
happening in each line. This is the advantage of comments.

Docstrings Python Comment

The docstring comment is mainly used in modules, functions, classes, and
methods. We will explain classes and methods in later tutorials.
EXAMPLE
def intro():
    """
    This function prints Hello Joseph
    """
    print("Hello Joseph")
intro()
OUTPUT
Hello Joseph
Python If-else statements
Decision-making is one of the most important aspects of almost all
programming languages. As the name implies, decision-making enables us
to run a specific block of code based on a given decision. Decisions here are
made based on the validity of the specified conditions. Condition checking
is the foundation of decision-making.
In Python, the following statements are used to make decisions.

Statement Description

If Statement The if statement is used to test a specific condition. If the condition is true, a
block of code (if-block) will be executed.

If - else Statement The if-else statement is identical to the if statement except
that it also includes a block of code for the case where the condition is false.
If the condition provided in the if statement is false, then the else statement
is executed.

Nested if Statement Nested if statements enable us to use an if-else statement
inside an outer if statement.

Indentation in Python
For the convenience of programming and to achieve simplicity, Python does
not require parentheses or braces around block-level code. In Python,
indentation is used to declare a block. If two statements are at the same level
of indentation, they are part of the same block.
In general, four spaces are used to indent statements, which is the standard
amount of Python indentation.
Indentation is one of the most widely used aspects of the Python language,
since it defines a code block. All statements of one block are meant to be at
the same level of indentation. We will see how indentation is actually used
in Python decision-making and elsewhere.
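The following sketch shows how indentation alone determines which statements belong to the if-block:

```python
x = 7
if x > 5:
    print("x is greater than 5")   # indented: inside the if-block
    print("still inside the block")
print("always printed")            # not indented: outside the if-block
```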
The if statement
The if statement is used to test a particular condition and executes a block of
code when the condition is true.

SYNTAX
if expression:
    statement
EXAMPLE 1
num = int(input("enter the number?"))
if num%2 == 0:
    print("Number is even")
OUTPUT
enter the number?10
Number is even
Example 2 : Program to print the largest of the three numbers.
a = int(input("Enter a? "))
b = int(input("Enter b? "))
c = int(input("Enter c? "))
if a > b and a > c:
    print("a is largest")
if b > a and b > c:
    print("b is largest")
if c > a and c > b:
    print("c is largest")
OUTPUT
Enter a? 100
Enter b? 120
Enter c? 130
c is largest
The if-else statement
The if-else statement provides an else block in conjunction with the if
statement, which is executed in the condition's false case.
If the condition is true, the if-block is performed; otherwise, the else-block
is executed.
SYNTAX
if condition:
    #block of statements
else:
    #another block of statements (else-block)
Example 1 : Program to check whether a person is eligible to
vote or not.
age = int(input("Enter your age? "))
if age >= 18:
    print("You are eligible to vote !!")
else:
    print("Sorry! you have to wait !!")
OUTPUT
Enter your age? 90
You are eligible to vote !!
Example 2: Program to check whether a number is even or not.
num = int(input("enter the number?"))
if num % 2 == 0:
    print("Number is even...")
else:
    print("Number is odd...")
OUTPUT
enter the number?10
Number is even...

The elif statement


The elif statement allows us to test multiple conditions and execute the
block of statements that corresponds to the first true condition. We can
include any number of elif statements in a program, based on our
requirements.
The elif statement works like the if-else-if ladder statement in C.
The elif statement syntax is given below.
if expression 1:
    # block of statements

elif expression 2:
    # block of statements

elif expression 3:
    # block of statements

else:
    # block of statements
EXAMPLE
number = int(input("Enter the number?"))
if number == 10:
    print("number is equals to 10")
elif number == 50:
    print("number is equal to 50")
elif number == 100:
    print("number is equal to 100")
else:
    print("number is not equal to 10, 50 or 100")
OUTPUT
Enter the number?15
number is not equal to 10, 50 or 100
EXAMPLE 2
marks = int(input("Enter the marks? "))
if marks > 85 and marks <= 100:
    print("Congrats ! you scored grade A ...")
elif marks > 60 and marks <= 85:
    print("You scored grade B + ...")
elif marks > 40 and marks <= 60:
    print("You scored grade B ...")
elif marks > 30 and marks <= 40:
    print("You scored grade C ...")
else:
    print("Sorry you failed ...")
Python Loops
By default, the flow of a program written in any programming language is
sequential. Sometimes we need to alter that flow; for example, we may need
to execute a particular block of code many times.
For this purpose, programming languages provide various kinds of loops,
which can repeat a specific block of code any number of times. Consider the
diagram below to understand how a loop statement works.

Why use loops


Looping simplifies complex problems into simple ones. It lets us alter the
flow of the program so that, instead of writing the same code again and
again, we can repeat it a finite number of times. For example, to print the
first 10 natural numbers we can use a loop that runs for 10 iterations
instead of writing 10 separate print statements.
Advantages of loops
1. It gives reusability of code.
2. We don't have to write the same code again and again, using loops.
3. We can traverse the elements of the data structures (array or linked
lists) using loops.
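For instance, the first 10 natural numbers mentioned above can be printed with a single loop instead of 10 print statements:

```python
# Print the first 10 natural numbers with one loop
numbers = []
for i in range(1, 11):
    numbers.append(i)  # collect the value (for checking later)
    print(i)
```

The same loop body runs 10 times, once per value produced by range().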

Loop Description
Statement

for loop The for loop is used when we need to execute part of the code until
the given condition is fulfilled. The for loop is also known as a pre-tested
loop. It is easier to use the for loop when the number of iterations is
known in advance.

while loop The while loop is used in situations where we don't know the
number of iterations beforehand. The block of statements is executed until
the condition stated in the while loop is satisfied. It is also
referred to as a pre-tested loop.

do-while loop The do-while loop continues until a given condition is
satisfied. It is also called a post-tested loop. It is used when the loop
must execute at least once (mostly menu-driven programs). Note that Python
has no built-in do-while loop; it can be emulated with a while loop.
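Since Python has no built-in do-while loop, the post-tested behavior can be emulated with a `while True` loop and a `break`. A minimal sketch (the limit of 3 is just an example value):

```python
# Emulating a do-while loop: the body always runs at least once,
# and the condition is tested AFTER the body (post-tested loop).
count = 0
results = []
while True:
    results.append(count)  # loop body
    count += 1
    if count >= 3:          # post-test: check the condition after the body
        break
print(results)
```

Even if the condition were true from the start, the body would still execute once, which is exactly the do-while behavior.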
Python for loop
In Python, the for loop is used to iterate over the statements or a
portion of the program several times. It is also used to traverse data
structures such as the list, tuple, or dictionary.
The syntax of the Python for loop is provided below.
for iterating_var in sequence:
    statement(s)

For Loop Flow Chart

For loop Using Sequence


Example-1: Iterating string using for loop
str = "Python"
for i in str:
    print(i)
OUTPUT
P
y
t
h
o
n
Example- 2: Program to print the table of the given number
list = [1,2,3,4,5,6,7,8,9,10]
n = 5
for i in list:
    c = n*i
    print(c)
OUTPUT
5
10
15
20
25
30
35
40
45
50
Example-3: Program to print the sum of the given list
list = [10,30,23,43,65,12]
sum = 0
for i in list:
    sum = sum + i
print("The sum is:",sum)
Output:
The sum is: 183

Nested for loop in python


Python enables us to nest any number of for loop inside a for loop. For each
iteration of the outer loop the inner loop is executed n number of times. The
syntax is listed below.
for iterating_var1 in sequence: #outer loop
    for iterating_var2 in sequence: #inner loop
        #block of statements
#Other statements

Example- 1: Nested for loop


# User input for number of rows
rows = int(input("Enter the rows:"))
# Outer loop will print number of rows
for i in range(0,rows+1):
    # Inner loop will print number of asterisks
    for j in range(i):
        print("*",end = '')
    print()
Output:
Enter the rows:5
*
**
***
****
*****

Example-2: Program to number pyramid


rows = int(input("Enter the rows"))
for i in range(0,rows+1):
    for j in range(i):
        print(i,end = '')
    print()
Output:

1
22
333
4444
55555

Using else statement with for loop


Unlike other languages such as C, C++, or Java, Python allows us to use an
else statement with the for loop, which is executed only if all the
iterations are completed (i.e. the loop was not terminated by a break).
Example 1
for i in range(0, 5):
    print(i)
else:
    print("for loop completely exhausted, since there is no break.")
Output:

0
1
2
3
4
for loop completely exhausted, since there is no break.
Example 2
for i in range(0, 5):
    print(i)
    break
else:
    print("for loop is exhausted")
print("The loop is broken due to break statement...came out of the loop")
Output:

0
Python While loop
The Python while loop executes a part of the code repeatedly until the
given condition becomes false. It is also referred to as a pre-tested loop.
The while loop is the best choice when we don't know the number of
iterations in advance.
The following syntax is given
while expression:
    statements

Example-1: Program to print 1 to 10 using while loop


i = 1
#The while loop will iterate until condition becomes false.
while i <= 10:
    print(i)
    i = i + 1
Output:
1
2
3
4
5
6
7
8
9
10
Example -2: Program to print table of given numbers.
i = 1
number = int(input("Enter the number:"))
while i <= 10:
    print("%d X %d = %d \n"%(number,i,number*i))
    i = i + 1
Output:
Enter the number:10
10 X 1 = 10

10 X 2 = 20

10 X 3 = 30

10 X 4 = 40
10 X 5 = 50

10 X 6 = 60

10 X 7 = 70

10 X 8 = 80

10 X 9 = 90

10 X 10 = 100

Infinite while loop


If the condition given in the while loop never becomes false, the while
loop never ends and becomes an infinite while loop.
Any non-zero value used as the while condition is always true, while zero
is always false. This kind of approach is useful when we want a program to
run continuously without interruption.
Example 1
while (1):
    print("Hi! we are inside the infinite while loop")
Output:

Hi! we are inside the infinite while loop


Hi! we are inside the infinite while loop
Example 2
var = 1
while (var != 2):
    i = int(input("Enter the number:"))
    print("Entered value is %d" %(i))
Output:

Enter the number:10


Entered value is 10
Enter the number:10
Entered value is 10
Enter the number:10
Entered value is 10
... and so on, infinitely (the loop never terminates).

Using else with while loop


Python also allows us to use an else statement with the while loop. The
else block is executed when the condition stated in the while declaration
becomes false. As with the for loop, if a break statement terminates the
while loop, the else block is not executed and control moves to the
statement after the else block. Consider the scenarios below.
Example 1
i = 1
while (i <= 5):
    print(i)
    i = i + 1
else:
    print("The while loop exhausted")
Example 2
i = 1
while (i <= 5):
    print(i)
    i = i + 1
    if (i == 3):
        break
else:
    print("The while loop exhausted")
When the break statement came across in the code above, the while loop
stopped its execution and skipped the else statement.
Example-3 Program to print Fibonacci numbers to given limit
terms = int(input("Enter the terms "))
# first two initial terms
a = 0
b = 1
count = 0

# check if the number of terms is zero or negative
if (terms <= 0):
    print("Please enter a valid integer")
elif (terms == 1):
    print("Fibonacci sequence upto", terms, ":")
    print(a)
else:
    print("Fibonacci sequence:")
    while (count < terms):
        print(a, end = ' ')
        c = a + b
        # updating values
        a = b
        b = c
        count += 1
Output:

Enter the terms 10


Fibonacci sequence:
0 1 1 2 3 5 8 13 21 34
Python break statement
The break is a Python keyword used to take program control out of the loop.
The break statement terminates only the innermost loop in which it appears;
in the case of nested loops, the outer loops continue. In other words,
break aborts the current execution of the loop, and control passes to the
first line after the loop.
The break is commonly used in cases where a given condition requires us to
break out of the loop.
SYNTAX
#loop statements
break
EXAMPLE 1
list = [1,2,3,4]
count = 1
for i in list:
    if i == 4:
        print("item matched")
        break
    count = count + 1
print("found at",count,"location")
Output:

item matched
found at 4 location
EXAMPLE 2
str = "python"
for i in str:
    if i == 'o':
        break
    print(i)
Output:

p
y
t
h
Example 3: break statement with while loop
i = 0
while 1:
    print(i, " ", end="")
    i = i + 1
    if i == 10:
        break
print("came out of while loop")
Output:
0 1 2 3 4 5 6 7 8 9 came out of while loop
Example 4
n = 2
while 1:
    i = 1
    while i <= 10:
        print("%d X %d = %d\n" %(n,i,n*i))
        i = i + 1
    choice = int(input("Do you want to continue printing the table, press 0 for no?"))
    if choice == 0:
        break
    n = n + 1
OUTPUT
2X1=2
2X2=4
2X3=6
2X4=8
2 X 5 = 10
2 X 6 = 12
2 X 7 = 14
2 X 8 = 16
2 X 9 = 18
2 X 10 = 20
Do you want to continue printing the table, press 0 for no?
Python continue Statement
The continue statement in Python is used to bring program control back to
the beginning of the loop. It skips the remaining lines of code inside the
loop and begins the next iteration. It is mostly used for a specific
condition inside the loop, so that we can bypass particular code when that
condition holds.
SYNTAX
#loop statements
continue
#the code to be skipped
FLOW DIAGRAM

Example 1
i = 0
while (i < 10):
    i = i + 1
    if (i == 5):
        continue
    print(i)
Output:

1
2
3
4
6
7
8
9
10
EXAMPLE
str = "LearnTPython"
for i in str:
    if (i == 'T'):
        continue
    print(i)
OUTPUT
L
e
a
r
n
P
y
t
h
o
n
Python String
A Python string is a sequence of characters enclosed in single quotes,
double quotes, or triple quotes. The computer doesn't understand
characters; internally, it stores and manipulates them as combinations of
0s and 1s.
Each character is stored using its ASCII or Unicode value. So we can say
Python strings are sequences of Unicode characters.
Python enables us to build strings with single quotes, double quotes, or
triple quotes.
Syntax:
str = "Hi Python !"

Creating String in Python


We can create a string by enclosing the characters in single quotes or
double quotes. Python also provides triple-quoted strings, which are
generally used for multiline strings or docstrings.
#Using single quotes
str1 = 'Hello Python'
print(str1)
#Using double quotes
str2 = "Hello Python"
print(str2)

#Using triple quotes


str3 = '''Triple quotes are generally used to
represent the multiline or
docstring'''
print(str3)
Output:

Hello Python
Hello Python
Triple quotes are generally used to
represent the multiline or
docstring

STRING OPERATORS
Operator Description

+ It is known as the concatenation operator, used to join the strings given
on either side of the operator.

* It is known as the repetition operator. It concatenates multiple copies
of the same string.

[] It is known as the slice operator. It is used to access individual
characters of a particular string.

[:] It is known as range slice operator. It is used to access the characters from
the specified range.

in It is known as the membership operator. It returns true if a particular
sub-string is present in the specified string.

not in It is also a membership operator and does the exact reverse of in. It returns
true if a particular substring is not present in the specified string.

r/R It is used to specify the raw string. Raw strings are used in the cases where
we need to print the actual meaning of escape characters such as
"C://python". To define any string as a raw string, the character r or R is
followed by the string.

% It is used to perform string formatting. It makes use of the format
specifiers used in C programming, like %d or %f, to map their values in
Python. We will discuss how formatting is done in Python.

EXAMPLE
Find the example below for understanding the practical use of Python
operators.
str = "Hello"
str1 = " world"
print(str*3) # prints HelloHelloHello
print(str+str1)# prints Hello world
print(str[4]) # prints o
print(str[2:4]); # prints ll
print('w' in str) # prints false as w is not present in str
print('wo' not in str1) # prints false as wo is present in str1.
print(r'C://python37') # prints C://python37 as it is written
print("The string str : %s"%(str)) # prints The string str : Hello

Output:
HelloHelloHello
Hello world
o
ll
False
False
C://python37
The string str : Hello
Python Tuple
A Python tuple is used to store a collection of Python objects that cannot
be changed. The tuple is similar to a list, but whereas we can adjust the
value of the items stored in a list, the tuple is immutable and the value
of the items stored in it cannot be altered.
A tuple can be written as a collection of comma-separated (,) values
enclosed by small () brackets. The parentheses are optional, but using them
is good practice. We may describe a tuple as follows.
T1 = (101, "Peter", 22)
T2 = ("Apple", "Banana", "Orange")
T3 = 10,20,30,40,50

print(type(T1))
print(type(T2))
print(type(T3))
Output:
<class 'tuple'>
<class 'tuple'>
<class 'tuple'>
A tuple is indexed in the same way as a list. The items in a tuple can be
accessed using their unique index values.
Example - 1
tuple1 = (10, 20, 30, 40, 50, 60)
print(tuple1)
count = 0
for i in tuple1:
    print("tuple1[%d] = %d" %(count, i))
    count = count + 1
Output:
(10, 20, 30, 40, 50, 60)
tuple1[0] = 10
tuple1[1] = 20
tuple1[2] = 30
tuple1[3] = 40
tuple1[4] = 50
tuple1[5] = 60
Example - 2
tuple1 = tuple(input("Enter the tuple elements ..."))
print(tuple1)
count = 0
for i in tuple1:
    print("tuple1[%d] = %s" %(count, i))
    count = count + 1
Output :
Enter the tuple elements ...123456
('1', '2', '3', '4', '5', '6')
tuple1[0] = 1
tuple1[1] = 2
tuple1[2] = 3
tuple1[3] = 4
tuple1[4] = 5
tuple1[5] = 6

Basic Tuple operations


Operator Description Example

Repetition The repetition operator enables the tuple elements to be
repeated multiple times. T1*2 = (1, 2, 3, 4, 5, 1, 2, 3, 4, 5)

Concatenation It concatenates the tuples mentioned on either side of the
operator. T1+T2 = (1, 2, 3, 4, 5, 6, 7, 8, 9)

Membership It returns true if a particular item exists in the tuple,
otherwise false. print(2 in T1) prints True.

Iteration The for loop is used to iterate over the tuple elements.
for i in T1:
    print(i)
Output:
1
2
3
4
5

Python Tuple inbuilt functions


SN Function Description

1 cmp(tuple1, It compares two tuples and returns true if tuple1 is greater than
tuple2) tuple2, otherwise false. (Python 2 only; removed in Python 3.)

2 len(tuple) It calculates the length of the tuple.

3 max(tuple) It returns the maximum element of the tuple

4 min(tuple) It returns the minimum element of the tuple.

5 tuple(seq) It converts the specified sequence to the tuple.
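A short demonstration of these built-in functions on a sample tuple (note that cmp() exists only in Python 2 and was removed in Python 3, so it is omitted; tuples can instead be compared directly with ==, < and >):

```python
t = (10, 20, 30, 40, 50)

print(len(t))            # length of the tuple
print(max(t))            # maximum element
print(min(t))            # minimum element
print(tuple([1, 2, 3]))  # convert a list (sequence) to a tuple

# In Python 3, compare tuples element by element with comparison operators
print((1, 2) < (1, 3))
```

These functions work on any sequence type, not only tuples.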


HADOOP
BASICS

PROGRAMMING
FOR BEGINNERS

J KING
HADOOP INTRO
This book includes the basics of Big Data Hadoop with HDFS, MapReduce,
Yarn, Hive, HBase, Pig, Sqoop etc.
Hadoop is an open-source framework.
Apache provides it for the processing and analysis of very huge volumes of
data.
It's written in Java and used by Google, Facebook, LinkedIn, Yahoo, Twitter
and so on.

Big Data
Big Data is the name given to data which is very large in size.
We normally work on data of MB size (Word docs, Excel) or at most GB size
(movies, code), but data in petabytes, i.e. 10^15 bytes in size, is called
Big Data.
It is stated that almost 90 percent of today's data has been generated in
the last 3 years.

Sources of Big Data


These data come from many sources:
Social networking sites: Facebook, Google, LinkedIn. All these
sites generate enormous amounts of data on a daily basis, as they
have billions of users around the world.
Weather Station: All the weather station and satellite provides
very huge data that is stored and manipulated for weather forecast.
E-commerce site: Sites like Amazon, Flipkart, Alibaba generate
huge amounts of logs from which users' buying trends can be
traced.
Share Market: Worldwide, stock exchange generates enormous
amounts of data through its daily transaction.
Telecom company: Telecom giants like Airtel, Vodafone are
studying user trends and publishing their plans accordingly, and
store their million user data for this.

3V's
1. Variety: Nowadays data are not stored in rows and columns alone.
Data is both structured and unstructured. CCTV footage and log files are
unstructured data.
Data that can be saved in tables is structured data, such as the bank's
transaction data.
2. Velocity: The data grows at a very fast rate. The volume of the data is
estimated to double in every 2 years.
3. Volume: The quantity of data we handle is very large in Peta Bytes.

Usage
An e-commerce site XYZ (with 100 million users) wants to offer some gift
voucher to its top 10 customers who have spent the most in the previous year.
In addition, they want to find these customers' buying trend so that the
company can suggest more items related to them.

Issues in Big Data


Enormous amount of unstructured data that must be stored , processed and
analyzed.

Solving
Storage: To store this enormous amount of data, Hadoop uses HDFS (Hadoop
Distributed File System), which uses commodity hardware to form clusters
and store data in a distributed manner.
It works on the write-once, read-many-times principle.

Processing: The MapReduce paradigm is applied to data distributed over a
network to find the required output.

Cost: Hadoop is open source, so cost isn't a problem anymore.

Analyze: The data can be analyzed using Pig and Hive.


What is Hadoop
Hadoop is Apache's open-source framework, used to store, process, and
analyze data that is very large in volume. It is used by Facebook, Yahoo,
Google, Twitter, LinkedIn and many more for batch / offline processing.
Hadoop is written in Java and is not for OLAP (online analytical
processing). It can be scaled up merely by adding nodes to the cluster.

Modules
• Yarn: Yet Another Resource Negotiator is used for job planning
and cluster management.
• HDFS: Google published its GFS paper, and HDFS was developed
based on it. It states that, over the distributed architecture,
files are broken into blocks and stored in nodes. HDFS stands
for Hadoop Distributed File System.
• Hadoop Common: These Java libraries are used to start Hadoop and
are used by the other Hadoop modules.
• Map Reduce: This is a framework that helps Java programs do
parallel data computation using key-value pairs. The Map task takes
input data and converts it into a data set that can be
computed in key-value pairs. The output of the Map task is consumed
by the Reduce task, and then the output of the reducer gives the
desired result.

Architecture
The Hadoop architecture is a package of the file system, the MapReduce
engine, and HDFS (Hadoop Distributed File System).
The MapReduce engine may be either MapReduce/MR1 or YARN/MR2.
A Hadoop cluster comprises a single master and several slave nodes.
The master node includes the Job Tracker, Task Tracker, NameNode, and
DataNode, while the slave node includes the DataNode and TaskTracker.

HDFS
HDFS has a master/slave architecture.
This architecture consists of a single NameNode performing the master role
and multiple DataNodes performing the slave role.
Both the NameNode and DataNode are capable of running on commodity
machines. HDFS is developed using the Java language, so any machine that
supports Java can easily run the NameNode and DataNode software.

DataNode
There are multiple DataNodes within the HDFS cluster.
Each DataNode is composed of several data blocks.
Those blocks of data are used to store data.
Upon instruction from the NameNode it performs block creation, deletion
and replication.
It is the DataNode's responsibility to serve read and write requests from
the clients of the file system.

NameNode

There is a single master server in the HDFS cluster.

Since it is a single node, it may become a single point of failure.

It simplifies the system's architecture.
It manages the file system namespace by performing operations such as
opening, renaming, and closing files.

Task Tracker
It works as a slave node for the Job Tracker.
It receives tasks and code from the Job Tracker and applies that code to
the file. This process can also be called a Mapper.

Job Tracker
The role of the Job Tracker is to accept the MapReduce jobs from clients
and process the data using the NameNode.
In response, the NameNode provides metadata to the Job Tracker.

MapReduce Layer
The MapReduce layer comes into play when the client application submits a
MapReduce job to the Job Tracker. The Job Tracker sends the request to the
appropriate Task Trackers. Sometimes a TaskTracker fails or times out; in
such a case, that portion of the job is rescheduled.

Advantages
Cost Effective: Hadoop is open source and uses commodity hardware to store
data so compared to traditional relational database management system it is
really cost-effective.
Fast: In HDFS, the data is distributed and mapped across the cluster, which
helps in faster retrieval. Even the tools to process the data are often on
the same servers, which reduces the processing time. It can process
terabytes of data in minutes, and petabytes in hours.
Scalable: The Hadoop cluster can be extended by simply adding nodes to the
cluster.

History of Hadoop
Let's focus on the history of Hadoop in the following steps: -
Doug Cutting and Mike Cafarella began working on a project entitled
Apache Nutch in 2002. It is a software project with open source web
crawler.
While working on Apache Nutch, they were dealing with big data. Storing
that data would have cost a great deal, which became one of the
significant reasons for the emergence of Hadoop.
In 2003 Google developed the GFS (Google File System) file system. It is
a proprietary distributed filesystem developed for efficient data access.
Google posted a white paper on Map Reduce in 2004. This technique
simplifies the processing of data on big clusters.
In 2005, a new file system named NDFS (Nutch Distributed File System) was
introduced by Doug Cutting and Mike Cafarella. This file system also
included MapReduce.
Doug Cutting joined Yahoo in 2006. There he introduced a new project,
Hadoop, based on the Nutch project, with a file system known as HDFS
(Hadoop Distributed File System). The first version of Hadoop, 0.1.0, was
released that year.
Doug Cutting named his project Hadoop after his son's toy elephant.

In 2007, Yahoo operates two 1000-machine clusters.


In 2008, Hadoop became the fastest system to sort 1 terabyte of data
within 209 seconds on a cluster of 900 nodes.

Hadoop 2.2 was released in the year 2013.

Hadoop 3.0 was released in the year 2017.


HADOOP MODULES
Hadoop comes with a distributed file system called HDFS.
In HDFS, data is distributed over several machines and replicated to
ensure durability against failure and high availability for parallel
processing.
It includes the notions of blocks, data nodes, and the name node.

Not to use HDFS


Low Latency Data Access: Applications that require very fast access to the
first record should not use HDFS, as it gives importance to the whole data
set rather than the time to retrieve the first record.
Lots of Small Files: The name node holds the metadata of files in memory,
and if the files are small, this takes a lot of name node memory, which is
not feasible.
Multiple Writes: HDFS should not be used when the data must be written
multiple times.

To use HDFS
Very Large Files: Files should consist of hundreds of megabytes or more.
Streaming Data Access: The time to read the whole data set is more
important than the latency in reading the first record.
HDFS is built on the write-once and read-many-times pattern.
Commodity Hardware: It works on low-cost hardware.

HDFS Concepts
Blocks: A block is the minimum amount of data that HDFS can read or write.
HDFS blocks are 128 MB by default, and this is configurable. Files in HDFS
are broken into block-sized chunks, which are stored as independent units.
Unlike a regular file system, if a file in HDFS is smaller than the block
size, it does not occupy the full block; e.g. a 5 MB file in an HDFS with a
block size of 128 MB takes only 5 MB of space.
The HDFS block size is large only to minimize seek costs.
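As a small illustration of this block arithmetic (the 128 MB default is real; the 518 MB file size is just a hypothetical example):

```python
import math

BLOCK_SIZE_MB = 128   # default HDFS block size (configurable)
file_size_mb = 518    # hypothetical file size

# A file is split into ceil(size / block_size) blocks; the last block
# occupies only as much space as the remaining data.
num_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
last_block_mb = file_size_mb - (num_blocks - 1) * BLOCK_SIZE_MB
print(num_blocks, last_block_mb)
```

Here the file is split into 5 blocks, and the last block stores only the remaining 6 MB rather than a full 128 MB.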
Data Node: They store and retrieve blocks when told to by the client or the
name node.
As commodity hardware, the data node also performs the work of block
creation, deletion, and replication, as stated by the name node.
They periodically report back to the name node with the list of blocks
they store.

Name Node: The Name Node is the controller and manager of HDFS, as it
knows the status and the metadata of all the files in HDFS; the metadata
information includes file permissions, names, and the location of each
block.
The metadata is small, so it is stored in the memory of the name node,
allowing faster access to data.
Moreover, the HDFS cluster is accessed by multiple clients concurrently,
so all this information is handled by a single machine.
It executes file system operations such as opening, closing, renaming,
etc.

The name node is very important because it stores all the metadata.

If it fails, the file system cannot be used, as there would be no way to
know how to reconstruct the files from the blocks present in the data
nodes.
To overcome this, the concept of a secondary name node arises.
Secondary Name Node: This is a separate physical machine that acts as a
helper to the name node.
It performs periodic checkpoints.
It communicates with the name node and takes snapshots of the metadata,
which helps minimize downtime and data loss.
Starting HDFS
The HDFS should initially be formatted and then started in distributed mode.
Commands are given hereafter.
To Format $ hadoop namenode -format
To Start $ start-dfs.sh

Basic File Operations- HDFS


1. Putting data to HDFS from local file system

First create an HDFS folder where the data may be placed from the
local file system.
$ hadoop fs -mkdir /user/test
Copy the file "data.txt" from the local folder
/usr/home/Desktop to the HDFS folder /user/test

$ hadoop fs -copyFromLocal /usr/home/Desktop/data.txt


/user/test
Display the content of HDFS folder

$ hadoop fs -ls /user/test


2. Copying data from HDFS to local file system
$ hadoop fs -copyToLocal /user/test/data.txt
/usr/bin/data_copy.txt
3. Compare the files and see that both are same
$ md5 /usr/bin/data_copy.txt /usr/home/Desktop/data.txt

Recursive deleting
hadoop fs -rmr <arg>

Example:
hadoop fs -rmr /user/sonoo/

HDFS Other commands


The following conventions are used in the commands below:
"<path>" means any file or directory name.
"<path>..." means one or more file or directory names.
"<file>" means any filename.
"<src>" and "<dest>" are path names in a directed operation.
"<localSrc>" and "<localDest>" are paths as above, but on the local file
system
put <localSrc><dest>

Copies the file or directory from the local file system identified by
localSrc to dest within the DFS.
copyFromLocal <localSrc><dest>

Identical to -put
moveFromLocal <localSrc><dest>

Copies the file or directory from the local file system identified by
localSrc to dest within HDFS, and then deletes the local copy on
success.
get [-crc] <src><localDest>

Copies the file or directory in HDFS identified by src to the local file
system path identified by localDest.
cat <filename>
Displays the contents of filename on stdout.
moveToLocal <src><localDest>

Works like -get, but deletes the HDFS copy on success.


setrep [-R] [-w] rep <path>

Sets the target replication factor for files identified by path to rep. (The
actual replication factor will move toward the target over time)
touchz <path>

Creates a file at path containing the current time as a timestamp. Fails


if a file already exists at path, unless the file is already size 0.
test -[ezd] <path>

Returns 1 if path exists (-e), has zero length (-z), or is a directory
(-d); 0 otherwise.
stat [format] <path>

Prints information about path. Format is a string which accepts file size
in blocks (%b), filename (%n), block size (%o), replication (%r), and
modification date (%y, %Y).
Features and Goals Of HDFS
It can handle the application which contains large data sets with ease.
Unlike other distributed file systems, HDFS can be deployed on low-cost
hardware and is highly tolerant to faults.
Let's look at some of HDFS' significant features and goals.

HDFS Features
Replication-The node containing the data could be a loss due to some
unfavorable conditions. So HDFS always maintains the copy of data on a
different machine to overcome such problems.
Highly Scalable-HDFS is highly scalable because in a single cluster it can
scale hundreds of nodes.
Distributed data storage-This is one of HDFS's most important features ,
making Hadoop very powerful. Here, data is broken down into several blocks
and stored.
Fault tolerance-In HDFS, fault tolerance means robustness of the system in
case of failure. HDFS is so highly fault-tolerant that if any machine
fails, the other machine containing a copy of that data automatically
becomes active.

HDFS Goals
Hardware failure handling-The HDFS includes multiple server machines.
Anyhow, if any machine fails, the aim of the HDFS is to quickly recover it.
Coherence Model-Applications running on HDFS should follow the
write-once-read-many approach. So a file, once created, need not be
changed, although it can be appended and truncated.
Access to streaming data-HDFS applications typically need streaming access
to their data sets, rather than the general-purpose access of ordinary
file systems.
YARN
YARN (Yet Another Resource Negotiator) takes Hadoop beyond the
MapReduce-only model and makes the cluster interactive, allowing other
applications, such as HBase and Spark, to work on it.
Different YARN applications can co-exist on the same cluster, so
MapReduce, HBase, and Spark can all run at the same time, bringing great
advantages for manageability and cluster utilization.

Components Of YARN
Client: Used to submit MapReduce jobs.
Resource Manager: Manages the use of resources across the cluster.
MapReduce Application Master: Checks the tasks running the MapReduce job.
The application master and the MapReduce tasks run in containers scheduled
by the resource manager and managed by the node managers.
Node Manager: For launching and monitoring the compute containers on
machines in the cluster.
The Jobtracker & Tasktracker were used in the previous version of Hadoop,
and were responsible for resource handling and checking progress.
Hadoop 2.0, however, has the Resource Manager and NodeManager to overcome
the shortfalls of the Jobtracker & Tasktracker.

Benefits
Multitenancy: Different versions of MapReduce can run on YARN, which makes
the process of upgrading MapReduce more manageable.
Scalability: MapReduce 1 hits a scalability bottleneck at 4000 nodes and
40000 tasks, but YARN is designed for 10,000 nodes and 1 lakh tasks.
Utilization: The Node Manager manages a pool of resources, rather than a
fixed number of designated slots, thus increasing utilization.
HADOOP MapReduce
MapReduce is a data processing tool used to process data in parallel in a distributed form. It is based on the paper "MapReduce: Simplified Data Processing on Large Clusters," published by Google in 2004.
MapReduce is a two-phase paradigm: the mapper phase and the reducer phase. The input is given to the Mapper in the form of key-value pairs. The Mapper's output is fed as input to the Reducer. The Reducer runs only after the Mapper has finished. The Reducer also takes input in key-value format, and the Reducer's output is the final result.

Map Reduce Steps


The map takes the data in the form of pairs and returns a list of <key, value> pairs. Here the keys are not unique.
The Hadoop architecture applies sort and shuffle to the output of the Map. The sort and shuffle act on this list of <key, value> pairs and send out each unique key together with the list of values associated with it: <key, list(values)>.
The sort and shuffle output is sent to the reducer phase. The reducer executes a defined function on the list of values for each unique key, and the final output <key, value> is stored/displayed.

Sort and Shuffle


When the Mapper task is complete, the results are sorted by key, partitioned if multiple reducers are present, and then written to disk.
Using the input <k2, v2> from each Mapper, we collect all the values for each unique key k2.
This shuffle-phase output, in the form of <k2, list(v2)>, is sent to the reducer phase as input.
The sort and shuffle occur on the Mapper output, before the reducer.
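The map → sort/shuffle → reduce flow above can be sketched in Python. This is a minimal, single-process simulation of the phases (real Hadoop distributes them across nodes), using word counting as the example job:

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map phase: emit <word, 1> pairs; keys are not unique yet.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Sort by key, then group, turning the <k2, v2> pairs
    # into <k2, list(v2)> with one entry per unique key.
    pairs.sort(key=itemgetter(0))
    return [(key, [v for _, v in group])
            for key, group in groupby(pairs, key=itemgetter(0))]

def reducer(key, values):
    # Reduce phase: collapse list(values) into the final <key, value>.
    return (key, sum(values))

lines = ["big data big cluster"]
mapped = [pair for line in lines for pair in mapper(line)]
shuffled = shuffle(mapped)
result = dict(reducer(k, vs) for k, vs in shuffled)
print(result)  # {'big': 2, 'cluster': 1, 'data': 1}
```

Note how "big" appears twice in the mapped list but only once, with the value list [1, 1], after the shuffle — exactly the <key, list(values)> shape the reducer expects.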
MapReduce Usage
It can be used in different applications, such as document clustering, distributed sorting, and reversing the web link graph.
It can be used in multiple computing environments, such as multi-cluster, multi-core, and mobile environments.
Google used it to regenerate its index of the World Wide Web.
It can be used for distributed pattern-based searching.
We can also use MapReduce in machine learning.
MapReduce - Data Flow
MapReduce processes enormous amounts of data.
The data must flow through different phases to be handled in a parallel and distributed form.

MapReduce data flow Phases


Input reader
The input reader reads the incoming data and splits it into data blocks of the appropriate size (64 MB to 128 MB).
Each data block has a Map function associated with it.
Once the data is read, the corresponding key-value pairs are generated.
The input files reside in HDFS.

Map function
The map function processes the incoming key-value pairs and generates the corresponding output key-value pairs.
The map input and output types may differ from each other.

Partition function
The partition function assigns the output of each Map function to the appropriate reducer.
The function is given the key and value, and it returns the index of the assigned reducer.
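A minimal Python sketch of such a partition function, mirroring the idea behind Hadoop's default hash partitioner (the function name and key values here are illustrative):

```python
def partition(key, num_reducers):
    # Hash the key and take it modulo the number of reducers,
    # so every occurrence of the same key lands on the same reducer.
    return hash(key) % num_reducers

# The same key always maps to the same reducer index.
print(partition("big", 4) == partition("big", 4))  # True
```

Because equal keys always hash to the same reducer index, each reducer is guaranteed to see the complete list of values for every key it handles.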

Shuffling and Sorting


The data is shuffled between and within nodes so that it moves out of the map phase and is ready for processing by the reduce function.
Data shuffling can sometimes take a lot of computation time.
The sorting operation is performed on the input data for the Reduce function.
Here, the data is compared using a comparison function and arranged in sorted form.

Reduce function
Each unique key is assigned to the Reduce function. The keys are already in sorted order.
The Reduce function iterates over the values associated with each key and generates the corresponding output.

Output writer
The output writer's role is to write the Reduce output to stable storage.
The output writer executes once the data has flowed through all the phases above.
MapReduce API
Here we learn about the classes and methods used to program MapReduce. This section focuses on the MapReduce APIs.

MapReduce Mapper Class


In MapReduce, the role of the Mapper class is to map input key-value pairs to a set of intermediate key-value pairs.
It transforms the input records into intermediate records.
These intermediate records are associated with a given output key and passed to the Reducer for the final output.

Mapper Class Methods


void cleanup(Context context)-This method is called only once at the end of the task.
void map(KEYIN key, VALUEIN value, Context context)-This method is called once for each key-value pair in the input split.
void run(Context context)-This method can be overridden to control the execution of the Mapper.
void setup(Context context)-This method is called only once at the start of the task.

MapReduce Reducer Class


The role of the Reducer class in MapReduce is to reduce the set of intermediate values that share a key.
Its implementations can access the Configuration via the JobContext.getConfiguration() method.

Reducer Class Methods


void cleanup(Context context)-This method is called only once at the end of the task.
void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)-This method is called once for each key.
void run(Context context)-This method can be used to control the tasks of the Reducer.
void setup(Context context)-This method is called only once at the start of the task.

MapReduce Job Class


The Job class is used to configure the job, submit it, control its execution, and query its state.
Once the job is submitted, the set methods throw an IllegalStateException.

Methods of Job Class


Methods-Description
Counters getCounters()-This method is used to get the counters for the job.
long getFinishTime()-This method is used to get the finish time of the job.
Job getInstance()-This method is used to create a new Job without a cluster.
Job getInstance(Configuration conf)-This method is used to create a new Job without a cluster, with the provided configuration.
Job getInstance(Configuration conf, String jobName)-This method is used to create a new Job without a cluster, with the provided configuration and job name.
String getJobFile()-This method is used to get the path of the configuration for the submitted job.
String getJobName()-This method is used to get the user-specified job name.
JobPriority getPriority()-This method is used to get the scheduling priority of the job.
void setJarByClass(Class<?> c)-This method is used to set the jar for the job by finding where the given class came from.
void setJobName(String name)-This method is used to set the user-specified job name.
void setMapOutputKeyClass(Class<?> class)-This method is used to set the key class for the map output data.
void setMapOutputValueClass(Class<?> class)-This method is used to set the value class for the map output data.
void setMapperClass(Class<? extends Mapper> class)-This method is used to set the Mapper for the job.
void setNumReduceTasks(int tasks)-This method is used to set the number of reduce tasks for the job.
void setReducerClass(Class<? extends Reducer> class)-This method is used to set the Reducer for the job.
MapReduce - Word Count Example
In the MapReduce word count example, we find out the frequency of each word.
Here, the Mapper's role is to map the keys to the existing values, and the Reducer's role is to aggregate the keys with common values.
So, everything is represented in the form of key-value pairs.

To execute MapReduce word count example


Create a text file on your local machine and write some text into it.
$ nano data.txt

Check the text written in the data.txt file.


$ cat data.txt
In this example, we find out the frequency of each word that exists in this text file.
Create a directory in HDFS where the text file is to be kept.
$ hdfs dfs -mkdir /test
Upload the data.txt file to a specific directory on HDFS.
$ hdfs dfs -put /home/codegyani/data.txt /test

Write the MapReduce program using Eclipse.
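The same job can also be expressed in Python via Hadoop Streaming, where the mapper and reducer are plain scripts connected through stdin/stdout with a sort in between. The sketch below shows the two stages chained locally instead of launched by Hadoop, and the script names mapper.py and reducer.py mentioned in the comments are illustrative:

```python
from itertools import groupby

def map_words(lines):
    # mapper.py logic: emit "word<TAB>1" for every word.
    # Keys are not unique at this stage.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reduce_counts(sorted_pairs):
    # reducer.py logic: Hadoop sorts the mapper output by key,
    # so all lines for one word arrive adjacent to each other.
    split = [pair.split("\t") for pair in sorted_pairs]
    for word, group in groupby(split, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"

# In a real Streaming job, each function would read stdin in its own
# script; here we chain them locally with an explicit sort.
for line in reduce_counts(sorted(map_words(["big data big cluster"]))):
    print(line)
```

Hadoop Streaming would run these scripts on the cluster, with the framework performing the sort between the two stages.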


MapReduce Char Count Example
In the MapReduce char count example, we find out the frequency of each character.
Here, the Mapper's role is to map the keys to the existing values, and the Reducer's role is to aggregate the keys with common values.
So, everything is represented in the form of key-value pairs.

To execute MapReduce char count example


Create a text file on your local machine and write some text into it.
$ nano info.txt
Check the text written in the info.txt file.
$ cat info.txt

In this example, we find out the frequency of each character that exists in this text file.
Create a directory in HDFS where the text file is to be kept.
$ hdfs dfs -mkdir /count
Upload the info.txt file to the specific directory on HDFS.
$ hdfs dfs -put /home/codegyani/info.txt /count

Write the MapReduce program using Eclipse.


HBase
This HBase chapter provides basic as well as advanced HBase concepts and is designed for beginners.
HBase is an open-source framework provided by Apache.
It is a sorted-map data store built on Hadoop. It is column-oriented and horizontally scalable.
Our HBase chapter covers all Apache HBase topics: the HBase data model, HBase Read, HBase Write, HBase MemStore, HBase installation, RDBMS vs HBase, HBase commands, an HBase example, etc.
You should have knowledge of Hadoop and Java before learning HBase.
HBase Intro
Built on Hadoop, HBase is an open-source, sorted-map data store. It is column-oriented and horizontally scalable.
HBase provides APIs that allow development in virtually any programming language.
It is modeled on Google's Bigtable. It has a set of tables that keep data in key-value format.
It is part of the Hadoop ecosystem and provides real-time random read/write access to data in the Hadoop File System.
HBase is suitable for sparse data sets, which are very common in big data use cases.

HBase Use
An RDBMS slows exponentially as the data gets big.
An RDBMS expects highly structured data, i.e., data that fits within a well-defined schema.
Any schema change could require downtime.
Keeping NULL values for sparse datasets adds too much overhead.

Characteristics of HBase
Automatic failover: Automatic failover is a feature that switches data handling to a standby system automatically if the primary system fails.
Horizontally scalable: Columns can be added at any time.
Often referred to as a key-value store, a column-family-oriented database, or as storing versioned maps of maps.
It is a sparse, distributed, persistent, multidimensional sorted map, indexed by row key, column key, and timestamp.
Map/Reduce framework integration: Map/Reduce is implemented internally by the commands and Java code to do the work, built over the Hadoop Distributed File System.
It does not enforce relationships within your data.
Basically, it is a platform for random-access data storage and retrieval.
It doesn't care about datatypes (you can store an integer in a column in one row and a string in the same column in another row).
It's designed to run on a cluster of computers built with commodity hardware.
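The "multidimensional sorted map indexed by row key, column key, and timestamp" can be sketched as a toy, in-memory Python model. This is illustrative only — real HBase is persistent and distributed, and the row/column names below are made up:

```python
from collections import defaultdict

# Toy model: rowkey -> column family -> qualifier -> {timestamp: value}.
table = defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))

def put(row, family, qualifier, value, ts):
    # Writing never overwrites: each timestamp is a separate version.
    table[row][family][qualifier][ts] = value

def get(row, family, qualifier):
    # Return the most recent version, as HBase does by default.
    versions = table[row][family][qualifier]
    return versions[max(versions)]

put("row1", "region", "city", "Delhi", ts=1)
put("row1", "region", "city", "Mumbai", ts=2)  # a newer version
print(get("row1", "region", "city"))  # Mumbai
```

The point of the sketch is the versioned-map shape: the old value at ts=1 is still present, but a default read returns the latest timestamp.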
HBase Read
An HBase read must be reconciled between the HFiles, the MemStore, and the BlockCache. The BlockCache is designed to keep frequently accessed data from the HFiles in memory so as to avoid disk reads. Each column family has its own BlockCache. The BlockCache holds data in "blocks", the unit of data that HBase reads from disk in a single pass. This means that reading a block from HBase requires only looking up that block's location in the index and retrieving it from disk.
Block: It is the smallest indexed unit of data and the smallest unit of data that can be read from disk. The default size is 64 KB.
Smaller block sizes are preferred when you perform random lookups, but smaller blocks create a bigger index and thus consume more memory.
Larger block sizes are preferred when you frequently perform sequential scans; larger blocks mean fewer index entries and therefore a smaller index, saving memory.
Reading an HBase row requires checking the MemStore first, then the
BlockCache, and finally accessing HFiles on the disk.
HBase Write
By default, when a write is made it goes into two places:
write-ahead log (WAL), HLog, and
in-memory write buffer, MemStore.

HBase MemStore
During writings, clients do not interact directly with the underlying HFiles,
but rather write goes in parallel to WAL & MemStore. Any write to HBase
requires both WAL and MemStore confirmation.
The MemStore is a write buffer where HBase accumulates data before a permanent write.
Its contents are flushed to disk to form an HFile when the MemStore fills up.
It doesn't write to an existing HFile; instead, it forms a new HFile on each flush.
The HFile is the underlying storage format of HBase.
HFiles belong to a column family (one MemStore per column family). A column family can have multiple HFiles, but the reverse is not true.

The MemStore flush size is configured in hbase-site.xml via the property hbase.hregion.memstore.flush.size.
Every HBase cluster server keeps a WAL to record changes as they occur. The WAL is a file on the underlying file system. A write is not considered successful until the new WAL entry is successfully written; this ensures durability.
If HBase goes down, data that has not yet been flushed from the MemStore to an HFile can be recovered by replaying the WAL, which the HBase framework takes care of.
RDBMS vs HBase
Below are the differences between RDBMS and HBase.
The RDBMS schema / database can be compared with Hbase namespace.
You can compare a table in RDBMS to a column family in Hbase.
You can compare a record (after table joins) in RDBMS to a record in HBase.
A collection of RDBMS tables can be compared with an HBase table.
HBase Example
We need to import the data present in a file into an HBase table, creating the table through the Java API.
Data_file.txt contains the following data

The Java code can be seen below

This data must be entered into a new HBase table created through the Java API.
"sample,region,time.product,sale,profit".
Column family time has two column qualifiers: year, month
Column family region has three column qualifiers: country, state, city

Jar Files
Make sure that the following jars are present while writing the code as they
are required by the HBase.
a. commons-logging-1.0.4
b. commons-logging-api-1.0.4
c. hadoop-core-0.20.2-cdh3u2
d. hbase-0.90.4-cdh3u2
e. log4j-1.2.15
f. zookeeper-3.3.3-cdh3u0

Program Code
Hive Intro
Apache Hive is a data warehouse system for Hadoop that runs SQL-like queries called HQL (Hive Query Language), which are converted internally into MapReduce jobs. Hive was developed by Facebook.
Hive is used to analyze structured data, and it makes it possible to read, write, and manage massive datasets residing in distributed storage.
Using Hive, we can skip the traditional requirement of writing complex MapReduce programs. Hive supports Data Definition Language (DDL), Data Manipulation Language (DML), and User Defined Functions (UDF).

Features
Hive is fast and scalable.
It provides SQL-like queries (i.e., HQL) that are implicitly transformed into MapReduce or Spark jobs.
It can analyze large datasets stored in HDFS.
It supports various data formats, including plain text, RCFile, and HBase.
It uses indexing to speed up queries.
It can operate on compressed data stored in the Hadoop ecosystem.
It supports user-defined functions (UDFs), through which users can provide their own functionality.

Limitations
Hive is not able to handle real-time data.
It is not intended for online transaction processing.
Hive queries have high latency.
Hive Vs Pig
Hive is generally used by data analysts.
Pig is generally used by programmers.
Hive uses SQL-like queries.
Pig uses a data-flow language.
Hive can manage structured data.
Pig can handle semi-structured data.
Hive works on the server side of an HDFS cluster.
Pig works on the client side of an HDFS cluster.
Hive is slower than Pig.
Pig is comparatively faster than Hive.
Hive Architecture

Hive Client
Hive lets you write applications in different languages, including Java, Python, and C++. It supports:
Thrift Server-A cross-language service provider platform that serves requests from all programming languages that support Thrift.
JDBC Driver-Used to connect Java applications to Hive. The JDBC driver is present in the class org.apache.hadoop.hive.jdbc.HiveDriver.
ODBC Driver-Allows applications that support the ODBC protocol to connect to Hive.

Hive Services
Hive CLI-The Hive CLI (Command Line Interface) is a shell where Hive
queries and commands are executed.
Hive Web User Interface — Hive Web UI is just an alternative to Hive CLI.
It offers a web-based GUI for Hive queries and commands to be executed.
Hive MetaStore-It is a central repository that stores all the information about
the structure of different tables and partitions within the warehouse. It also
includes column metadata and information about its type, the serializers and
deserializers that are used to read and write data and the corresponding HDFS
files where the data is stored.
Hive Server-It is known as the Apache Thrift Server. It accepts requests from various clients and delivers them to the Hive Driver.
Hive Driver-It receives queries from different sources such as the web UI, CLI, Thrift, and JDBC/ODBC drivers. It transfers the queries to the compiler.
Hive Compiler-The compiler's purpose is to parse the query and perform semantic analysis on the different query blocks and expressions. It transforms HiveQL statements into MapReduce jobs.
Hive Execution Engine-The optimizer generates the logical plan in the form of a DAG of map-reduce tasks and HDFS tasks. The execution engine then executes these tasks in the order of their dependencies.
HIVE Data Types
Hive data types are classified into numeric types, string types, date/time types, misc types, and complex types. Below is a list of the Hive data types.

Date/Time Types
TIMESTAMP
It supports the traditional UNIX timestamp with optional nanosecond precision.
As an integer numeric type, it is interpreted as a UNIX timestamp in seconds.
As a floating-point numeric type, it is interpreted as a UNIX timestamp in seconds with decimal precision.
As a string, it follows the java.sql.Timestamp format "YYYY-MM-DD HH:MM:SS.fffffffff" (precision of 9 decimal places).

DATES
The Date value specifies a particular year, month, and day in the form YYYY-MM-DD. It does not, however, include the time of day. The range of the Date type is 0000-01-01 to 9999-12-31.

String Types
STRING
A string is a sequence of characters. It may be enclosed within single quotes (') or double quotes (").

VARCHAR
The varchar is a variable-length type whose range lies between 1 and 65535, which specifies the maximum number of characters allowed in the string.

CHAR
The char is a fixed-length type whose maximum length is fixed at 255.

Complex Types
Type-Example-Description
Struct-struct('James','Roy')-It is similar to a C struct or an object; the "dot" notation is used to access the fields.
Map-map('first','James','last','Roy')-It contains key-value tuples; array notation is used to access the fields.
Array-array('James','Roy')-It is a collection of values of a similar type that are indexable using zero-based integers.
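Rough Python analogues of these complex types may help build intuition (an illustration only, not Hive syntax): a struct behaves like a namedtuple, a map like a dict, and an array like a list:

```python
from collections import namedtuple

# struct('James','Roy') -> fields accessed with "dot" notation.
Name = namedtuple("Name", ["first", "last"])
s = Name("James", "Roy")
print(s.first)  # James

# map('first','James','last','Roy') -> key-value lookup with bracket notation.
m = {"first": "James", "last": "Roy"}
print(m["last"])  # Roy

# array('James','Roy') -> zero-based integer indexing.
a = ["James", "Roy"]
print(a[0])  # James
```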
Hive - Create Database
In Hive, a database is regarded as a catalog or namespace of tables. So, we can keep multiple tables within a database, where each table is assigned a unique name. Hive also provides a default database, named default.
Initially, we'll check Hive's default database. Use the following command to check the list of existing databases:
hive> show databases;

Here we can see the default database that Hive provides.


Let's create a new database using the following command:
hive> create database demo;

So, it creates a new database.


Let's check the newly created database:
hive> show databases;

To suppress the warning Hive generates when creating a database whose name already exists, use the command below:
hive> create database if not exists demo;
Hive also allows assigning key-value pair properties to the database:
hive> create database demo
>WITH DBPROPERTIES ('creator' = 'Gaurav Chawla', 'date' = '2019-06-03');
Let's retrieve the information related to the database:
hive> describe database extended demo;
Hive - Drop Database
Let's check the list of existing databases with the command:-
hive> show databases;

Now, drop the database using the command below.


hive> drop database demo;

Let's check whether the database has been dropped.


hive> show databases;

In Hive, a database that contains tables cannot be dropped directly.
In such a case, either drop the tables first, or use the Cascade keyword with the command.
Let's see the Cascade command used to drop the database:
hive> drop database if exists demo cascade;
This command automatically drops the tables present in the database first.
HiveQL - Operators
The HiveQL operators facilitate various arithmetic and relational operations. Here, we will execute such operations on the records of the table below:

Example of Operators
Let's create a table and load data into it with the following steps:
Select the database in which we wish to create the table.
hive> use hql;
Create a Hive table using the following command:
hive> create table employee (Id int, Name string , Salary float)
row format delimited
fields terminated by ',' ;
Load the data into the table.
hive> load data local inpath '/home/codegyani/hive/emp_data' into table
employee;
Fetch the loaded data using the following command:
hive> select * from employee;
Arithmetic Operators
The arithmetic operators in Hive accept any numeric type. The commonly used arithmetic operators are:

Operators-Description
A+B-This is used to add A and B.
A-B-This is used to subtract B from A.
A*B-This is used to multiply A and B.
A/B-This is used to divide A by B and returns the quotient of the operands.
A%B-This returns the remainder of A / B.
A|B-This is used to determine the bitwise OR of A and B.
A&B-This is used to determine the bitwise AND of A and B.
A^B-This is used to determine the bitwise XOR of A and B.
~A-This is used to determine the bitwise NOT of A.
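Python happens to share the same arithmetic and bitwise operator syntax, so the table can be checked directly with toy values (this is plain Python, not Hive output):

```python
a, b = 10, 3

print(a + b)   # 13
print(a - b)   # 7
print(a * b)   # 30
print(a % b)   # 1   (remainder of A / B)
print(a | b)   # 11  (bitwise OR:  1010 | 0011 = 1011)
print(a & b)   # 2   (bitwise AND: 1010 & 0011 = 0010)
print(a ^ b)   # 9   (bitwise XOR: 1010 ^ 0011 = 1001)
print(~a)      # -11 (bitwise NOT)
```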

Relational Operators
In Hive, the relational operators are generally used with clauses like Join and Having to compare existing records. The commonly used relational operators are:
Operator-Description
A=B-It returns true if A equals B, otherwise false.
A<>B, A!=B-It returns null if A or B is null; true if A is not equal to B, otherwise false.
A<B-It returns null if A or B is null; true if A is less than B, otherwise false.
A>B-It returns null if A or B is null; true if A is greater than B, otherwise false.
A<=B-It returns null if A or B is null; true if A is less than or equal to B, otherwise false.
A>=B-It returns null if A or B is null; true if A is greater than or equal to B, otherwise false.
A IS NULL-It returns true if A evaluates to null, otherwise false.
A IS NOT NULL-It returns false if A evaluates to null, otherwise true.
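The NULL handling in these comparisons can be sketched in Python by modelling NULL as None. This is a simplified three-valued-logic illustration, with hypothetical helper names:

```python
def hive_eq(a, b):
    # A comparison involving NULL yields NULL (None), not False.
    if a is None or b is None:
        return None
    return a == b

def is_null(a):
    # IS NULL / IS NOT NULL always yield true or false, never NULL.
    return a is None

print(hive_eq(5, 5))     # True
print(hive_eq(5, None))  # None (NULL, not False)
print(is_null(None))     # True
```

The key point is that ordinary comparisons propagate NULL, while IS NULL and IS NOT NULL are the only operators that turn a NULL into a definite true or false.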
HiveQL - Functions
Hive provides various built-in functions to carry out mathematical and aggregate operations.

Mathematical Functions
Return type-Function-Description
BIGINT-round(num)-It returns the BIGINT for the rounded value of DOUBLE num.
BIGINT-floor(num)-It returns the largest BIGINT that is less than or equal to num.
BIGINT-ceil(num), ceiling(DOUBLE num)-It returns the smallest BIGINT that is greater than or equal to num.
DOUBLE-exp(num)-It returns the exponential of num.
DOUBLE-ln(num)-It returns the natural logarithm of num.
DOUBLE-log10(num)-It returns the base-10 logarithm of num.
DOUBLE-sqrt(num)-It returns the square root of num.
DOUBLE-abs(num)-It returns the absolute value of num.
DOUBLE-sin(d)-It returns the sine of d, in radians.
DOUBLE-asin(d)-It returns the arcsine of d, in radians.
DOUBLE-cos(d)-It returns the cosine of d, in radians.
DOUBLE-acos(d)-It returns the arccosine of d, in radians.
DOUBLE-tan(d)-It returns the tangent of d, in radians.
DOUBLE-atan(d)-It returns the arctangent of d, in radians.

Aggregate Functions
In Hive, an aggregate function returns a single value computed over many rows. Let's see some commonly used aggregate functions:

Return Type-Operator-Description
BIGINT-count(*)-It returns the count of the number of rows present in the file.
DOUBLE-sum(col)-It returns the sum of the values.
DOUBLE-sum(DISTINCT col)-It returns the sum of the distinct values.
DOUBLE-avg(col)-It returns the average of the values.
DOUBLE-avg(DISTINCT col)-It returns the average of the distinct values.
DOUBLE-min(col)-It compares the values and returns the minimum.
DOUBLE-max(col)-It compares the values and returns the maximum.
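These aggregates collapse many rows into a single value; a quick Python sketch over a toy column of values shows the idea (illustrative only, including the DISTINCT variants):

```python
col = [10.0, 20.0, 20.0, 40.0]

print(len(col))             # count: 4
print(sum(col))             # sum: 90.0
print(sum(set(col)))        # sum(DISTINCT): 70.0 (the duplicate 20.0 counted once)
print(sum(col) / len(col))  # avg: 22.5
print(min(col), max(col))   # min/max: 10.0 40.0
```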

Other built-in Functions


Return Type-Operator-Description
INT-length(str)-It returns the length of the string.
STRING-reverse(str)-It returns the string in reverse order.
STRING-concat(str1, str2, ...)-It returns the concatenation of two or more strings.
STRING-substr(str, start_index)-It returns the substring of the string based on the provided starting index.
STRING-substr(str, int start, int length)-It returns the substring of the string based on the provided starting index and length.
STRING-upper(str)-It returns the string in uppercase.
STRING-lower(str)-It returns the string in lowercase.
STRING-trim(str)-It returns the string with whitespace removed from both ends.
STRING-ltrim(str)-It returns the string with whitespace removed from the left-hand side.
STRING-rtrim(str)-It returns the string with whitespace removed from the right-hand side.
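Most of these string functions have direct Python counterparts, which makes for a quick sanity check on a toy input (note that Hive's substr uses 1-based indexing, while Python slicing is 0-based):

```python
s = "  Hadoop  "

print(len("Hadoop"))     # length('Hadoop') -> 6
print("Hadoop"[::-1])    # reverse -> 'poodaH'
print("Ha" + "doop")     # concat -> 'Hadoop'
print("Hadoop"[2:])      # substr('Hadoop', 3) in Hive (1-based) -> 'doop'
print("Hadoop".upper())  # upper -> 'HADOOP'
print("Hadoop".lower())  # lower -> 'hadoop'
print(s.strip())         # trim -> 'Hadoop'
print(s.lstrip())        # ltrim -> 'Hadoop  '
print(s.rstrip())        # rtrim -> '  Hadoop'
```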
