BASICS
Identifier Naming
Variables are an example of identifiers. An identifier is used to identify the literals used in the program. The rules for naming an identifier are set out below.
The first character of an identifier must be a letter or an underscore (_).
All characters except the first can be lower-case letters (a-z), upper-case letters (A-Z), underscores, or digits (0-9).
The name of the identifier must not contain any white space or special characters (!, @, #, %, ^, &, *).
The name of the identifier must not be a keyword of the language.
Identifier names are case sensitive; for example, myname and MyName are not identical.
Declaring Variable and Assigning Values
Python does not require us to declare a variable before using it in a program; it allows us to create a variable at the moment it is needed.
We do not need to declare a variable explicitly in Python. A variable is declared automatically when we assign a value to it.
The equals operator (=) is used to assign a value to a variable.
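For example, a variable comes into existence the moment a value is assigned to it:
EXAMPLE
x = 10
name = "Tom"
print(x)
print(name)
OUTPUT
10
Tom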
Object References
When we declare a variable, it is useful to understand how the Python interpreter works, since Python treats variables slightly differently from many other programming languages.
Python is a highly object-oriented programming language; therefore, every data item belongs to a specific class. Consider the example below.
print("John")
OUTPUT
John
Python creates a string object and displays it on the console. In the preceding print statement, we created a string object. Let's check its type using Python's built-in type() function.
type("John")
OUTPUT
<class 'str'>
Object Identity
In Python, every object created has a unique identity. Python guarantees that no two objects have the same identifier. The built-in id() function is used to obtain the identifier of an object. Consider the example below.
a = 50
b = a
print(id(a))
print(id(b))
# Reassigned variable a
a = 500
print(id(a))
Output:
140734982691168
140734982691168
2822056960944
When we assigned b = a, both a and b pointed to the same object, so the id() function returned the same number for both. When we reassigned a to 500, a referenced a new object with a different identifier.
Variable Names
We've already discussed how to declare valid variables. Variable names can contain upper-case and lower-case letters (A to Z, a to z), digits (0-9), and the underscore (_).
Consider the following example of names for the valid variables.
name = "Devansh"
age = 20
marks = 80.50
print(name)
print(age)
print(marks)
OUTPUT
Devansh
20
80.5
Consider the following example
name = "A"
Name = "B"
naMe = "C"
NAME = "D"
n_a_m_e = "E"
_name = "F"
name_ = "G"
_name_ = "H"
na56me = "I"
Multiple Assignment
Python allows us to assign a value to multiple variables in a single statement, which is also known as multiple assignment.
Multiple assignments can be applied in two ways, either by assigning one
value to multiple variables or by assigning multiple values to multiple
variables. Consider example below.
Assigning single value to multiple variables
x=y=z=50
print(x)
print(y)
print(z)
Output
50
50
50
Assigning multiple values to multiple variables:
a,b,c=5,10,15
print(a)
print(b)
print(c)
OUTPUT
5
10
15
Python Data Types
Variables can hold values, and every value has a data type. Python is a dynamically typed language; therefore, we do not need to define the type of a variable when declaring it. The interpreter implicitly binds the value to its type.
a=5
The variable a holds the integer value five, and we have not defined its type. The Python interpreter will automatically interpret the variable as an integer.
Python allows us to check what type of variable the program uses. Python
gives us the type() function, which returns the type of the passed variable.
Consider the example below to define the values of various types of data and
to check their type.
a=10
b="Hi Python"
c = 10.5
print(type(a))
print(type(b))
print(type(c))
OUTPUT
<class 'int'>
<class 'str'>
<class 'float'>
Standard data types
A variable can hold a variety of values. For instance, the name of a person
has to be stored as a string while its Id has to be stored as an integer.
Python provides different standard data types on each of them which define
the storage method. Below you will find the data types defined in Python.
1. Numbers
2. Sequence Type
3. Boolean
4. Set
5. Dictionary
Numbers
Numbers store numeric values. Integer, float, and complex values belong to the Python Numbers data type. Python provides the type() function to know the data type of a variable. Similarly, the isinstance() function is used to check whether an object belongs to a particular class.
Python creates Number objects when a variable has a number assigned. For
instance;
a=5
print("The type of a", type(a))
b = 40.5
print("The type of b", type(b))
c = 1+3j
print("The type of c", type(c))
print(" c is a complex number", isinstance(1+3j,complex))
OUTPUT
The type of a <class 'int'>
The type of b <class 'float'>
The type of c <class 'complex'>
c is a complex number True
Sequence Type
String
A string can be defined as a sequence of characters represented in quotation marks. In Python, we can define a string using single, double, or triple quotes.
String handling in Python is a straightforward task, since Python provides built-in functions and operators to perform string operations.
In string handling, the + operator is used to concatenate two strings; for example, the operation "hello" + " python" returns "hello python".
The example below illustrates Python string.
EXAMPLE 1
str = "string using double quotes"
print(str)
s = '''A multiline
string'''
print(s)
OUTPUT
string using double quotes
A multiline
string
EXAMPLE 2
str1 = 'hello learnpython' #string str1
str2 = ' how are you' #string str2
print (str1[0:2]) #printing the first two characters using the slice operator
print (str1[4]) #printing the character at index 4
print (str1*2) #printing the string twice
print (str1 + str2) #printing the concatenation of str1 and str2
OUTPUT
he
o
hello learnpythonhello learnpython
hello learnpython how are you
List
Python lists are similar to arrays in C. However, a list can contain data of different types. The items stored in a list are separated by commas (,) and enclosed within square brackets [].
To access list data, we may use slice [:] operators. The concatenation operator
(+) and the repeat operator (*) works with the list in the same way as the
strings worked.
EXAMPLE
list1 = [1, "hi", "Python", 2]
#Checking type of given list
print(type(list1))
# List slicing
print (list1[3:])
# List slicing
print (list1[0:2])
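Running the list example above prints the type of the list followed by the two slices:
OUTPUT
<class 'list'>
[2]
[1, 'hi']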
Tuple
In many ways a tuple is similar to a list. Like lists, tuples also include the
collection of items of various data types. Tuple items are separated by a
comma(,) and bound in parentheses().
A tuple is a read-only data structure; we cannot change the size or the value of the items in a tuple.
EXAMPLE
tup = ("hi", "Python", 2)
# Checking type of tup
print (type(tup))
# Tuple slicing
print (tup[1:])
print (tup[0:1])
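Running the tuple example above prints:
OUTPUT
<class 'tuple'>
('Python', 2)
('hi',)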
Dictionary
A dictionary is an unordered set of key-value pairs. It is like an associative array or a hash table where a specific value is stored against each key. A key can be any immutable (hashable) type, whereas a value can be an arbitrary Python object.
The dictionary items are separated by the comma(,) and enclosed within the
curly braces{}.
EXAMPLE
d = {1:'Jimmy', 2:'Alex', 3:'john', 4:'mike'}
# Printing dictionary
print (d)
print (d.keys())
print (d.values())
OUTPUT
{1: 'Jimmy', 2: 'Alex', 3: 'john', 4: 'mike'}
dict_keys([1, 2, 3, 4])
dict_values(['Jimmy', 'Alex', 'john', 'mike'])
Boolean
The Boolean type provides two built-in values, True and False. These values are used to determine whether a given statement is true or false, and they are denoted by the class bool. Any non-zero value is treated as true, whereas 0 and empty values are treated as false.
EXAMPLE
# Python program to check the boolean type
print(type(True))
print(type(False))
print(false)
OUTPUT
<class 'bool'>
<class 'bool'>
NameError: name 'false' is not defined
SET
A Python set is an unordered collection data type. It is iterable, mutable (it can change after it has been created), and has unique elements. The order of the elements in a set is undefined; they may be returned in a changed sequence. A set is created by using the built-in set() function or by placing a sequence of elements inside curly braces, separated by commas. It can contain values of different types.
EXAMPLE
# Creating an empty set and adding an element
set1 = set()
set1.add(10)
print(set1)
OUTPUT
{10}
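A set also discards duplicate elements automatically; a small illustration:
EXAMPLE
set3 = {1, 2, 2, 3, 3, 3}
print(set3)
OUTPUT
{1, 2, 3}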
Assert
This keyword is used in Python as a debugging tool. It checks the correctness of code. It raises an AssertionError if an error is found in the code, and it can also print an error message.
EXAMPLE
a = 10
b = 0
print('a is dividing by Zero')
assert b != 0, "Divide by 0 error"
print(a / b)
OUTPUT
a is dividing by Zero
Runtime Exception:
Traceback (most recent call last):
File "/home/40545678b342ce3b70beb1224bed345f.py", line 4, in
assert b != 0, "Divide by 0 error"
AssertionError: Divide by 0 error
def
This keyword is used in Python to declare a function.
def my_func(a, b):
    c = a + b
    print(c)
my_func(10,20)
OUTPUT
30
Class
In Python, the class keyword is used to define a class. A class is a blueprint for objects; it is a collection of variables and methods. Consider the skeleton below.
class Myclass:
    #Variables……..
    def function_name(self):
        #statements………
continue
It is used to skip the remaining part of the current iteration and move on to the next one. Consider the example below.
a = 0
while a < 4:
    a += 1
    if a == 2:
        continue
    print(a)
OUTPUT
1
3
4
Break
It is used to terminate execution of the loop and transfer control to the end of
the loop. Consider example below.
for i in range(5):
    if (i == 3):
        break
    print(i)
print("End of execution")
OUTPUT
0
1
2
End of execution
Elif
This keyword is used to check multiple conditions. If the previous condition is false, the next condition is checked until a true condition is found.
marks = int(input("Enter the marks:"))
if (marks >= 90):
    print("Excellent")
elif (marks < 90 and marks >= 75):
    print("Very Good")
elif (marks < 75 and marks >= 60):
    print("Good")
else:
    print("Average")
Python Literals
Python Literals can be defined as data given in a constant or a variable.
Python supports the following literals:
1. String literals:
You may form string literals by enclosing a text in the quotes. To create a
string, we may use both single and double quotes.
Example:
"Aman" , '12345'
Types of Strings:
Python supports two types of Strings:
a) Single-line strings: strings that end in a single line are known as single-line strings.
Example:
text1='hello'
b) Multi-line strings: a piece of text written across several lines is called a multi-line string.
Multi-line strings can be formed in two ways:
1) By adding a backslash at the end of each line.
Example:
text1='hello\
user'
print(text1)
2) Using triple quotation marks:-
Example:
str2='''welcome
to
SSSIT'''
print(str2)
OUTPUT
welcome
to
SSSIT
2. Numeric literals:
Numeric literals are immutable (unchangeable).
Example - Numeric Literals
x = 0b10100 #Binary Literals
y = 100 #Decimal Literal
z = 0o215 #Octal Literal
u = 0x12d #Hexadecimal Literal
#Float Literal
float_1 = 100.5
float_2 = 1.5e2
#Complex Literal
a = 5+3.14j
print(x, y, z, u)
print(float_1, float_2)
print(a, a.imag, a.real)
OUTPUT
20 100 141 301
100.5 150.0
(5+3.14j) 3.14 5.0
3. Boolean literals:
A Boolean literal can have one of these two values: True or False.
Example - Boolean Literals
x = (1 == True)
y = (2 == False)
z = (3 == True)
a = True + 10
b = False + 10
print("x is", x)
print("y is", y)
print("z is", z)
print("a:", a)
print("b:", b)
OUTPUT
x is True
y is False
z is False
a: 11
b: 10
4. Special literals
Python has one special literal: None.
None is used to specify that a field has not been created; it represents the absence of a value. It is also used for the end of lists in Python.
Example - Special Literals
val1=10
val2=None
print(val1)
print(val2)
OUTPUT
10
None
5. Literal Collections
Python provides the literal collection of four types, such as List literals, Tuple
literals, Dict literals, and Set literals.
LIST
A list can include items of different data types. Lists are mutable, i.e., modifiable.
Example - List literals
list=['John',678,20.4,'Peter']
list1=[456,'Andrew']
print(list)
print(list + list1)
output
['John', 678, 20.4, 'Peter']
['John', 678, 20.4, 'Peter', 456, 'Andrew']
Dictionary:
A Python dictionary stores data as key-value pairs. It is enclosed in curly braces {}, and each pair is separated by commas (,).
Example
dict = {'name': 'Pater', 'Age':18,'Roll_nu':101}
print(dict)
Output:
{'name': 'Pater', 'Age': 18, 'Roll_nu': 101}
Arithmetic Operators
Operator Description
+ (Addition) It is used to add two operands. For example, if a = 20, b = 10 => a + b = 30
- (Subtraction) It is used to subtract the second operand from the first. If the first operand is less than the second, the result is negative. For example, if a = 20, b = 10 => a - b = 10
/ (Division) It returns the quotient after dividing the first operand by the second. For example, if a = 20, b = 10 => a / b = 2.0
* (Multiplication) It is used to multiply one operand by the other. For example, if a = 20, b = 10 => a * b = 200
% (Remainder) It returns the remainder after dividing the first operand by the second. For example, if a = 20, b = 10 => a % b = 0
// (Floor division) It gives the floor value of the quotient produced by dividing the two operands. For example, if a = 20, b = 10 => a // b = 2
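A short script demonstrates the arithmetic operators with a = 20 and b = 10, as in the table:
EXAMPLE
a = 20
b = 10
print(a + b)   # 30
print(a - b)   # 10
print(a / b)   # 2.0
print(a * b)   # 200
print(a % b)   # 0
print(a // b)  # 2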
Comparison Operators
Comparison operators are used to compare the values of two operands and return Boolean True or False accordingly. The comparison operators are listed in the table below.
Operator Description
== If the value of two operands is equal, then the condition becomes true.
!= If the value of two operands is not equal, then the condition becomes true.
<= If the first operand is less than or equal to the second operand, then the
condition becomes true.
>= If the first operand is greater than or equal to the second operand, then the
condition becomes true.
> If the first operand is greater than the second operand, then the condition
becomes true.
< If the first operand is less than the second operand, then the condition
becomes true.
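The comparison operators can be demonstrated with the same operand values:
EXAMPLE
a = 20
b = 10
print(a == b)  # False
print(a != b)  # True
print(a <= b)  # False
print(a >= b)  # True
print(a > b)   # True
print(a < b)   # False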
Assignment Operators
The assignment operators are used to assign the value of the right expression to the left operand. The assignment operators are listed in the table below.
Operator Description
+= It adds the value of the right operand to the left operand and assigns the result back to the left operand. For example, if a = 10, b = 20 => a += b is equal to a = a + b and therefore a = 30.
-= It subtracts the value of the right operand from the left operand and assigns the result back to the left operand. For example, if a = 20, b = 10 => a -= b is equal to a = a - b and therefore a = 10.
*= It multiplies the left operand by the value of the right operand and assigns the result back to the left operand. For example, if a = 10, b = 20 => a *= b is equal to a = a * b and therefore a = 200.
%= It divides the left operand by the value of the right operand and assigns the remainder back to the left operand. For example, if a = 20, b = 10 => a %= b is equal to a = a % b and therefore a = 0.
**= a **= b is equal to a = a ** b; for example, if a = 4, b = 2, a **= b will assign 4 ** 2 = 16 to a.
//= a //= b is equal to a = a // b; for example, if a = 4, b = 3, a //= b will assign 4 // 3 = 1 to a.
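A short script walks through a few of the assignment operators step by step:
EXAMPLE
a = 10
a += 20   # a = a + 20
print(a)  # 30
a -= 5    # a = a - 5
print(a)  # 25
a *= 2    # a = a * 2
print(a)  # 50
a %= 7    # a = a % 7
print(a)  # 1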
Bitwise Operators
The bitwise operators work bit by bit on the values of the two operands. Find
scenario below.
if a = 7
b = 6
then, binary (a) = 0111
binary (b) = 0110
Operator Description
& (binary and) If both the bits at the same place in two operands are 1, then 1 is copied to
the result. Otherwise, 0 is copied.
| (binary or) The resulting bit will be 0 if both the bits are zero; otherwise, the resulting
bit will be 1.
^ (binary xor) The resulting bit will be 1 if both the bits are different; otherwise, the
resulting bit will be 0.
~ (negation) It calculates the negation of each bit of the operand, i.e., if the bit is 0, the
resulting bit will be 1 and vice versa.
<< (left shift) The left operand value is moved left by the number of bits present in the
right operand.
>> (right shift) The left operand is moved right by the number of bits present in the right
operand.
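Using a = 7 and b = 6 from the scenario above, the bitwise operators give:
EXAMPLE
a = 7  # 0111
b = 6  # 0110
print(a & b)   # 6  (0110)
print(a | b)   # 7  (0111)
print(a ^ b)   # 1  (0001)
print(~a)      # -8 (two's complement negation)
print(a << 1)  # 14
print(a >> 1)  # 3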
Logical Operators
The logical operators are used primarily for making a decision in the
expression evaluation. Python supports the logical operators which follow.
Operator Description
and If both the expressions are true, then the condition will be true. If a and b are
the two expressions, a → true, b → true => a and b → true.
or If one of the expressions is true, then the condition will be true. If a and b are
the two expressions, a → true, b → false => a or b → true.
not If an expression a is true, then not (a) will be false and vice versa.
Membership Operators
Within a Python data structure, the membership operators are used to verify a value's membership. The result is true if the value is present in the data structure; otherwise it is false.
Operator Description
in It is evaluated to be true if the first operand is found in the second operand (list, tuple, or dictionary).
not in It is evaluated to be true if the first operand is not found in the second operand (list, tuple, or dictionary).
Identity Operators
Operator Description
is It is evaluated to be true if the references on both sides point to the same object.
is not It is evaluated to be true if the references on both sides do not point to the same object.
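The membership and identity operators can be demonstrated together:
EXAMPLE
list1 = [1, 2, 3]
print(2 in list1)      # True
print(5 not in list1)  # True
x = [1, 2]
y = [1, 2]
print(x == y)          # True: equal values
print(x is y)          # False: two distinct objects
print(x is not y)      # True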
Operator Precedence
It is important to know the precedence of the operators, as it tells us which operator is evaluated first. The precedence table of the operators in Python is shown below.
Operator Description
** The exponent operator is given priority over all the other operators used in the expression.
<= < > >= Comparison operators (less than, less than or equal to, greater than, greater than or equal to).
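For instance, the exponent operator binds more tightly than multiplication, which in turn binds more tightly than addition:
EXAMPLE
print(2 ** 3 * 2)    # 16: 2 ** 3 is evaluated first
print(10 + 20 * 30)  # 610: 20 * 30 is evaluated first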
Statement Description
If Statement The if statement is used to test a specific condition. If the condition is true, a block of code (the if-block) is executed.
If-else Statement The if-else statement is similar to the if statement except that it also provides the block of code for the false case of the condition. If the condition provided in the if statement is false, the else statement is executed.
Indentation in Python
For ease of programming and to achieve simplicity, Python does not require the use of parentheses for block-level code. In Python, indentation is used to declare a block. If two statements are at the same indentation level, they are part of the same block.
In general, four spaces are given to indent the statements, which is the standard amount of indentation in Python.
Indentation is one of the most used aspects of the Python language, since it defines the block of code. All the statements of one block are meant to be at the same level of indentation. We will see how indentation is actually used in decision-making and other constructs in Python.
The if statement
The if statement is used to test a particular condition and executes a block of
code when the condition is true.
SYNTAX
if expression:
    statement
EXAMPLE 1
num = int(input("enter the number?"))
if num%2 == 0:
    print("Number is even")
OUTPUT
enter the number?10
Number is even
Example 2 : Program to print the largest of the three numbers.
a = int(input("Enter a? "));
b = int(input("Enter b? "));
c = int(input("Enter c? "));
if a>b and a>c:
    print("a is largest")
if b>a and b>c:
    print("b is largest")
if c>a and c>b:
    print("c is largest")
OUTPUT
Enter a? 100
Enter b? 120
Enter c? 130
c is largest
The if-else statement
The if-else statement provides an else block combined with the if statement, which is executed in the false case of the condition. If the condition is true, the if-block is executed; otherwise, the else-block is executed.
SYNTAX
if condition:
    #block of statements
else:
    #another block of statements (else-block)
Example 1 : Program to check whether a person is eligible to
vote or not.
age = int (input("Enter your age? "))
if age>=18:
    print("You are eligible to vote !!")
else:
    print("Sorry! you have to wait !!")
OUTPUT
Enter your age? 90
You are eligible to vote !!
Example 2: Program to check whether a number is even or not.
num = int(input("enter the number?"))
if num%2 == 0:
    print("Number is even...")
else:
    print("Number is odd...")
OUTPUT
enter the number?10
Number is even...
The elif statement
The elif statement enables us to check multiple conditions and execute the block of statements that corresponds to the first true condition among them. The syntax of the elif statement is given below.
if expression 1:
    # block of statements
elif expression 2:
    # block of statements
elif expression 3:
    # block of statements
else:
    # block of statements
EXAMPLE
number = int(input("Enter the number?"))
if number==10:
    print("number is equals to 10")
elif number==50:
    print("number is equal to 50")
elif number==100:
    print("number is equal to 100")
else:
    print("number is not equal to 10, 50 or 100")
OUTPUT
Enter the number?15
number is not equal to 10, 50 or 100
EXAMPLE 2
marks = int(input("Enter the marks? "))
if marks > 85 and marks <= 100:
    print("Congrats ! you scored grade A ...")
elif marks > 60 and marks <= 85:
    print("You scored grade B + ...")
elif marks > 40 and marks <= 60:
    print("You scored grade B ...")
elif (marks > 30 and marks <= 40):
    print("You scored grade C ...")
else:
    print("Sorry you are fail ?")
Python Loops
By default, the flow of a program written in any programming language is sequential. Sometimes we may need to alter the program's flow, for example to execute a particular piece of code several times.
For this purpose, programming languages provide various types of loops, which are capable of repeating a specific piece of code several times. The table below describes how the loop statements work.
Loop Statement Description
for loop The for loop is used when we need to execute some part of the code until the given condition is fulfilled. The for loop is also known as a pre-tested loop. It is better to use the for loop if the number of iterations is known in advance.
while loop The while loop is used in the scenario where the number of iterations is not known in advance. In the while loop, the block of statements is executed until the condition specified in the while loop is satisfied. It is also called a pre-tested loop.
do-while loop The do-while loop continues until a given condition is satisfied. It is also called a post-tested loop. It is used when it is necessary to execute the loop at least once (mostly in menu-driven programs). Python itself has no do-while loop.
Python for loop
In Python, the for loop is used to iterate over the statements or a part of the program several times. It is frequently used to traverse data structures such as the list, tuple, or dictionary.
The syntax of the for loop in Python is given below.
for iterating_var in sequence:
    statement(s)
Example 1: printing a number pattern.
for i in range(1, 6):
    print(str(i) * i)
Output:
1
22
333
4444
55555
Python also allows an optional else block with the for loop, which runs when the loop finishes without hitting a break.
for i in range(0, 5):
    print(i)
else:
    print("for loop completely exhausted, since there is no break.")
Output:
0
1
2
3
4
for loop completely exhausted, since there is no break.
Example 2
for i in range(0, 5):
    print(i)
    break
else:
    print("for loop is exhausted")
print("The loop is broken due to break statement...came out of the loop")
Output:
0
The loop is broken due to break statement...came out of the loop
Python While loop
The Python while loop enables execution of a part of the code until the given condition returns false. It is also referred to as a pre-tested loop. The while loop is most efficient when the number of iterations is not known in advance.
The following syntax is given
while expression:
    statements
Example 1: printing the multiplication table of 10.
i = 1
number = 10
while i <= 10:
    print("%d X %d = %d" % (number, i, number * i))
    i += 1
Output:
10 X 1 = 10
10 X 2 = 20
10 X 3 = 30
10 X 4 = 40
10 X 5 = 50
10 X 6 = 60
10 X 7 = 70
10 X 8 = 80
10 X 9 = 90
10 X 10 = 100
Python break Statement
The break statement is used to terminate a loop as soon as a particular condition is met.
Example 1: searching a list for an item.
list = [1, 2, 3, 4]
count = 1
for i in list:
    if i == 4:
        print("item matched")
        count = count + 1
        break
print("found at", count, "location")
Output:
item matched
found at 2 location
EXAMPLE 2
str = "python"
for i in str:
    if i == 'o':
        break
    print(i)
Output:
p
y
t
h
Example 3: break statement with while loop
i = 0
while 1:
    print(i, " ", end="")
    i = i + 1
    if i == 10:
        break
print("came out of while loop")
Output:
0 1 2 3 4 5 6 7 8 9 came out of while loop
Example 4
n = 2
while 1:
    i = 1
    while i <= 10:
        print("%d X %d = %d" % (n, i, n * i))
        i = i + 1
    choice = int(input("Do you want to continue printing the table, press 0 for no?"))
    if choice == 0:
        break
    n = n + 1
OUTPUT
2 X 1 = 2
2 X 2 = 4
2 X 3 = 6
2 X 4 = 8
2 X 5 = 10
2 X 6 = 12
2 X 7 = 14
2 X 8 = 16
2 X 9 = 18
2 X 10 = 20
Do you want to continue printing the table, press 0 for no?
Python continue Statement
The continue statement in Python is used to bring the program control back to the beginning of the loop. It skips the remaining lines of code inside the loop and starts the next iteration. It is mostly used for a specific condition inside the loop, so that we can skip some particular code for that condition.
SYNTAX
#loop statements
continue
#the code to be skipped
Example 1
i = 0
while (i < 10):
    i = i + 1
    if (i == 5):
        continue
    print(i)
Output:
1
2
3
4
6
7
8
9
10
EXAMPLE
str = "LearnTPython"
for i in str:
    if (i == 'T'):
        continue
    print(i)
OUTPUT
L
e
a
r
n
P
y
t
h
o
n
Python String
A Python string is a sequence of characters enclosed in single, double, or triple quotes. The computer does not understand characters; internally, it stores the characters to be manipulated as a combination of 0s and 1s.
Each character is stored as its ASCII or Unicode value, so Python strings can also be described as sequences of Unicode characters.
Python enables us to build a string using single quotes, double quotes, or triple quotes.
Syntax:
str = "Hi Python !"
Example:
str1 = 'Hello Python'    #Using single quotes
print(str1)
str2 = "Hello Python"    #Using double quotes
print(str2)
str3 = '''Triple quotes are generally used to
represent the multiline or
docstring'''
print(str3)
OUTPUT
Hello Python
Hello Python
Triple quotes are generally used to
represent the multiline or
docstring
STRING OPERATORS
Operator Description
[:] It is known as range slice operator. It is used to access the characters from
the specified range.
not in It is also a membership operator and does the exact reverse of in. It returns
true if a particular substring is not present in the specified string.
r/R It is used to specify the raw string. Raw strings are used in the cases where
we need to print the actual meaning of escape characters such as
"C://python". To define any string as a raw string, the character r or R is
followed by the string.
EXAMPLE
Find the example below for understanding the practical use of Python
operators.
str = "Hello"
str1 = " world"
print(str*3) # prints HelloHelloHello
print(str+str1)# prints Hello world
print(str[4]) # prints o
print(str[2:4]); # prints ll
print('w' in str) # prints false as w is not present in str
print('wo' not in str1) # prints false as wo is present in str1.
print(r'C://python37') # prints C://python37 as it is written
print("The string str : %s"%(str)) # prints The string str : Hello
Output:
HelloHelloHello
Hello world
o
ll
False
False
C://python37
The string str : Hello
Python Tuple
A Python tuple is used to store a collection of immutable Python objects. A tuple is similar to a list, except that a list is mutable (the value of the items stored in it can be changed), whereas a tuple is immutable (the value of the items stored in it cannot be altered).
A tuple can be written as a collection of comma-separated (,) values enclosed in parentheses (). The parentheses are optional, but it is good practice to use them. A tuple can be defined as follows.
T1 = (101, "Peter", 22)
T2 = ("Apple", "Banana", "Orange")
T3 = 10,20,30,40,50
print(type(T1))
print(type(T2))
print(type(T3))
Output:
<class 'tuple'>
<class 'tuple'>
<class 'tuple'>
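Because a tuple is immutable, trying to modify one of its items raises an error:
EXAMPLE
t = (10, 20, 30)
t[0] = 50
OUTPUT
TypeError: 'tuple' object does not support item assignment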
A tuple is indexed in the same way as a list. The items in a tuple can be accessed by using their index values.
Example - 1
tuple1 = (10, 20, 30, 40, 50, 60)
print(tuple1)
count = 0
for i in tuple1:
    print("tuple1[%d] = %d" % (count, i))
    count = count + 1
Output:
(10, 20, 30, 40, 50, 60)
tuple1[0] = 10
tuple1[1] = 20
tuple1[2] = 30
tuple1[3] = 40
tuple1[4] = 50
tuple1[5] = 60
Example - 2
tuple1 = tuple(input("Enter the tuple elements ..."))
print(tuple1)
count = 0
for i in tuple1:
    print("tuple1[%d] = %s" % (count, i))
    count = count + 1
Output :
Enter the tuple elements ...123456
('1', '2', '3', '4', '5', '6')
tuple1[0] = 1
tuple1[1] = 2
tuple1[2] = 3
tuple1[3] = 4
tuple1[4] = 5
tuple1[5] = 6
Repetition The repetition operator enables the tuple elements to be repeated multiple times. T1*2 = (1, 2, 3, 4, 5, 1, 2, 3, 4, 5)
Iteration The for loop is used to iterate over the tuple elements.
for i in T1:
    print(i)
Output
1
2
3
4
5
cmp(tuple1, tuple2) It compares two tuples and returns true if tuple1 is greater than tuple2, otherwise false. (Available in Python 2 only; it was removed in Python 3.)
HADOOP INTRO
This book includes the basics of Big Data Hadoop: HDFS, MapReduce, YARN, Hive, HBase, Pig, Sqoop, etc.
Hadoop is an open-source framework provided by Apache for the processing and analysis of very large volumes of data.
It is written in Java and is used by Google, Facebook, LinkedIn, Yahoo, Twitter, and so on.
Big Data
Big Data is the name given to data which is very large in size.
We normally work on data of MB size (Word documents, Excel sheets) or at most GB size (movies, code), but data in petabytes, i.e. 10^15 bytes in size, is called Big Data.
It is stated that almost 90 percent of today's data has been generated in the last 3 years.
3V's
1. Variety: Data is no longer stored only in rows and columns. Data can be structured or unstructured. CCTV footage and log files are unstructured data; data that can be stored in tables, such as a bank's transaction data, is structured data.
2. Velocity: Data is growing at a very fast rate. The volume of data is estimated to double every 2 years.
3. Volume: The amount of data we deal with is very large, in petabytes.
Usage
An e-commerce site XYZ (with 100 million users) wants to offer some gift
voucher to its top 10 customers who have spent the most in the previous year.
In addition, they want to find these customers' buying trend so that the
company can suggest more items related to them.
Solving
Processing: The MapReduce paradigm is applied to the data distributed over the network to find the required output.
Storage: This huge amount of data is stored in HDFS (Hadoop Distributed File System), which uses commodity hardware to form clusters and store the data in a distributed manner. It works on the write-once, read-many-times principle.
Modules
YARN: Yet Another Resource Negotiator is used for job scheduling and managing the cluster.
HDFS: Google published its GFS paper, and HDFS was developed on the basis of it. It states that the files will be broken into blocks and stored in nodes over the distributed architecture. HDFS stands for Hadoop Distributed File System.
• Hadoop Common: the Java libraries that are used to start Hadoop and that are used by the other Hadoop modules.
• MapReduce: a framework that helps Java programs do the parallel computation on data using key-value pairs. The Map task takes input data and converts it into a data set that can be computed in key-value pairs. The output of the Map task is consumed by the Reduce task, and the output of the reducer gives the desired result.
Architecture
The Hadoop architecture is a package of the file system, the MapReduce engine, and HDFS (Hadoop Distributed File System).
The MapReduce engine can be either MapReduce/MR1 or YARN/MR2.
A Hadoop cluster consists of a single master and multiple slave nodes.
The master node includes the Job Tracker, Task Tracker, NameNode, and DataNode, whereas the slave node includes the DataNode and TaskTracker.
HDFS
It has a master/slave architecture.
This architecture consists of a single NameNode performing the master role and multiple DataNodes performing the slave role.
Both the NameNode and DataNode are capable of running on commodity machines. HDFS is developed using the Java language, so any machine that supports the Java language can easily run the NameNode and DataNode software.
DataNode
There are multiple DataNodes within the HDFS cluster.
Each DataNode is composed of several data blocks.
Those blocks of data are used to store data.
Upon instruction from the NameNode, it performs block creation, deletion, and replication.
It is the DataNode's responsibility to serve the read and write requests from the file system's clients.
NameNode
There is a single NameNode in the HDFS cluster. It is the master server that manages the file system namespace and regulates the clients' access to files. It stores the metadata of HDFS, such as the directory tree and the locations of the blocks, rather than the data itself.
Task Tracker
It works as a slave node for the Job Tracker.
It receives the task and code from the Job Tracker and applies that code to the file. This process can also be called a Mapper.
Job Tracker
The Job Tracker's role is to accept the MapReduce jobs from the client and process the data by using the NameNode.
In response, the NameNode provides metadata to the Job Tracker.
MapReduce Layer
The MapReduce layer comes into existence when the client application submits a MapReduce job to the Job Tracker. In response, the Job Tracker sends the request to the appropriate Task Trackers. Sometimes a TaskTracker fails or times out; in such a case, that portion of the job is rescheduled.
Advantages
Cost Effective: Hadoop is open source and uses commodity hardware to store data, so it is really cost-effective compared to a traditional relational database management system.
Fast: In HDFS, the data is distributed and mapped across the cluster, which helps in faster retrieval. Even the tools used to process the data are often on the same servers, which reduces the processing time. It can process terabytes of data in minutes, and petabytes in hours.
Scalable: A Hadoop cluster can be extended by simply adding nodes to the cluster.
History of Hadoop
Let's focus on the history of Hadoop in the following steps: -
Doug Cutting and Mike Cafarella began working on a project entitled
Apache Nutch in 2002. It is a software project with open source web
crawler.
While working on Apache Nutch, they were dealing with big data, and storing that data incurred huge costs. This problem became one of the significant reasons for the emergence of Hadoop.
In 2003 Google developed the GFS (Google File System) file system. It is
a proprietary distributed filesystem developed for efficient data access.
Google posted a white paper on Map Reduce in 2004. This technique
simplifies the processing of data on big clusters.
In 2005, Doug Cutting and Mike Cafarella introduced a new file system known as NDFS (Nutch Distributed File System). This file system also included MapReduce.
In 2006, Doug Cutting joined Yahoo. Based on the Nutch project, Cutting introduced a new project, Hadoop, with a file system known as HDFS (Hadoop Distributed File System). The first version of Hadoop, 0.1.0, was released in the same year.
Doug Cutting named his project Hadoop after his son's toy elephant.
Where to Use HDFS
Very Large Files: Files should be of hundreds of megabytes or more.
Streaming Data Access: The time to read the whole data set is more important than the latency in reading the first record. HDFS is built on the write-once and read-many-times pattern.
Commodity Hardware: It works on low-cost hardware.
HDFS Concepts
Blocks: A block is the minimum amount of data that it can read or write. HDFS blocks are 128 MB by default, and this is configurable. Files in HDFS are broken into block-sized chunks, which are stored as independent units. Unlike a regular file system, if a file in HDFS is smaller than the block size, it does not occupy the full block's worth of storage; i.e., a 5 MB file stored with a 128 MB HDFS block size takes only 5 MB of space. The HDFS block size is kept large only to minimize seek costs.
Data Node: They store and retrieve blocks when they are told to, by the client or the NameNode.
Being commodity hardware, the data node also performs the work of block creation, deletion, and replication as stated by the NameNode.
They report back to the NameNode periodically with the list of blocks that they store.
Recursive deleting:
hadoop fs -rmr <arg>
Example:
hadoop fs -rmr /user/sonoo/
copyFromLocal <localSrc> <dest>
Copies the file or directory from the local file system identified by localSrc to dest within the DFS. Identical to -put.
moveFromLocal <localSrc> <dest>
Copies the file or directory from the local file system identified by localSrc to dest within HDFS, and then deletes the local copy on success.
get [-crc] <src> <localDest>
Copies the file or directory in HDFS identified by src to the local file system path identified by localDest.
cat <filename>
Displays the contents of filename on stdout.
moveToLocal <src> <localDest>
Works like -get, but deletes the HDFS copy on success.
setrep [-R] [-w] rep <path>
Sets the target replication factor for files identified by path to rep. (The actual replication factor will move toward the target over time.)
touchz <path>
Creates a file at path containing the current time as a timestamp. Fails if a file already exists at path, unless the file is already size 0.
stat [format] <path>
Prints information about path. format is a string which accepts file size in blocks (%b), filename (%n), block size (%o), replication (%r), and modification date (%y, %Y).
Features and Goals Of HDFS
HDFS can handle applications that contain large data sets with ease.
Unlike other distributed file systems, HDFS is highly fault-tolerant and can be deployed on low-cost hardware.
Let's look at some of the significant features and goals of HDFS.
HDFS Features
Replication - Due to some unfavorable conditions, the node containing the data may be lost. So, to overcome such problems, HDFS always maintains a copy of the data on a different machine.
Highly Scalable - HDFS is highly scalable, as it can scale to hundreds of nodes in a single cluster.
Distributed data storage - This is one of the most important features of HDFS, which makes Hadoop very powerful. Here, data is divided into multiple blocks and stored in nodes.
Fault tolerance - In HDFS, fault tolerance means the robustness of the system in the event of failure. HDFS is highly fault-tolerant: if any machine fails, the other machine containing the copy of that data automatically becomes active.
HDFS Goals
Handling hardware failure - HDFS contains multiple server machines. If any machine fails, the goal of HDFS is to recover from it quickly.
Coherence model - Applications that run on HDFS follow the write-once, read-many approach. So a file, once created, need not be changed much; it can, however, be appended and truncated.
Access to streaming data - HDFS applications usually run on general-purpose file systems. These applications require streaming access to their data sets.
YARN
YARN (Yet Another Resource Negotiator) takes Hadoop beyond a single programming model and makes the cluster interactive, allowing other applications such as HBase and Spark to work on it.
Different YARN applications can co-exist on the same cluster, so MapReduce, HBase, and Spark can all run at the same time, bringing great benefits for manageability and cluster utilization.
Components Of YARN
Client: submits MapReduce jobs.
Resource Manager: manages the use of resources across the cluster.
Application Master: coordinates the tasks running the MapReduce job. The application master and the MapReduce tasks run in containers that are scheduled by the resource manager and managed by the node managers.
Node Manager: launches and monitors the compute containers on the machines in the cluster.
The Jobtracker & Tasktracker were used in the previous version of Hadoop, and were responsible for resource handling and checking progress. Hadoop 2.0, however, has the Resource Manager and NodeManager to overcome the shortfalls of the Jobtracker & Tasktracker.
Benefits
Multi-tenancy: Different versions of MapReduce can run on YARN, which makes the process of upgrading MapReduce more manageable.
Scalability: MapReduce 1 hits a scalability bottleneck at 4,000 nodes and 40,000 tasks, but YARN is designed for 10,000 nodes and 100,000 tasks.
Utilization: The Node Manager manages a pool of resources, rather than a fixed number of designated slots, thus increasing the utilization.
HADOOP MapReduce
MapReduce is a data processing tool which is used to process data in parallel in a distributed form. It was developed in 2004, on the basis of the paper titled "MapReduce: Simplified Data Processing on Large Clusters," published by Google.
MapReduce is a two-phase paradigm: the mapper phase and the reducer phase. The input to the Mapper is given in the form of key-value pairs. The output of the Mapper is fed to the reducer as input. The reducer runs only after the Mapper is over. The reducer also takes input in key-value format, and the output of the reducer is the final result.
Map function
The map function processes the incoming key-value pairs and generates the corresponding output key-value pairs.
The types of the map input and output may differ from each other.
Partition function
The partition function assigns the output of each map function to the appropriate reducer.
It is given the key and the value, and it returns the index of the reducer.
Reduce function
The Reduce function is applied once per unique key. The keys are already in sorted order.
The Reduce function iterates over the values associated with the key and generates the corresponding output.
Output writer
The output writer's role is to write the output of Reduce to stable storage.
The output writer executes once the data has flowed through all the phases above.
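As an illustration only, this whole flow can be imitated in a few lines of plain Python. The function names, input lines, and two-reducer setup here are invented for the sketch; this is not Hadoop's actual API:
# Map function: emit (character, 1) for every character of the input line.
def map_fn(line):
    return [(ch, 1) for ch in line]

# Partition function: choose a reducer index for a key (here, by code point).
def partition_fn(key, num_reducers):
    return ord(key) % num_reducers

# Reduce function: called once per unique key with all of its values.
def reduce_fn(key, values):
    return (key, sum(values))

lines = ["hadoop", "hive"]
num_reducers = 2
partitions = [{} for _ in range(num_reducers)]

# "Shuffle": route every mapped pair to its reducer and group values by key.
for line in lines:
    for key, value in map_fn(line):
        partitions[partition_fn(key, num_reducers)].setdefault(key, []).append(value)

# Each reducer handles its keys in sorted order; the output writer prints.
for partition in partitions:
    for key in sorted(partition):
        print(reduce_fn(key, partition[key]))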
MapReduce API
Here we learn about the classes and methods used in MapReduce programming; this section focuses on the MapReduce APIs.
In this example, we find the frequency of each character that exists in a text file.
Create a directory in HDFS where the text file is to be kept:
$ hdfs dfs -mkdir /count
Upload the info.txt file into that directory on HDFS:
$ hdfs dfs -put /home/codegyani/info.txt /count
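The job itself is typically written in Java, which is not shown here. As a sketch only, the same character count can be run with Hadoop Streaming and two small Python scripts; the script names, output path, and the location of the streaming jar are assumptions that depend on the installation.
mapper.py:
#!/usr/bin/env python3
# Emit "<character><TAB>1" for every character read from stdin.
import sys
for line in sys.stdin:
    for ch in line.strip():
        print("%s\t1" % ch)
reducer.py:
#!/usr/bin/env python3
# Sum the counts of each character; streaming delivers input sorted by key.
import sys
current, count = None, 0
for line in sys.stdin:
    key, value = line.strip().split("\t")
    if key != current:
        if current is not None:
            print("%s\t%d" % (current, count))
        current, count = key, 0
    count += int(value)
if current is not None:
    print("%s\t%d" % (current, count))
Run the job (the jar path is an assumption):
$ hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /count/info.txt -output /count_output \
    -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py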
HBase Use
An RDBMS slows down exponentially as the data gets big.
It expects highly structured data, i.e. data with the ability to fit into a well-defined schema.
Any change of schema might require downtime.
For sparse datasets, there is too much overhead of maintaining NULL values.
Characteristics of HBase
Automatic failover: Automatic failover is a feature that allows a system administrator to automatically switch data handling to a standby system in the event of a system failure.
Horizontally scalable: Columns can be added at any time.
Often referred to as a key-value store, a column-family-oriented database, or as storing versioned maps of maps.
Sparse, distributed, persistent and multidimensional sorted map, indexed
by rowkey, column key, and timestamp.
Integration with the Map/Reduce framework: Map/Reduce is implemented internally by the commands and Java code to do the task, and HBase is built on top of the Hadoop Distributed File System.
It does not enforce relationships inside your information.
Basically, it's a random-access data storage and retrieval platform.
It doesn't care about datatypes (storing an integer for the same column in
one row and a string in another).
It's designed to run on a computer cluster, built with commodity hardware.
HBase Read
An HBase read must be reconciled between the HFiles, the MemStore, and the BlockCache. The BlockCache is designed to keep frequently accessed data from the HFiles in memory so as to avoid disk reads. Each column family has its own BlockCache. The BlockCache holds data in the form of 'blocks', a block being the unit of data that HBase reads from disk in a single pass. This means that reading one block from HBase requires only looking up that block's location in the index and retrieving it from disk.
Block: It is the smallest indexed data unit and is the smallest data unit
readable from the disk. Size 64 KB by default.
A smaller block size is preferred when you perform random searches; however, having smaller blocks creates a bigger index and thus consumes more memory.
A larger block size is preferred when you frequently perform sequential scans; this enables you to save on memory, since larger blocks mean fewer index entries and therefore a smaller index.
Reading an HBase row requires checking the MemStore first, then the
BlockCache, and finally accessing HFiles on the disk.
HBase Write
By default, when a write is made it goes into two places:
write-ahead log (WAL), HLog, and
in-memory write buffer, MemStore.
HBase MemStore
During writes, clients do not interact directly with the underlying HFiles; instead, writes go in parallel to the WAL and the MemStore. Every write to HBase requires confirmation from both the WAL and the MemStore.
The MemStore is a write buffer where HBase accumulates data before a permanent write.
When the MemStore fills up, its contents are flushed to disk to form an HFile.
It does not write to an existing HFile; instead, it forms a new HFile on every flush.
The HFile is the underlying storage format of HBase.
HFiles belong to a column family (there is one MemStore per column family). A column family can have multiple HFiles, but the reverse is not true.
Jar Files
Make sure that the following jars are present while writing the code, as they are required by HBase.
a. commons-logging-1.0.4
b. commons-logging-api-1.0.4
c. hadoop-core-0.20.2-cdh3u2
d. hbase-0.90.4-cdh3u2
e. log4j-1.2.15
f. zookeeper-3.3.3-cdh3u0
Program Code
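A hedged sketch of the kind of put/get such a program performs can be written in Python using the third-party happybase library, which talks to HBase through its Thrift server. The host, port, table name, and column family below are invented for the example:
import happybase

# Connect to the HBase Thrift server (host and port are assumptions).
connection = happybase.Connection("localhost", 9090)

# Create a table with a single column family if it does not exist yet.
if b"employee" not in connection.tables():
    connection.create_table("employee", {"official": dict()})

table = connection.table("employee")

# Write one row, then read it back.
table.put(b"row1", {b"official:name": b"Peter", b"official:salary": b"25000"})
row = table.row(b"row1")
print(row[b"official:name"])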
Hive Intro
Apache Hive is a data warehouse system for Hadoop, created by Facebook, that is used to analyze structured data. It runs SQL-like queries called HQL (Hive Query Language), which are internally converted to MapReduce jobs.
Hive makes it possible to read, write, and manage massive datasets residing in distributed storage.
Using Hive, we can skip the requirement of the traditional approach of writing complex MapReduce programs. Hive supports Data Definition Language (DDL), User Defined Functions (UDF), and Data Manipulation Language (DML).
Features
Hive is fast and scalable.
It provides queries similar to SQL (i.e., HQL) which are implicitly
transformed into MapReduce or Spark jobs.
It can analyze large data sets which are stored in HDFS.
It allows various types of data, including plain text, RCFile, and HBase.
Uses indexing to speed up queries.
It can operate on compressed data that is stored in the Hadoop ecosystem.
It supports user-defined functions (UDFs), where users can provide their
features.
Limitations
Hive is not able to handle the data in real time.
It isn't intended for processing transactions online.
Hive queries have high latency.
Hive Vs Pig
Hive is generally used by data analysts, whereas Pig is generally used by programmers.
Hive uses a language similar to SQL (HQL), whereas Pig uses a data-flow language (Pig Latin).
Hive can handle structured data, whereas Pig can handle semi-structured data.
Hive operates on the server side of an HDFS cluster, whereas Pig operates on the client side.
Pig is comparatively faster than Hive.
Hive Architecture
Hive Client
Hive allows writing applications in various languages, including Java, Python, and C++. It supports the following types of client:
Thrift Server - a cross-language service provider platform that serves requests from all the programming languages that support Thrift.
JDBC Driver - used to establish a connection between Hive and Java applications. The JDBC driver is available in the class org.apache.hadoop.hive.jdbc.HiveDriver.
ODBC Driver - allows the applications that support the ODBC protocol to connect to Hive.
Hive Services
Hive CLI-The Hive CLI (Command Line Interface) is a shell where Hive
queries and commands are executed.
Hive Web User Interface - The Hive Web UI is an alternative to the Hive CLI. It provides a web-based GUI for executing Hive queries and commands.
Hive MetaStore-It is a central repository that stores all the information about
the structure of different tables and partitions within the warehouse. It also
includes column metadata and information about its type, the serializers and
deserializers that are used to read and write data and the corresponding HDFS
files where the data is stored.
Hive Server-is known as Apache Thrift Server. It accepts the request from
various clients and delivers it to Hive Driver.
Hive Driver-Receives requests from different sources such as a web UI, CLI,
Thrift, and JDBC / ODBC driver. The queries are transferred to the compiler.
Hive Compiler - The purpose of the compiler is to parse the query and perform semantic analysis on the different query blocks and expressions. It converts HiveQL statements into MapReduce jobs.
Hive Execution Engine - The optimizer produces the logical plan in the form of a DAG of map-reduce tasks and HDFS tasks. The execution engine then executes the incoming tasks in the order of their dependencies.
HIVE Data Types
Hive data types are classified into numeric types, string types, misc types, and complex types. A list of the Hive data types is given below.
Date/Time Types
TIMESTAMP
It supports the traditional UNIX timestamp with optional nanosecond precision.
As an integer numeric type, it is interpreted as a UNIX timestamp in seconds.
As a floating-point numeric type, it is interpreted as a UNIX timestamp in seconds with decimal precision.
As a string, it follows the java.sql.Timestamp format "YYYY-MM-DD HH:MM:SS.fffffffff" (9 decimal place precision).
DATES
The Date value is used to specify a particular year, month, and day, in the form YYYY-MM-DD. However, it does not provide the time of day. The range of the Date type lies between 0000-01-01 and 9999-12-31.
String Types
STRING
A string is a sequence of characters. Its values can be enclosed within single quotes (') or double quotes (").
VARCHAR
The varchar is a variable-length type whose length can range between 1 and 65535; this value specifies the maximum number of characters allowed in the character string.
CHAR
The char is a type with a fixed length whose maximum length is set at 255.
Drop Database
In Hive, a database that contains tables cannot be dropped directly. In such a case, either drop the tables first, or use the CASCADE keyword with the drop command.
Let's see the command Cascade used to drop the database:
hive> drop database if exists demo cascade;
This command first drops the tables present in the database automatically.
HiveQL - Operators
The HiveQL operators facilitate various arithmetic and relational operations. Here, we will execute such operations on the records of the employee table created below:
Example of Operators
Let's create a table and load the data with the following steps to it: -
Choose the database we wish to create a table in.
hive> use hql;
Use the following command to create a hive table;
hive> create table employee (Id int, Name string , Salary float)
row format delimited
fields terminated by ',' ;
Load the data into the table:
hive> load data local inpath '/home/codegyani/hive/emp_data' into table
employee;
Fetch the loaded data by using the following command: -
hive> select * from employee;
Arithmetic Operators
In Hive, the arithmetic operators accept any numeric type. The commonly used arithmetic operators are:
Operators Description
Relational Operators
In Hive, the relational operators are generally used to compare existing
records with clauses like Join and Having. The commonly used relational
operators are: -
Operator Description
Mathematical Functions
Return type Functions Description
Aggregate Functions
In Hive, the aggregate function returns a single value as a result of computation over many rows. Some commonly used aggregate functions are:
STRING substr(str, int start, int length) It returns the substring of the string based on the provided starting index and length.