CS 441 Handouts

These handouts provide an overview of a course focused on Big Data, covering its basic concepts, types of data, and the importance of data visualization and processing. They discuss structured, unstructured, and semi-structured data, as well as the challenges and storage needs associated with Big Data, introduce Python programming as a tool for handling Big Data, and outline the key technologies and requirements for effective data storage and processing.

Course Overview and Basic Concepts

Problem?
 In today's age, data is growing rapidly, and so are expectations about how quickly it can be analyzed.

 Organizations are faced with three options these days:
 Add hardware.
 Consider other ways to manage your data.
 Do nothing.
Prerequisites
 Basic knowledge of Web
 Knowledge of Database
 Working knowledge of C++ Programming
 Basics of Mathematics and Statistics
Course Overview
The course will cover:
 Basic concepts of Big Data
 Storage and processing of Big Data
 Various big data mining and machine learning algorithms
 The Python programming language
 Different data acquisition techniques
 The architecture of a search engine as an application of Big Data storage and retrieval
Data and its Types
What is Data?
Types of Data
 Structured

 Unstructured

 Semi-Structured
Structured Data
 Information stored in databases is known as structured data because it is represented in a strict format.
 The DBMS then checks to ensure that all data follows the structures and constraints specified in the schema.

Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe
Structured Data (Cont'd)
 Structured data tends to refer to information in "tables"

Employee   Manager   Salary
Smith      Jones     50000
Chang      Smith     60000
Ivy        Smith     50000

 Typically allows numerical range and exact match (for text) queries, e.g.,
Salary < 60000 AND Manager = Smith.
Unstructured Data
 Much unstructured data has an internal structure but does not fit a relational data model.
 Examples include:
 Personal messaging – email, instant messages, tweets,
chat
 Business documents – business reports, presentations,
survey responses
 Web content – web pages, blogs, wikis, audio files, photos,
videos
 Sensor output – satellite imagery, geolocation data,
scanner transactions

Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe


Semi-Structured Data
 Some data may have a certain structure, but not all the information collected will have an identical structure. This type of data is known as semi-structured data.
 Semi-structured data is data that has some structure, but the structure may not be rigid, regular, or complete.
 Generally, the data does not conform to a fixed schema (the terms schema-less or self-describing are sometimes used).

Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe


Example: XML
<bib>
<book year="1995">
<title> Database Systems </title>
<author> <lastname> Date </lastname> </author>
<publisher> Addison-Wesley </publisher>
</book>
<book year="1998">
<title> Foundation for Object/Relational Databases </title>
<author> <lastname> Date </lastname> </author>
<author> <lastname> Darwen </lastname> </author>
<ISBN> <number> 01-23-456 </number> </ISBN>
</book>
</bib>
Type of available data generated and
stored by sector
Data Visualization and Processing
Why Data Visualization and Processing?
• Raw data can be very difficult for end users to understand, even for those with advanced technical skills.

• This data needs to be processed and prepared in a form that is easy to understand.

• Data processing and visualization are essential to facilitate the interpretation of data and the retrieval of hidden information.
Why Data Visualization and Processing?
• It is much easier to see trends and patterns in a visual representation.

• It is easier to make comparisons in a visual representation.
Anscombe's Quartet
 Anscombe's quartet is four datasets with nearly identical summary statistics (mean, variance, correlation, regression line) that look completely different when plotted.

Copyright 2015 Keith Andrews
Sales for 2012
Table 2: Sales for 2012 by salesperson. Table of fictional sales data. Compare the sales figures of the two salespeople.
Copyright 2015 Keith Andrews

Sales for 2012
Figure 2: Sales for 2012 by salesperson. Line chart of the same sales data. It is much easier to see the trends and compare the data when it is presented visually.
Copyright 2015 Keith Andrews

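A minimal sketch of how such a line chart could be produced in Python with matplotlib (the salesperson names and figures below are invented for illustration):

import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
alice = [10, 12, 15, 14, 18, 21]   # hypothetical monthly sales
bob = [9, 11, 10, 12, 11, 13]

plt.plot(months, alice, label="Alice")
plt.plot(months, bob, label="Bob")
plt.title("Sales for 2012 by salesperson")
plt.xlabel("Month")
plt.ylabel("Units sold")
plt.legend()
plt.show()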

History of Data Visualization
 Originated from statistics and science
 Advancements made by NCSA (National Center for Supercomputing Applications)
 Newest developments by Xerox in virtual reality

Applications of Data
Visualization Techniques
 Retail Banking
 Government
 Insurance
 Health Care and Medicine
 Telecommunications
 Transportation
 Capital Markets
 Asset Management
Data Visualization
Example: Line Chart for U.S. Exports to China

Copyright © 2013 Pearson Education, Inc. publishing as Prentice Hall
Data Visualization
Example: Pie Chart for Census Data
Figure 3.8
Copyright © 2013 Pearson Education, Inc. publishing as Prentice Hall

Marey, 1885
Copyright 2015 Keith Andrews

Snow's Cholera Map, 1855
Copyright 2015 Keith Andrews
Introduction to Big Data
What is Big Data?
 Big data is data that exceeds the processing capacity of
conventional database systems.
 The data is too big, moves too fast, or doesn’t fit the
structures of your database architectures.
 To gain value from this data, you must choose an
alternative way to process it.
Who generates Big Data?
 Web data, e-commerce
 Purchases at department/grocery stores
 Bank/credit card transactions
 Mobile devices
 Scientific instruments
 Sensor technology and networks
 Social media and networks, etc.
Data Measurement Chart
How much data?
 Google processes 20 PB a day (2008)
 Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
 eBay has 6.5 PB of user data + 50 TB/day (5/2009)

Challenges in Big Data
 Capture
 Storage
 Search
 Share
 Transfer
 Analysis
 Visualization
Characterization of Big Data:
Volume, Velocity, and Variety (V3)
 Volume
the amount of data
 Velocity
the speed at which data is collected, acquired, generated, or processed
 Variety
different data types, such as audio, video, and image data (mostly unstructured)
BD Storage and Processing Needs
Big Data Storage Needs
 One of the key characteristics of big data applications
is that they demand real-time or near real-time
responses.
 Example: A financial application needs to pull data
from a variety of sources quickly to present traders
with correlated information that allows them to make
buy or sell decisions ahead of the competition.
 Data volumes are growing very quickly
 This means that big data infrastructures tend to
demand high processing/IOPS performance and very
large capacity.
What is Big Data Storage?
 Big data storage is a storage infrastructure that is
designed specifically to store, manage and retrieve
massive amounts of data, or big data.
 Big data storage enables the storage and sorting of big
data in such a way that it can easily be accessed, used
and processed by applications and services working on
big data.
 Big data storage is also able to flexibly scale as
required.
Key Requirements of Big Data Storage
• Scalability
• High availability
• Security
• Accessibility
Traditional Storage Options
Big Data Storage Choices
Big Data Storage Architecture
 The storage infrastructure is connected to computing server nodes that enable quick processing and retrieval of big quantities of data.
 A typical big data storage architecture is made up of a redundant and scalable supply of direct attached storage (DAS) pools, scale-out or clustered network attached storage (NAS), or an infrastructure based on an object storage format.
Big Data Storage Architecture (Cont'd)
 Moreover, most big data storage architectures/infrastructures have native support for big data analytics solutions such as Hadoop, Cassandra and NoSQL.
Scale-out NAS
Object Storage
Hyperscale Storage
BD Processing: Most Popular Technologies
 Apache Hadoop
 NoSQL databases
 Java
Apache Hadoop
 Apache Hadoop is a framework that allows for the
distributed processing of large data sets across clusters
of commodity computers using a simple programming
model.
 Hadoop features include:
 Flexible
 Reliable
 Economical
 Scalable
NoSQL Databases
 Class of non-relational data storage systems
 Usually do not require a fixed table schema, nor do they use the concept of joins

Examples:
 HBase
 Cassandra
 MongoDB
 Neo4j
Python
 Created in 1991 by Guido van Rossum (now at Google)
 Named for Monty Python
 Useful as a scripting language
 script: a small program meant for one-time use
 Targeted towards small to medium sized projects
 Used by:
 Google, Yahoo!, YouTube
 Many Linux distributions
 Games and apps (e.g. Eve Online)
Why Python?
 Easy to learn
 Easy to read
 Easy to maintain
 A broad standard library
 Interactive mode
 Portable
 Extendable
 Databases
 GUI programming
 Scalable
Installing Python

Windows:
 Download Python from http://www.python.org
 Install Python.
 Run Idle from the Start Menu.

Mac OS X:
 Python is already installed. Open a terminal and run python, or run Idle from Finder.

Linux:
 Chances are you already have Python installed. To check, run python from the terminal.
 If not, install from your distribution's package system.
Interpreted Language
 Interpreted
 Not compiled like Java
 Code is written and then directly executed by an interpreter
 Type commands into the interpreter and see immediate results

Java:   Code -> Compiler -> Runtime Environment -> Computer
Python: Code -> Interpreter -> Computer
Python Interpreter
 Allows you to type commands one at a time and see results
 A great way to explore Python's syntax
 Repeat previous command: Alt+P
Basic Syntax
 Console output: System.out.println
 Methods: public static void name() { ...

Hello.java
public class Hello {
    public static void main(String[] args) {
        hello();
    }

    public static void hello() {
        System.out.println("Hello, world!");
    }
}
Our First Python Program
 Python does not have a main method like Java
 The program's main code is just written directly in the file
 Python statements do not end with semicolons

hello.py
print("Hello, world!")
The Print Statement
print("text")
print() (a blank line)

 Escape sequences such as \" are the same as in Java
 Strings can also start/end with '

swallows.py
print("Hello, world!")
print()
print("Suppose two swallows \"carry\" it together.")
print('African or "European" swallows?')
Comments
 Syntax:
# comment text (one line)

swallows2.py
# Suzy Student, CSE 142, Fall 2097
# This program prints important messages.
print("Hello, world!")
print()  # blank line
print("Suppose two swallows \"carry\" it together.")
print('African or "European" swallows?')
Functions
 Function: equivalent to a static method in Java.
 Syntax:
def name():
    statement
    statement
    ...
    statement

 Must be declared above the 'main' code
 Statements inside the function must be indented (see the sketch below)
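A minimal sketch of a function definition and call (hypothetical example):

def hello():
    print("Hello, world!")
    print("How are you?")

# 'main' code, below the definition
hello()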
Whitespace Significance
 Python uses indentation to indicate blocks, instead of {}
 Makes the code simpler and more readable
 In Java, indenting is optional. In Python, you must indent.
Variables
 Declaring
 no type is written; same syntax as assignment
 Operators
 no ++ or -- operators (must manually adjust by 1)

Java                       Python
int x = 2;                 x = 2
x++;                       x = x + 1
System.out.println(x);     print(x)
x = x * 8;                 x = x * 8
System.out.println(x);     print(x)
double d = 3.2;            d = 3.2
d = d / 2;                 d = d / 2
System.out.println(d);     print(d)
Constants
 Python doesn't really have constants.
 Instead, declare a variable at the top of your code.
 All methods will be able to use this "constant" value (see the sketch below).
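A minimal sketch (hypothetical name):

# conventionally written in ALL_CAPS at the top of the file
MAX_OCCUPANCY = 100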
Types
 Python is looser about types than Java
 Variables' types do not need to be declared
 Variables can change types as a program is running
 Standard data types used in Python are Numbers, String, List, Tuple and Dictionary

Value    Java type    Python type
42       int          int
3.14     double       float
"ni!"    String       str
String
 Python strings can be multiplied by an integer.
 The result is many copies of the string concatenated together.
 The plus (+) sign is the string concatenation operator and the asterisk (*) sign is the repetition operator

>>> "hello" * 3
'hellohellohello'

>>> print(10 * "yo ")
yo yo yo yo yo yo yo yo yo yo

>>> print(2 * 3 * "4")
444444
String Concatenation
 Integers and strings cannot be concatenated in Python.
 Workarounds:
 str(value) - converts a value into a string
 print(value1, value2) - prints the values separated by a space

>>> x = 4
>>> print("Thou shalt not count to " + x + ".")
TypeError: cannot concatenate 'str' and 'int' objects

>>> print("Thou shalt not count to " + str(x) + ".")
Thou shalt not count to 4.

>>> print(x + 1, "is out of the question.")
5 is out of the question.
Operators - Arithmetic
 Arithmetic is very similar to Java
 Operators: + - * / % (plus ** for exponentiation)
 Precedence: () before ** before * / % before + -
 You may use // for integer division

(The examples below assume a = 10 and b = 20.)
Operator              Example
+   Addition          a + b = 30
-   Subtraction       a - b = -10
*   Multiplication    a * b = 200
/   Division          b / a = 2
%   Modulus           b % a = 0
**  Exponent          a ** b = 10 to the power 20
//  Floor Division    9 // 2 = 4 and 9.0 // 2.0 = 4.0
Operators - Comparison
 Compare the values on either side of them and decide the relation between them
 Also called relational operators

Operator   Meaning                    Example       Result
==         equals                     1 + 1 == 2    True
!=         does not equal             3.2 != 2.5    True
<          less than                  10 < 5        False
>          greater than               10 > 5        True
<=         less than or equal to      126 <= 100    False
>=         greater than or equal to   5.0 >= 5.0    True
Operators - Assignment
 List of assignment operators:

Operator           Example
=                  c = a + b assigns the value of a + b into c
+=  Add AND        c += a is equivalent to c = c + a
-=  Subtract AND   c -= a is equivalent to c = c - a
*=  Multiply AND   c *= a is equivalent to c = c * a
/=  Divide AND     c /= a is equivalent to c = c / a
%=  Modulus AND    c %= a is equivalent to c = c % a
**= Exponent AND   c **= a is equivalent to c = c ** a
Operators - Bitwise
 Work on bits and perform bit-by-bit operations

(The examples below assume a = 60, i.e. 0011 1100, and b = 13, i.e. 0000 1101.)
Operator                      Example
&   Binary AND                (a & b) = 12 (means 0000 1100)
|   Binary OR                 (a | b) = 61 (means 0011 1101)
^   Binary XOR                (a ^ b) = 49 (means 0011 0001)
~   Binary Ones Complement    (~a) = -61 (means 1100 0011 in 2's complement form, due to a signed binary number)
<<  Binary Left Shift         a << 2 = 240 (means 1111 0000)
>>  Binary Right Shift        a >> 2 = 15 (means 0000 1111)
Operators - Logical
 Logical operators supported by Python:

Operator   Example                 Result
and        (2 == 3) and (-1 < 5)   False
or         (2 == 3) or (-1 < 5)    True
not        not (2 == 3)            True
Decision Making
 Anticipating conditions that occur during the execution of a program and specifying actions to take according to those conditions
 Evaluates multiple expressions which produce TRUE or FALSE as the outcome
If Condition
if condition:
    statements

 Example:
gpa = float(input("What is your GPA? "))
if gpa > 2.0:
    print("Your application is accepted.")

(input() returns a string, so it is converted with float() before the numeric comparison.)
If/Else
Syntax:
if condition:
    statements
elif condition:
    statements
else:
    statements

Example:
gpa = float(input("What is your GPA? "))
if gpa > 3.5:
    print("You have qualified for the honor roll.")
elif gpa > 2.0:
    print("Welcome to Mars University!")
else:
    print("Your application is denied.")
If..in
Syntax:
if value in sequence:
    statements

 The sequence can be a range, string, tuple, or list

Example:
x = 3
if x in range(0, 10):
    print("x is between 0 and 9")

name = input("What is your name? ")
name = name.lower()
if name[0] in "aeiou":
    print("Your name starts with a vowel!")
Loops
 A loop statement allows us to execute a statement or group of statements multiple times.
For Loop
for name in range(max):
    statements

 Repeats for values 0 (inclusive) to max (exclusive)

>>> for i in range(5):
...     print(i)
0
1
2
3
4
For Loop Variations
for name in range(min, max):
    statements
for name in range(min, max, step):
    statements

 Can specify a minimum other than 0, and a step other than 1

>>> for i in range(2, 6):
...     print(i)
2
3
4
5

>>> for i in range(15, 0, -5):
...     print(i)
15
10
5
Nested Loop
 Nested loops are often replaced by string * and +

Desired output:
....1
...2
..3
.4
5

Java:
for (int line = 1; line <= 5; line++) {
    for (int j = 1; j <= (5 - line); j++) {
        System.out.print(".");
    }
    System.out.println(line);
}

Python:
for line in range(1, 6):
    print((5 - line) * "." + str(line))
While Loop
while test:
    statements

>>> n = 91
>>> factor = 2  # find first factor of n
>>> while n % factor != 0:
...     factor += 1
...
>>> factor
7
Tuple
 Sequence of immutable Python objects
 Tuples are sequences, just like lists
 The differences between tuples and lists are that tuples cannot be changed, unlike lists, and tuples use parentheses, whereas lists use square brackets
 tuple_name = (value, value, ..., value)
 A way of "packing" multiple values into one variable

>>> x = 3
>>> y = -5
>>> p = (x, y, 42)
>>> p
(3, -5, 42)
Tuple ... Continued
 name, name, ..., name = tuple_name
 "unpacking" a tuple's contents into multiple variables
>>> a, b, c = p
>>> a
3
>>> b
-5
>>> c
42

 Useful for storing multi-dimensional data (e.g. (x, y) points)
>>> p = (42, 79)
Tuple ... Continued
 Useful for returning more than one value

>>> from random import *
>>> def roll2():
...     die1 = randint(1, 6)
...     die2 = randint(1, 6)
...     return (die1, die2)
...
>>> d1, d2 = roll2()
>>> d1
6
>>> d2
4
Tuple as Parameter
def name((name, name, ..., name), ...):
    statements
 Declares a tuple as a parameter by naming each of its pieces
 (Note: tuple parameter unpacking, as used below, is Python 2 syntax; it was removed in Python 3.)

>>> def slope((x1, y1), (x2, y2)):
...     return (y2 - y1) / (x2 - x1)
...
>>> p1 = (2, 5)
>>> p2 = (4, 11)
>>> slope(p1, p2)
3
Tuple as Return
def name(parameters):
    statements
    return (name, name, ..., name)

>>> from random import *
>>> def roll2():
...     die1 = randint(1, 6)
...     die2 = randint(1, 6)
...     return (die1, die2)
...
>>> d1, d2 = roll2()
>>> d1
6
>>> d2
4
Dictionaries
 Are similar to Map in Java
 Dictionaries store a mapping between a set of keys and a set of values.
 Keys can be any immutable type.
 Values can be any type.
 Values and keys can be of different types in a single dictionary
 You can
 define
 modify
 view
 look up
 delete
the key-value pairs in the dictionary
Creating and Accessing Values in Dictionaries
 To access dictionary elements, you can use the familiar square brackets along with the key to obtain its value, as in the sketch below.
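A minimal sketch (the slide's original screenshot is not in the text; the names are hypothetical):

phonebook = {"alice": 4127, "bob": 4098}   # create a dictionary
print(phonebook["alice"])                  # access by key -> 4127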
Updating Dictionaries
 Update a dictionary by adding a new entry or key-value pair, modifying an existing entry, or deleting an existing entry (see the sketch below)
 Keys must be unique
 Assigning to an existing key replaces its value
 Dictionaries are unordered
 A new entry might appear anywhere in the output.
 (Dictionaries work by hashing)
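A minimal sketch (hypothetical values):

d = {"name": "Zara", "age": 7}
d["age"] = 8           # modify an existing entry
d["school"] = "DPS"    # add a new entry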
Removing Dictionary Elements
 There are several ways of removing dictionary contents:
 Either remove individual dictionary elements, or
 Clear the entire contents of a dictionary. The entire dictionary can also be deleted in a single operation (see the sketch below).
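A minimal sketch (continuing the hypothetical d above):

del d["name"]   # remove an individual entry
d.clear()       # remove all entries
del d           # delete the entire dictionary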
Properties of Dictionary Keys
 Dictionary values have no restrictions. They can be any arbitrary Python object, either standard objects or user-defined objects. However, the same is not true for the keys.
 Two important points to remember about dictionary keys:
 More than one entry per key is not allowed, which means no duplicate key is allowed. When duplicate keys are encountered during assignment, the last assignment wins.
Properties of Dictionary Keys .. Continued
 Keys must be immutable, which means you can use strings, numbers or tuples as dictionary keys, but something like ['key'] is not allowed (see the sketch below).
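A minimal sketch of both rules:

d = {"name": "Zara", "name": "Manni"}
print(d["name"])           # 'Manni' -- the last assignment wins

# d2 = {["name"]: "Zara"}  # TypeError: unhashable type: 'list'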
Functions
Calling a Function
 Defining a function only gives it a name, specifies the parameters that are to be included in the function, and structures the blocks of code
 Once the basic structure of a function is finalized, you can execute it by calling it from another function or directly from the Python prompt. Below is a sketch of calling a printme() function:
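A minimal sketch (the slide's original code is not in the text):

def printme(text):
    "Prints the passed string"
    print(text)

printme("This is a first call to the user defined function!")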
Functions without Returns
 All functions in Python have a return value
 even if there is no return line inside the code.
 Functions without a "return" return the special value None.
 None is a special constant in the language.
 None is used like null in Java.
 None is also logically equivalent to False.
 The interpreter doesn't print None (see the sketch below)
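A minimal sketch:

def greet():
    print("hello")   # no return statement

result = greet()
print(result is None)   # True -- greet() returned None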
Function Overloading
 There is no function overloading in Python.
 Unlike Java, a Python function is specified by its name alone
 The number, order, names, or types of its arguments cannot be used to distinguish between two functions with the same name.
 Two different functions can't have the same name, even if they have different numbers of arguments.
 But operator overloading - overloading +, ==, -, etc. - is possible using special methods on various classes
Functions Behave Like Objects
 Functions can be used just like any other data
 They can be
 Arguments to functions
 Return values of functions
 Assigned to variables
 Parts of tuples, lists, etc. (see the sketch below)
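A minimal sketch:

def square(n):
    return n * n

f = square          # assigned to a variable
print(f(5))         # 25

def apply_twice(fn, x):   # a function passed as an argument
    return fn(fn(x))

print(apply_twice(square, 2))   # 16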
Lambda Notation
 Functions can be defined without giving them names, like anonymous inner classes in Java
 This is most useful when passing a short function as an argument to another function, as in the sketch below.
 The first argument to apply() is an unnamed function that takes one input and returns the input multiplied by four.
 Note: only single-expression functions can be defined using this lambda notation.
 Lambda notation has a rich history in CS research and the design of many current programming languages.
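A minimal sketch (the slide's original code is not in the text; apply() was a Python 2 built-in, so a hypothetical helper of the same shape is defined here):

def apply(fn, x):
    return fn(x)

print(apply(lambda n: n * 4, 3))   # 12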
Default Values for Arguments
 You can provide default values for a function's arguments
 These arguments are optional when the function is called
 All of the function calls in the sketch below return 8
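A minimal sketch consistent with the slide's claim that every call returns 8 (the names are hypothetical):

def my_sum(x=5, y=3):
    return x + y

print(my_sum())       # 8
print(my_sum(5))      # 8
print(my_sum(5, 3))   # 8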
Keyword Arguments
 Functions can be called with arguments out of order
 These arguments are specified in the call
 Keyword arguments can be used for a final subset of the arguments (see the sketch below)
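A minimal sketch (hypothetical function):

def describe(name, age, city="Lahore"):
    print(name, age, city)

describe(age=7, name="Zara")          # out of order, via keywords
describe("Zara", 7, city="Karachi")   # keywords for a final subset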
Import and Modules
 Programs will often use classes & functions defined in another file
 A Python module is a single file with the same name (plus the .py extension)
 Modules can contain many classes and functions
 Access them using import (like Java)

 Where does Python look for module files?
 The list of directories where Python looks: sys.path
 When Python starts up, this variable is initialized from the PYTHONPATH environment variable
 To add a directory of your own to this list, append it to this list:
 sys.path.append('/my/new/path')
 Oops! Operating system dependent...
Import I
 import somefile
 Everything in somefile.py can be referred to by:
somefile.className.method("abc")
somefile.myFunction(34)

 from somefile import *
 Everything in somefile.py can be referred to by:
className.method("abc")
myFunction(34)
 Careful! This can overwrite the definition of an existing function or variable!
Import II
 from somefile import className
 Only the item className in somefile.py gets imported.
 Refer to it without a module prefix.
 Caveat! This can overwrite an existing definition.

className.method("abc")   # this was imported
myFunction(34)            # not this one
Commonly Used Modules
 Some useful modules, included with Python:
 Module: sys - lots of handy stuff
 Module: os - OS-specific code
 Module: os.path - directory processing

 The Python standard library has lots of other useful stuff...
More Commonly Used Modules
 Module: math - mathematical functions
 exponents
 sqrt
 Module: random - random numbers
 randrange (good for simulations, games, ...)
 uniform
 choice
 shuffle

 To see what's in the standard library of modules, check out the Python Library Reference: http://docs.python.org/lib/lib.html
Files I/O: Create a File Object
 Python interacts with files through file objects
 Instantiate a file object with open or file
f = open(filename, option)
 At the end, use f.close()
 filename is a string containing the location of a file
 option can be:
 r to read a file (the default option)
 w to write a file
 a to append to a file
 r+ to read and write
 rb for binary reading
 wb for binary writing
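A minimal sketch (hypothetical filename):

f = open("foo.txt", "w")     # open for writing
f.write("Python is great\n")
f.close()                    # always close when done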
Files I/O: Reading Files
 The file object keeps track of its current location
 Example in the first few lines of reader.py
 Some useful commands:
 read()       Reads the whole remaining file by default in 'r' mode
              Reads the next n characters if an argument is given in 'r' mode
              Reads the next n bytes in 'rb' mode
 readline()   Reads a single line per call
 readlines()  Returns a list of lines (splits at newline)
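A minimal sketch (assuming foo.txt exists):

f = open("foo.txt", "r")
first10 = f.read(10)     # the next 10 characters
line = f.readline()      # one line
rest = f.readlines()     # remaining lines as a list
f.close()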
Files I/O: with open() as fileObject
 The with statement opens a file and automatically closes it when the block exits, even if an exception occurs (see the sketch below).
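A minimal sketch:

with open("foo.txt") as f:
    for line in f:
        print(line.rstrip())
# f is closed automatically here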
Files I/O: The read() Method
 The read() method reads a string from an open file. It is important to note that Python strings can contain binary data, apart from text data.
 Syntax: fileObject.read([count])
 Here, the passed parameter is the number of bytes to be read from the opened file. This method starts reading from the beginning of the file and, if count is missing, it tries to read as much as possible, maybe until the end of the file.
 Let us take the file foo.txt, which we created above (see the sketch below).
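A minimal sketch:

fo = open("foo.txt", "r+")
s = fo.read(10)                # read the first 10 bytes
print("Read String is:", s)
fo.close()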
Files I/O: File Positions
 The tell() method tells you the current position within the file; in other words, the next read or write will occur that many bytes from the beginning of the file.
 The seek(offset[, from]) method changes the current file position. The offset argument indicates the number of bytes to be moved. The from argument specifies the reference position from which the bytes are to be moved.
 Let us take the file foo.txt, which we created above (see the sketch below).
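A minimal sketch:

fo = open("foo.txt", "r+")
s = fo.read(10)
print("Current position:", fo.tell())   # 10
fo.seek(0, 0)                           # rewind to the beginning
print("Again:", fo.read(10))            # the same 10 bytes again
fo.close()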
Files I/O: Writing Files
 Use write() to write to a file
 The write() method writes any string to an open file. It is important to note that Python strings can contain binary data and not just text.
 The write() method does not add a newline character ('\n') to the end of the string:
 Syntax: fileObject.write(string)
 See the example in writer.py (a sketch follows below)
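A minimal sketch of what writer.py might contain:

fo = open("foo.txt", "w")
fo.write("Python is a great language.\nYeah, it's great!!\n")
fo.close()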
Files I/O: Renaming Files
 Use the rename() method to rename a file
 The rename() method takes two arguments, the current filename and the new filename.
 Syntax: os.rename(current_file_name, new_file_name)
 Below is a sketch of renaming an existing file test1.txt
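A minimal sketch:

import os

os.rename("test1.txt", "test2.txt")   # rename test1.txt to test2.txt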
Files I/O: Deleting Files
 You can use the remove() method to delete files by supplying the name of the file to be deleted as the argument.
 Syntax: os.remove(file_name)
 Below is a sketch of deleting an existing file test2.txt:
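A minimal sketch:

import os

os.remove("test2.txt")   # delete test2.txt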
Files I/O: Useful Tools & Tips for I/O with Binary Files
 When you read and write, use the struct module's pack and unpack functions: https://docs.python.org/2/library/struct.html
 When you unpack, you must take the [0] element of the tuple
 Use the seek function to move to a particular location for reading/writing
 If you get nonsense, try swapping the byte order (little/big endian, denoted by >, <, @, !, =)
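A minimal sketch:

import struct

data = struct.pack("<i", 42)           # pack an int, little-endian
value = struct.unpack("<i", data)[0]   # unpack returns a tuple; take [0]
print(value)                           # 42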
Exceptions
 Exceptions are events that can modify the flow of control through a program.
 They are automatically triggered on errors.
 try/except: catch and recover from exceptions raised by you or by Python
 try/finally: perform cleanup actions whether exceptions occur or not
 raise: trigger an exception manually in your code
 assert: conditionally trigger an exception in your code
Exception Roles
 Error handling
 Wherever Python detects an error, it raises exceptions
 Default behavior: stops the program.
 Otherwise, code can try to catch and recover from the exception (try handler)
 Event notification
 Can signal a valid condition (for example, in search)
 Special-case handling
 Handles unusual situations
 Termination actions
 Guarantees the required closing-time operations (try/finally)
 Unusual control flows
 A sort of high-level "goto"
try/except/else
try:
    You do your operations here
    ......................
except ExceptionI:
    If there is ExceptionI, then execute this block.
except ExceptionII:
    If there is ExceptionII, then execute this block.
    ......................
else:
    If there is no exception, then execute this block.
try/except/else - Example
 The sketch below tries to open a file where you do not have write permission, so it raises an exception:
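A minimal sketch (hypothetical filename):

try:
    fh = open("testfile", "w")
    fh.write("This is my test file for exception handling!!")
except IOError:
    print("Error: can't find file or read data")
else:
    print("Written content in the file successfully")
    fh.close()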
try/else
 else is used to verify that no exception occurred in try.
 You can always eliminate else by moving its logic to the end of the try block.
 However, if that trailing logic then triggered an exception, it would be misclassified as an exception from the try block; else avoids this.
try/finally
 In try/finally, the finally block is always run whether an exception occurs or not
try:
    <block of statements>
finally:
    <block of statements>

 Ensures some actions are done in any case
 In older versions of Python (before 2.5), finally could not be combined with except and else in the same try statement.
try/finally - Example
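A minimal sketch of the missing example:

try:
    fh = open("testfile", "w")
    fh.write("This is my test file for exception handling!!")
finally:
    print("Going to close the file")
    fh.close()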
raise
 raise triggers exceptions explicitly
raise <name>
raise <name>, <data>  # provide data to handler
raise                 # re-raise last exception

>>> try:
...     raise 'zero', (3, 0)
... except 'zero':
...     print "zero argument"
... except 'zero', data:
...     print data

 (Note: string exceptions and the raise <name>, <data> form are Python 2 syntax; modern Python uses exception classes and raise Name(args).)
 The last form may be useful if you want to propagate a caught exception to another handler.
 Exception name: built-in name, string, user-defined class
raise - Example
 An exception can be a string, a class or an object. Most of the exceptions that the Python core raises are classes, with an argument that is an instance of the class. Defining new exceptions is quite easy and can be done as in the sketch below.
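A minimal sketch (hypothetical exception class):

class NetworkError(Exception):
    def __init__(self, message):
        self.message = message

try:
    raise NetworkError("connection refused")
except NetworkError as e:
    print(e.message)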
Assert & Exception Objects
 An exception can be a string, a class or an object. Most of the exceptions that the Python core raises are classes, with an argument that is an instance of the class. Defining new exceptions is quite easy.
 Exception objects
 String-based exceptions are any string object
 Class-based exceptions are identified with classes. They also identify categories of exceptions.
 String exceptions are matched by object identity: is
 Class exceptions are matched by superclass identity: except catches instances of the mentioned class and instances of all its subclasses lower in the class tree.
Built-in Exception Classes
 Exception - top-level root superclass of exceptions.
 StandardError - the superclass of all built-in error exceptions (Python 2).
 ArithmeticError - the superclass of all numeric errors.
 OverflowError - a subclass that identifies a specific numeric error.

>>> import exceptions
>>> help(exceptions)
Classes & Objects – Defining a Class
 Python was built as a procedural language
 OOP exists and works fine, but feels a bit more "tacked on"
 Java probably does classes better than Python (gasp)

 Declaring a class:
class name:
    statements
Classes & Objects – Fields of a Class
 name = value
 Example:
class Point:
    x = 0
    y = 0

# main
p1 = Point()
p1.x = 2
p1.y = -5

 Fields can be declared directly inside the class (as shown here) or in constructors (more common)
 Python does not really have encapsulation or private fields
 Relies on the caller to "be nice" and not mess with objects' contents
Classes & Objects – Class Example
 Below is a sketch of a simple Python class:
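A minimal sketch (the slide's original screenshot is not in the text; the class is hypothetical):

class Employee:
    "Common base class for all employees"

    def __init__(self, name, salary):
        self.name = name
        self.salary = salary

    def display(self):
        print("Name:", self.name, ", Salary:", self.salary)

emp = Employee("Zara", 2000)
emp.display()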
Classes & Objects – Using a Class
 import class
 Client programs must import the classes they use (see the sketch below)
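A minimal sketch (assuming a Point class is saved in point.py):

from point import Point

p = Point(3, -4)
print(p)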
Classes & Objects – Object Methods
def name(self, parameter, ..., parameter):
    statements

 self must be the first parameter to any object method
 represents the "implicit parameter" (this in Java)
 must access the object's fields through the self reference

class Point:
    def translate(self, dx, dy):
        self.x += dx
        self.y += dy
Classes & Objects – "Implicit" Parameter
(self)
 Java: this, implicit
public void translate(int dx, int dy) {
x += dx; // this.x += dx;
y += dy; // this.y += dy;
}
 Python: self, explicit
def translate(self, dx, dy):
self.x += dx
self.y += dy
1
Classes & Objects – "Implicit" Parameter
(self) - Example
 Declaration and definition of distance, set_location, and distance_from_origin
methods

8
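A minimal sketch of the three named methods (the slide's original code is not in the text):

from math import sqrt

class Point:
    def set_location(self, x, y):
        self.x = x
        self.y = y

    def distance_from_origin(self):
        return sqrt(self.x ** 2 + self.y ** 2)

    def distance(self, other):
        dx = self.x - other.x
        dy = self.y - other.y
        return sqrt(dx ** 2 + dy ** 2)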
Classes & Objects – Calling Methods
 A client can call the methods of an object in two ways:
 (the value of self can be an implicit or explicit parameter)
 object.method(parameters) or
 Class.method(object, parameters)

 Example:
p = Point(3, -4)
p.translate(1, 5)
Point.translate(p, 1, 5)
Classes & Objects – Constructors
def __init__(self, parameter, ..., parameter):
    statements
 a constructor is a special method with the name __init__

Example:
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y
Classes & Objects – toString and __str__
def __str__(self):
    return string

 equivalent to Java's toString (converts an object to a string)
 invoked automatically when str or print is called

 Example of a __str__ method for Point objects that returns strings like "(3, -14)":
def __str__(self):
    return "(" + str(self.x) + ", " + str(self.y) + ")"
Classes & Objects – Complete Point Class
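A minimal sketch of what the complete class might look like, combining the pieces shown above (the slide's original code is not in the text):

from math import sqrt

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def translate(self, dx, dy):
        self.x += dx
        self.y += dy

    def distance_from_origin(self):
        return sqrt(self.x ** 2 + self.y ** 2)

    def __str__(self):
        return "(" + str(self.x) + ", " + str(self.y) + ")"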
Classes & Objects – Operator Overloading
 operator overloading: You can define functions so that Python's built-in operators can be used with your class.
 See also: http://docs.python.org/ref/customization.html

Binary Operators
Operator   Class Method
+          __add__(self, other)
-          __sub__(self, other)
*          __mul__(self, other)
/          __truediv__(self, other)
==         __eq__(self, other)
!=         __ne__(self, other)
<          __lt__(self, other)
>          __gt__(self, other)
<=         __le__(self, other)
>=         __ge__(self, other)

Unary Operators
Operator   Class Method
-          __neg__(self)
+          __pos__(self)
Regular Expressions
 Regular expressions are a powerful string manipulation tool
 All modern languages have similar library packages for regular expressions
 Use regular expressions to:
 Search a string (search and match)
 Replace parts of a string (sub)
 Break strings into smaller pieces (split)
Regular Expression Python Syntax
 Most characters match themselves
 The regular expression "test" matches the string 'test', and only that string
 [x] matches any one of a list of characters
 "[abc]" matches 'a', 'b', or 'c'
 [^x] matches any one character that is not included in x
 "[^abc]" matches any single character except 'a', 'b', or 'c'
 "." matches any single character
 Parentheses can be used for grouping
 "(abc)+" matches 'abc', 'abcabc', 'abcabcabc', etc.
Regular Expression Python Syntax .. Cont
 x|y matches x or y
 "this|that" matches 'this' and 'that', but not 'thisthat'.
 x* matches zero or more x's
 "a*" matches '', 'a', 'aa', etc.
 x+ matches one or more x's
 "a+" matches 'a', 'aa', 'aaa', etc.
 x? matches zero or one x's
 "a?" matches '' or 'a'.
 x{m,n} matches i x's, where m <= i <= n
 "a{2,3}" matches 'aa' or 'aaa'
Regular Expression Python Syntax .. Cont
 "\d" matches any digit; "\D" matches any non-digit
 "\s" matches any whitespace character; "\S" matches any non-whitespace character
 "\w" matches any alphanumeric character; "\W" matches any non-alphanumeric character
 "^" matches the beginning of the string; "$" matches the end of the string
 "\b" matches a word boundary; "\B" matches a position that is not a word boundary
Search and Match
 The two basic functions are re.search and re.match
 Search looks for a pattern anywhere in a string
 Match looks for a match starting at the beginning
 Both return None if the pattern is not found (logical false) and a "match object" if it is

>>> pat = "a*b"
>>> import re
>>> re.search(pat, "fooaaabcde")
<_sre.SRE_Match object at 0x809c0>
>>> re.match(pat, "fooaaabcde")
Match Object
 The result is an instance of the match class, with the details of the match

pat = "a*b"
>>> r1 = re.search(pat, "fooaaabcde")
>>> r1.group()   # group returns the string matched
'aaab'
>>> r1.start()   # index of the match start
3
>>> r1.end()     # index of the match end
7
>>> r1.span()    # tuple of (start, end)
(3, 7)
Example: Email Address Matching
 Here's a pattern to match simple email addresses:
\w+@(\w+\.)+(com|org|net|edu)

>>> pat1 = "\w+@(\w+\.)+(com|org|net|edu)"
>>> r1 = re.match(pat1, "finin@cs.umbc.edu")
>>> r1.group()
'finin@cs.umbc.edu'

 We might want to extract the pattern parts, like the email name and host
Example: Email Address Matching .. Cont
 We can put parentheses around groups we want to be able to reference
>>> pat2 = "(\w+)@((\w+\.)+(com|org|net|edu))"
>>> r2 = re.match(pat2, "finin@cs.umbc.edu")
>>> r2.group(1)
'finin'
>>> r2.group(2)
'cs.umbc.edu'
>>> r2.groups()
('finin', 'cs.umbc.edu', 'umbc.', 'edu')
 Note that the 'groups' are numbered in a preorder traversal of the forest
Example: Email Address Matching .. Cont
 We can 'label' the groups as well...
>>> pat3 = "(?P<name>\w+)@(?P<host>(\w+\.)+(com|org|net|edu))"
>>> r3 = re.match(pat3, "finin@cs.umbc.edu")
>>> r3.group('name')
'finin'
>>> r3.group('host')
'cs.umbc.edu'
 And reference the matching parts by the labels
More re Functions
 re.split() is like split but can use patterns
>>> re.split("\W+", "This... is a test, short and sweet, of split().")
['This', 'is', 'a', 'test', 'short', 'and', 'sweet', 'of', 'split', '']
 re.sub substitutes one string for a pattern
>>> re.sub('(blue|white|red)', 'black', 'blue socks and red shoes')
'black socks and black shoes'
 re.findall() finds all matches
>>> re.findall("\d+", "12 dogs, 11 cats, 1 egg")
['12', '11', '1']
Methods of Pattern Objects
 There are methods defined for a pattern object that parallel the regular expression functions, e.g.,
 match
 search
 split
 findall
 sub
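A minimal sketch, assuming the pattern object is created with re.compile:

import re

pat = re.compile("\d+")   # compile once, reuse many times
print(pat.findall("12 dogs, 11 cats, 1 egg"))   # ['12', '11', '1']
print(pat.search("abc123").group())             # '123'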
Introduction
 Big data technologies are important in providing more accurate analysis.
 This may lead to more concrete decision-making, which results in:
 Greater operational efficiencies
 Cost reductions
 Reduced risks for the business
 To harness the power of big data, an infrastructure is required to:
 Manage and process huge volumes of structured and unstructured data in real time
 Protect data privacy and security
Classes of Technologies
 Two classes
 Operational
 Systems that provide operational capabilities for real-time,
interactive workloads where data is primarily captured and
stored.
 MongoDB
 NoSQL
 Analytical
 Systems that provide analytical capabilities for retrospective,
complex analysis that may touch most or all of the data
 Massively Parallel Processing (MPP) database systems
 MapReduce
Cont'd
 Both classes present opposing requirements
 Systems have evolved to address their particular demands separately and in very different ways
 Each has driven the creation of new technology architectures
 Operational systems focus on servicing highly concurrent requests while exhibiting low latency for responses, operating on highly selective access criteria
 Analytical systems tend to focus on high throughput;
 Queries can be very complex and touch most if not all of the data in the system at any time
 Both kinds of systems tend to operate over many servers in a cluster, managing tens or hundreds of terabytes of data across billions of records
Operational vs. Analytical Systems
Reference
 http://www.tutorialspoint.com/hadoop/index.htm
 https://www.mongodb.com/big-data-explained
Hadoop
 Doug Cutting and his team developed an open-source project called Hadoop, written in Java.
 It is used to develop applications that can perform complete statistical analysis on huge amounts of data.
 It uses the MapReduce algorithm.
 It allows distributed processing of large datasets across clusters of computers using simple programming models.
 It works with distributed storage and computation across clusters of computers.
 It is designed to scale up from a single server to thousands of machines, each offering local computation and storage.
Benefits of Hadoop
 Scalable
 Economical
 Efficient
 Reliable
 Computing power
When to use Hadoop
 When “standard tools” don’t work anymore because of
sheer size of data.
Hadoop Architecture
 Two major layers:
 Processing/computation layer (MapReduce)
 Storage layer (Hadoop Distributed File System)
Cont'd
 Apart from the two core components mentioned, the Hadoop framework also includes the following two modules:
 Hadoop Common: Java libraries and utilities required by other Hadoop modules.
 Hadoop YARN: a framework for job scheduling and cluster resource management.
Reference
 http://www.tutorialspoint.com/hadoop/index.htm
 https://www.mongodb.com/big-data-explained
Who uses Hadoop?
 Amazon/A9
 Facebook
 Google
 New York Times
 Veoh
 Yahoo!
 …. many more
Hadoop Daemons
 NameNode
 Runs on a "master node" that tracks and directs the storage of the cluster
 DataNode
 Runs on "slave nodes"
 The NameNode instructs data files to be split into blocks, each of which is replicated three times and stored on machines across the cluster.
 These replicas ensure the entire system won't go down if one server fails or is taken offline - known as "fault tolerance."
 JobTracker
 Oversees how MapReduce jobs are split up into tasks and divided among nodes within the cluster.
 TaskTracker
 Accepts tasks from the JobTracker, performs the work and alerts the JobTracker once it's done.
 TaskTrackers and DataNodes are located on the same nodes to improve performance.
Modes of Hadoop Framework
 Hadoop is used with three different modes:
 The standalone mode
 The pseudo mode
 The full distributed mode
The Standalone Mode
 No need to start any Hadoop daemons
 All daemons run in a single Java process
 Mostly used for debugging
 Don't really use HDFS
 Input and output is used from local file system
 Recommended for testing purposes
 Default mode, no need to configure anything else
The Pseudo Mode
 Hadoop daemons are all configured to run on a single host.
 A separate Java Virtual Machine (JVM) is spawned for each of the Hadoop components or daemons, like a mini cluster on a single host.
The Full Distributed Mode
 Hadoop is distributed across multiple machines.
 Dedicated hosts are configured for Hadoop
components
 Separate JVM processes are present for all daemons.
Hadoop Installation
 Hadoop can be installed on both Windows and Linux
 Best performance is on Linux
 Installation on Linux
 Requires a working Java 1.5+ installation
 Installation on Windows
 Hadoop
 Cloudera
 Hortonworks
 Sandbox
Cloudera Distribution Hadoop
 Cloudera Distribution Hadoop (CDH) is an open-source Apache Hadoop distribution provided by Cloudera Inc., a Palo Alto-based American enterprise software company
 It is the most complete, tested, and widely deployed distribution of Apache Hadoop
 It offers
 Batch processing
 Interactive SQL
 Interactive search
Cloudera Taxonomy
Requirements
 64-bit operating system
 9 GB RAM
 The VM from Cloudera is available in
 VMware
 VirtualBox
 KVM flavors
Step 1
 Download the VirtualBox executable file from https://www.virtualbox.org/wiki/Downloads
 Download VirtualBox 4.2.16 for Windows hosts
Step 2
 Install VirtualBox by double clicking on the
downloaded file.
Step 3
 Download the Cloudera QuickStart VM for VirtualBox
 Go to the link: https://ccp.cloudera.com/display/SUPPORT/Cloudera+QuickStart+VM
 Select the QuickStart VM for VirtualBox and click on download
Step 4
 Unzip the downloaded file.
 When you unzip the file cloudera-quickstart-vm-4.3.0-
virtualbox.tar you will find these two files in the
directory.
Step 5
 Open VirtualBox and click on "New" to create a new virtual machine
 Give a name for the new virtual machine and select type as Linux and version as Linux 2.6
Step 6
 Select Memory Size as 4GB and click Next.
Step 7
 On the next page, VirtualBox asks you to select a hard drive for the new virtual machine, as shown in the screenshot. "Create a virtual hard drive now" is selected by default, but you have to select the "Use an existing virtual hard drive file" option.
Step 8
 Click on the small yellow icon beside the dropdown to browse and select the cloudera-quickstart-vm-4.3.0-virtualbox-disk1.vm file (which was downloaded in step 4). Click on Create to create the Cloudera QuickStart VM.
Step 9
 Your VirtualBox window should look like the following screenshots. We can see the new virtual machine named Cloudera Hadoop on the left side.
Step 10
 Select Cloudera vm and click on “Start”
 Virtual Machine starts to boot
Step 11
 System is loaded and CDH is installed on virtual
machine.
Step 12
 System redirects you to the index page of Cloudera.
Step 13
 Select Cloudera Manager and Agree to the information
assurance policy.
Step 14
 Login to Cloudera Manager as admin. Password is
admin.
Step 15
 We can see all the services running on our single node
cluster.
Step 16
 Click on the Hosts tab and we can see that one host is running, the version of CDH installed on it is 4, the health of the host is good, and the last heartbeat was heard 5.22s ago.
Step 17
 Click on localhost.localdomain to see the detailed information about the host
Step 18
 We can also change the password for admin by
selecting the administration tab and clicking on
“Change Password” button.
Format the NameNode
 Need to format the NameNode to create a Hadoop Distributed File System (HDFS).
 Open a Cygwin terminal (Run as Administrator) and execute the following command.
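A likely form of the missing command (a sketch, assuming a standard Hadoop 1.x layout):

$ bin/hadoop namenode -format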

 The following message will be displayed:
"Storage Directory has been successfully formatted"
Start Hadoop Daemons
 Once the filesystem has been created, the next step is to check and start the Hadoop cluster daemons: NameNode, DataNode, SecondaryNameNode, JobTracker, TaskTracker.
 Restart the Cygwin terminal and execute the first command below to start all daemons on the Hadoop cluster.
 Start the Distributed File System
 The second command below starts the namenode as well as the datanodes as a cluster.
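Likely forms of the missing commands (a sketch, assuming the Hadoop bin/ scripts are on the PATH):

$ start-all.sh   # start all five daemons
$ start-dfs.sh   # start only the namenode and datanodes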
Listing Files in HDFS
 After loading the information in the server, the list of files in a directory and the status of a file can be found using 'ls'.
 Below is the syntax of ls; it can be passed a directory or a filename as an argument.
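A sketch of the standard syntax:

$ $HADOOP_HOME/bin/hadoop fs -ls <args>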
Inserting Data into HDFS
 Assume there is data in a file called file.txt in the local system which ought to be saved in the HDFS file system.
 Follow the steps given to insert the required file in the Hadoop file system.
 Step 1
 You have to create an input directory.
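A sketch (the directory name /user/input is a hypothetical example):

$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/input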
Cont'd
 Step 2
 Transfer and store a data file from the local system to the Hadoop file system using the put command.
 Step 3
 Verify the file using the ls command.
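A sketch (the paths are hypothetical):

$ $HADOOP_HOME/bin/hadoop fs -put /home/file.txt /user/input
$ $HADOOP_HOME/bin/hadoop fs -ls /user/input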
Retrieving Data from HDFS
 Step 1
 Initially, view the data from HDFS using the cat command.
 Step 2
 Get the file from HDFS to the local file system using the get command.
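A sketch (the paths are hypothetical):

$ $HADOOP_HOME/bin/hadoop fs -cat /user/input/file.txt
$ $HADOOP_HOME/bin/hadoop fs -get /user/input/file.txt /home/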
Shutting Down the HDFS
 Stop Hadoop HDFS
 Shut down the HDFS by using the first command below.
 Stop all Hadoop daemons
 To stop all the daemons, we can execute the second command below.
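Likely forms of the missing commands (a sketch):

$ stop-dfs.sh   # stop HDFS only
$ stop-all.sh   # stop all daemons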
Reference
 http://saphanatutorial.com/hadoop-installation-on-windows-7-using-cygwin/
 http://www.cloudera.com/developers/get-started-with-hadoop-tutorial/setup.html
 https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html
 http://blog.matthewrathbone.com/2013/04/17/what-is-hadoop.html
HDFS
 The Hadoop Distributed File System (HDFS) is based
on the Google File System (GFS)
 It provides a distributed file system that is designed to
run on commodity hardware.
 It has many similarities with existing distributed file
systems.
 It is highly fault-tolerant and is designed to be
deployed on low-cost hardware.
 It provides high throughput access to application data
and is suitable for applications having large datasets.
HDFS Features
 It is suitable for the distributed storage and processing.
 Hadoop provides a command interface to interact with
HDFS.
 The built-in servers of namenode and datanode help
users to easily check the status of cluster.
 Streaming access to file system data.
 HDFS provides file permissions and authentication.
HDFS Architecture
 HDFS follows the master-
slave architecture and it has
the following elements.
 Namenode
 Datanode
 Block
Namenode
 The namenode is the commodity hardware that
contains the GNU/Linux operating system and the
namenode software.
 The system having the namenode acts as the master
server and it does the following tasks:
 Manages the file system namespace.
 Regulates client’s access to files.
 It also executes file system operations such as renaming,
closing, and opening files and directories.
Datanode
 The datanode is a commodity hardware having the
GNU/Linux operating system and datanode software.
 For every node (Commodity hardware/System) in a
cluster, there will be a datanode.
 These nodes manage the data storage of their system.
 Datanodes perform read-write operations on the file
systems, as per client request.
 They also perform operations such as block creation,
deletion, and replication according to the instructions of
the namenode.
Block
 Generally the user data is stored in the files of HDFS.
 The file in a file system will be divided into one or
more segments and/or stored in individual data nodes.
 These file segments are called as blocks.
 In other words, the minimum amount of data that
HDFS can read or write is called a Block.
 The default block size is 64MB, but it can be increased
as per the need to change in HDFS configuration.
Reference
 http://www.tutorialspoint.com/hadoop/index.htm
 https://www.mongodb.com/big-data-explained
MapReduce
 MapReduce is a parallel programming model for writing distributed applications
 It was devised at Google for efficient processing of large amounts of data, on large clusters of commodity hardware, in a reliable, fault-tolerant manner.
 MapReduce programs run on Hadoop, which is an Apache open-source framework
The Algorithm
 Generally, the MapReduce paradigm is based on sending the computation to where the data resides
 A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the reduce stage.
Map stage
 The map or mapper’s job is to process the input data.
 The input data is in the form of file or directory and is
stored in the Hadoop file system (HDFS).
 The input file is passed to the mapper function line by
line.
 The mapper processes the data and creates several
small chunks of data.
Reduce stage
 This stage is the combination of the Shuffle stage and
the Reduce stage.
 The Reducer’s job is to process the data that comes
from the mapper.
 After processing, it produces a new set of output,
which will be stored in the HDFS.
Cont’d
 During a MapReduce job, Hadoop sends the Map and
Reduce tasks to the appropriate servers in the cluster.
 The framework manages all the details of data-passing
such as issuing tasks, verifying task completion, and
copying data around the cluster between the nodes.
 Most of the computing takes place on nodes with data
on local disks that reduces the network traffic.
 After completion of the given tasks, the cluster collects
and reduces the data to form an appropriate result,
and sends it back to the Hadoop server.
Typical problem solved by
MapReduce
 Read a lot of data
 Map
 Shuffle and Sort
 Reduce
 Write the results
Reference
 http://www.tutorialspoint.com/hadoop/index.htm
 https://www.mongodb.com/big-data-explained
Streaming
 Hadoop streaming is a utility that comes with the
Hadoop distribution.
 This utility allows you to create and run Map/Reduce
jobs with any executable or script as the mapper
and/or the reducer.
Example Using Python
 For Hadoop streaming, we consider the word-count problem.
 Any job in Hadoop must have two phases, mapper and reducer.
 The code for the mapper and the reducer is written as Python scripts to run under Hadoop.
 One can also write the same in Java, Perl and Ruby.
Mapper Phase Code
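A minimal sketch of the word-count mapper (the slide's original code is not in the text):

#!/usr/bin/env python
# mapper.py: emit "word<TAB>1" for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%s" % (word, 1))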
Reducer Phase Code
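A minimal sketch of the word-count reducer (the slide's original code is not in the text; it relies on Hadoop streaming sorting the mapper output by key before the reduce phase):

#!/usr/bin/env python
# reducer.py: sum the counts for each word; input arrives sorted by word
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    word, count = line.strip().split("\t", 1)
    count = int(count)
    if word == current_word:
        current_count += count
    else:
        if current_word is not None:
            print("%s\t%s" % (current_word, current_count))
        current_word = word
        current_count = count

if current_word is not None:
    print("%s\t%s" % (current_word, current_count))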
Execution of the WordCount Program
 Save the mapper and reducer code in mapper.py and reducer.py in the Hadoop home directory.
 Make sure these files have execution permission (chmod +x mapper.py and chmod +x reducer.py).
 Below is the command for executing the mapper and reducer code.
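A sketch of the streaming invocation (the jar path and version are assumptions):

$ hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar \
    -input input_dirs \
    -output output_dir \
    -mapper mapper.py \
    -reducer reducer.py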
Cont'd
 Here "\" is used for line continuation, for clear readability.
How Streaming Works
 Both the mapper and the reducer are python scripts
that read the input from standard input and emit the
output to standard output.
 The utility will create a Map/Reduce job, submit the
job to an appropriate cluster, and monitor the progress
of the job until it completes.
Mapper's Working
 When a script is specified for mappers, each mapper task will launch the script as a separate process when the mapper is initialized.
 As the mapper task runs, it converts its inputs into lines and feeds the lines to the standard input (STDIN) of the process.
 In the meantime, the mapper collects the line-oriented outputs from the standard output (STDOUT) of the process and converts each line into a key/value pair, which is collected as the output of the mapper.
 By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) will be the value.
 If there is no tab character in the line, then the entire line is considered as the key and the value is null.
 This can be customized, as needed.
Reducer's Working
 When a script is specified for reducers, each reducer task will launch the script as a separate process when the reducer is initialized.
 As the reducer task runs, it converts its input key/value pairs into lines and feeds the lines to the standard input (STDIN) of the process.
 In the meantime, the reducer collects the line-oriented outputs from the standard output (STDOUT) of the process and converts each line into a key/value pair, which is collected as the output of the reducer.
 By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value.
 This can be customized as per specific requirements.
Reference
 http://www.tutorialspoint.com/hadoop/index.htm
 https://www.mongodb.com/big-data-explained
Introduction to HBase
 HBase is a column-oriented database management system that runs on top of the Hadoop Distributed File System (HDFS).
 HBase is a database application that stores huge amounts of data and accesses the data in a random manner.
 It is an open-source project and is horizontally scalable.
Cont'd
 HBase is a data model, similar to Google's Bigtable, designed to provide quick random access to huge amounts of structured data.
 It is a part of the Hadoop ecosystem that provides random real-time read/write access to data in the Hadoop File System.
 One can store data in HDFS either directly or through HBase.
Cont'd
 Data consumers read/access the data in HDFS randomly using HBase.
 HBase sits on top of the Hadoop File System and provides read and write access.
Features of Hbase
 Linearly scalable.
 Automatic failure support.
 Consistent read and writes.
 Integrates with Hadoop, both as a source and a
destination.
 Easy java API for client.
 Provides data replication across clusters.
Where to Use HBase?
 Apache HBase is used to have random, real-time read/write access to Big Data.
 It hosts very large tables on top of clusters of commodity hardware.
 Apache HBase is a non-relational database modeled after Google's Bigtable. Just as Bigtable acts upon the Google File System, Apache HBase works on top of Hadoop and HDFS.
Applications of HBase
 It is used whenever there is a need for write-heavy applications.
 HBase is used whenever we need to provide fast random access to available data.
 Companies such as Facebook, Twitter, Yahoo!, and Adobe use HBase internally.
HBase Architecture
Setting and Running HBase
 Java and Hadoop are required to proceed with HBase, so you have to download and install Java and Hadoop on your system.
 Setting up HBase on Windows:
1. Download the Cygwin setup.exe and run it.
2. Choose an appropriate mirror. Cygwin will be installed into C:\Programs\Cygwin. Do not install Cygwin into a folder that contains a space character (such as C:\Program Files); if you do, you will face many random and unexpected troubles.
Cont’d
3. From the packages, choose the following:
 OpenSSH
 tcp_wrappers
 diffutils [this should be pre-selected]
 zlib
4. Proceed with the installation until it is finished.
 Configuring Cygwin:
1. Run the Cygwin Bash shell with Administrator privileges (C:\cygwin\Cygwin.bat).
Cont’d
2. From this Bash shell, run ssh-host-config:
 say “yes” to privilege separation
 say “yes” to create the sshd account
 say “yes” to install sshd as a service
 press Enter to give an empty value of CYGWIN for the daemon
 Now Cygwin needs to create a new account that will be used as a “proxy”/setuid origin account. Say “no” to use the default name (cyg_server).
Cont’d
 say “yes” to create a new privileged account cyg_server
 create a password for this new privileged account and confirm it
3. Synchronize Windows user accounts with Cygwin user accounts:
 mkpasswd -cl > /etc/passwd
 mkgroup --local > /etc/group
Cont’d
4. Start the SSH server with net start sshd.
5. Test the connection with ssh localhost from the Cygwin Bash shell:
 say “yes” to check and store the server fingerprint
 put in your Windows account password to authenticate
 issue a few test commands in the remote session
 close the session with exit.
6. Alternatively: test your SSHD with PuTTY.
Cont’d
 Configuring HBase:
1. Assume that you have the Java JDK installed in a folder without spaces in its name.
2. Download HBase from the Apache site. Unpack it into an appropriate folder; assume this is C:\java\hbase.
3. Open ./conf/hbase-env.sh in the HBase directory:
 uncomment and modify this line so it reads:
export JAVA_HOME=/cygdrive/c/java/jdk7
 uncomment and modify this line so it reads:
export HBASE_CLASSPATH=/cygdrive/c/java/hbase/lib/zookeeper-3.4.3.jar
4. Copy ./src/main/resources/hbase-default.xml to ./conf
HBase Shell
 HBase contains a shell using which you can communicate with HBase.
 HBase uses the Hadoop File System to store its data.
 It will have a master server and region servers.
 The data storage will be in the form of regions (tables). These regions will be split up and stored in region servers.
 The master server manages these region servers, and all these tasks take place on HDFS.
Cont’d
 General commands supported by the HBase shell are:
 status:
 Provides the status of HBase, for example, the number of
servers.
 version:
 Provides the version of HBase being used.
 table_help:
 Provides help for table-reference commands.
 whoami:
 Provides information about the user.
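 For instance, each of these can be issued directly at the shell prompt:

hbase(main):001:0> status
hbase(main):002:0> version
hbase(main):003:0> table_help
hbase(main):004:0> whoami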
HBase Security
 We can grant and revoke permissions to users in HBase.
 There are three commands for security purposes:
 grant,
 revoke,
 user_permission.
 grant:
 The grant command grants specific rights such as read, write, execute, and admin on a table to a certain user. The syntax of the grant command is as follows:
 hbase> grant <user> <permissions> [<table> [<column family> [<column qualifier>]]]
Cont’d
 We can grant zero or more privileges to a user from the set of RWXCA, where:
 R - represents read privilege.
 W - represents write privilege.
 X - represents execute privilege.
 C - represents create privilege.
 A - represents admin privilege.
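 For example, the following grants read and write permissions on the ‘emp’ table to a hypothetical user ‘ali’:

hbase(main):001:0> grant 'ali', 'RW', 'emp'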
Cont’d
 revoke:
 The revoke command is used to revoke a user's access rights to a table. Its syntax is as follows:
 hbase> revoke <user>
 user_permission:
 This command is used to list all the permissions for a particular table. The syntax of user_permission is as follows:
 hbase> user_permission ‘tablename’
 The following command lists all the user permissions on the ‘emp’ table:
 hbase(main):013:0> user_permission 'emp'
HBase DDL
 create: Creates a table.
 list: Lists all the tables in HBase.
 disable: Disables a table.
 is_disabled: Verifies whether a table is disabled.
 enable: Enables a table.
 is_enabled: Verifies whether a table is enabled.
 describe: Provides the description of a table.
 alter: Alters a table.
 exists: Verifies whether a table exists.
 drop: Drops a table from HBase.
 drop_all: Drops the tables matching the ‘regex’ given in
the command.
Cont’d
 Java Admin API: Behind all the above commands, Java provides an Admin API to achieve DDL functionalities through programming.
 Under the org.apache.hadoop.hbase.client package, HBaseAdmin and HTableDescriptor are the two important classes that provide DDL functionalities.
DDL Python Example Code
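 A minimal sketch of the same DDL operations in Python, assuming the third-party happybase library and an HBase Thrift server running on localhost (both are assumptions, not stated in the slides):

# DDL operations via happybase (a Thrift-based HBase client for Python)
import happybase

connection = happybase.Connection('localhost')  # default Thrift port 9090

# create: a table named 'emp' with two column families
connection.create_table('emp', {'personal': dict(), 'professional': dict()})

# list: all table names known to HBase
print(connection.tables())

# disable / enable the table
connection.disable_table('emp')
connection.enable_table('emp')

# is_enabled: verify whether the table is enabled
print(connection.is_table_enabled('emp'))

# drop: delete the table (it must be disabled first)
connection.delete_table('emp', disable=True)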
HBase DML
 put: Puts a cell value at a specified column in a specified row in a
particular table.
 get: Fetches the contents of row or a cell.
 delete: Deletes a cell value in a table.
 deleteall: Deletes all the cells in a given row.
 scan: Scans and returns the table data.
 count: Counts and returns the number of rows in a table.
 truncate: Disables, drops, and recreates a specified table.
 Java client API: Behind all the above commands, Java provides a client API that achieves DML functionalities, i.e., CRUD (Create, Retrieve, Update, Delete) operations and more, through programming under the org.apache.hadoop.hbase.client package.
 HTable, Put, and Get are the important classes in this package.
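 As with DDL, a minimal Python sketch of the main DML operations, again assuming happybase and a running Thrift server (row keys and cell values below are illustrative):

import happybase

connection = happybase.Connection('localhost')
table = connection.table('emp')

# put: write cell values into row 'row1'
table.put(b'row1', {b'personal:name': b'raju', b'personal:city': b'hyderabad'})

# get: fetch the contents of a row
print(table.row(b'row1'))

# delete: remove a single cell value
table.delete(b'row1', columns=[b'personal:city'])

# deleteall: remove the whole row
table.delete(b'row1')

# count: iterate a scan and count the remaining rows
print(sum(1 for _ in table.scan()))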
HBase Scan & Python Example Code
 The scan command is used to view the data in
HTable. Using the scan command, you can get the
table data. Its syntax is as follows:
 scan ‘<table name>’
 The following example shows how to read data
from a table using the scan command. Here we are
reading the emp table.
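 At the shell prompt:

hbase(main):010:0> scan 'emp'

 A Python version of the same scan (a sketch, assuming the happybase library as in the earlier DDL example):

import happybase

connection = happybase.Connection('localhost')
table = connection.table('emp')

# Scan only the required columns, mirroring the Java example below
for key, data in table.scan(columns=[b'personal:name', b'personal:city']):
    print(key, data)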
Cont’d
 Scanning Using the Java API:
 The complete program to scan the entire table data using the Java API is as follows:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanTable {

   public static void main(String args[]) throws IOException {

      // Instantiating Configuration class
      Configuration config = HBaseConfiguration.create();

      // Instantiating HTable class
      HTable table = new HTable(config, "emp");

      // Instantiating the Scan class
      Scan scan = new Scan();

      // Scanning the required columns
      scan.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"));
      scan.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("city"));

      // Getting the scan result
      ResultScanner scanner = table.getScanner(scan);

      // Reading values from scan result
      for (Result result = scanner.next(); result != null; result = scanner.next())
         System.out.println("Found row : " + result);

      // Closing the scanner
      scanner.close();
   }
}
 Compile and execute the above program as shown below.
$javac ScanTable.java
$java ScanTable
Cont’d
 The above compilation works only if you have set the classpath in “.bashrc”. If you haven't, follow the procedure given below to compile your .java file.

// if “/home/hadoop/hbase” is your HBase home folder, then:
$javac -cp /home/hadoop/hbase/lib/*: ScanTable.java

 If everything goes well, it will produce the following output:
Found row :
keyvalues={row1/personal:city/1418275612888/Put/vlen=5/mvcc=
0, row1/personal:name/1418035791555/Put/vlen=4/mvcc=0}