Big Data Lab Manual
Hadoop Architecture:
Hadoop Common: Contains Java libraries and utilities needed by other Hadoop modules. These libraries provide file system and OS level abstractions and comprise the essential Java files and scripts required to start Hadoop.
Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data on the commodity machines, thus providing very high aggregate bandwidth across the cluster.
Hadoop YARN: A resource-management framework responsible for job scheduling and cluster
resource management.
Hadoop MapReduce: A YARN-based programming model for the parallel processing of large data sets.
Hadoop has gained its popularity due to its ability to store, analyze and access large amounts of data quickly and cost-effectively across clusters of commodity hardware. It would not be wrong to say that Apache Hadoop is actually a collection of several components and not just a single product.
The Hadoop ecosystem includes several commercial as well as open-source products that are broadly used to make Hadoop accessible to laymen and more usable.
MapReduce
Hadoop MapReduce is a software framework for easily writing applications which process large amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. In terms of programming, there are two functions which are most common in MapReduce.
• The Map Task: The master node takes the input, divides it into smaller parts and distributes them to the worker nodes. Each worker node solves its own small problem and returns the answer to the master node.
• The Reduce Task: The master node combines all the answers coming from the worker nodes into a single output, which is the answer to our big distributed problem.
Generally both the input and the output are stored in a file system. The framework is responsible for scheduling tasks, monitoring them and re-executing failed tasks.
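In Hadoop's Java API these two tasks correspond to subclasses of Mapper and Reducer. The minimal sketch below is an illustration only: the class names MyMapper and MyReducer and the simple counting logic are placeholders, not part of this manual.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// The Map Task: runs on worker nodes and turns one piece of the input into
// intermediate <key, value> pairs.
class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // emit one intermediate pair derived from this line of input
        context.write(new Text("some-key"), new IntWritable(1));
    }
}

// The Reduce Task: receives all values for one key and combines them into the
// final answer for that key.
class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable v : values) {
            total += v.get();   // combine the partial answers
        }
        context.write(key, new IntWritable(total));
    }
}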
HDFS is a distributed file system that provides high-throughput access to data. When data is pushed to HDFS, it is automatically split into multiple blocks and stored/replicated, thus ensuring high availability and fault tolerance.
Note: A file consists of many blocks (large blocks of 64 MB and above).
• Name Node: It acts as the master of the system. It maintains the name system, i.e. the directories and files, and manages the blocks which are present on the Data Nodes.
• Data Nodes: They are the slaves which are deployed on each machine and
provide the actual storage. They are responsible for serving read and write requests for
the clients.
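The block and replica placement described above can be inspected from the command line; the file path below is only an illustration.

$ hdfs fsck /user/hadoop/example.txt -files -blocks -locations
$ hdfs dfsadmin -report

The first command lists the blocks of the file and the Data Nodes holding each replica; the second prints a report of the Data Nodes known to the Name Node.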
Hive
Hive is part of the Hadoop ecosystem and provides an SQL-like interface to Hadoop. It is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems.
HBase(Hadoop DataBase)
HBase is a distributed, column-oriented database that uses HDFS for its underlying storage. As said earlier, HDFS works on a write-once, read-many-times pattern, but this is not always the case. We may require real-time read/write random access to a huge dataset; this is where HBase comes into the picture. HBase is built on top of HDFS as a distributed, column-oriented database.
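As a quick illustration of this random read/write access, a minimal HBase shell session might look like the following (the table name, column family and values are illustrative only):

hbase shell
create 'weblog', 'stats'                # table 'weblog' with column family 'stats'
put 'weblog', 'row1', 'stats:hits', '42'
get 'weblog', 'row1'
scan 'weblog'
disable 'weblog'
drop 'weblog'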
Installation Steps –
$ java -version
If it returns "The program java can be found in the following packages", Java has not been installed yet, so execute the following command:
$ sudo apt-get install default-jdk
• Open the .bashrc file in gedit:
$ sudo gedit ~/.bashrc
• Set java environment variable
export JAVA_HOME=/usr/jdk1.7.0_45/
• Set Hadoop environment variable
export HADOOP_HOME=/usr/Hadoop2.6/
$ source ~/.bashrc
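For reference, the complete set of lines added to ~/.bashrc typically looks like the following. The JAVA_HOME and HADOOP_HOME paths are the ones used above; the PATH line is a common addition and not part of the original steps, so adjust all paths to your own installation:

export JAVA_HOME=/usr/jdk1.7.0_45/
export HADOOP_HOME=/usr/Hadoop2.6/
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

After sourcing the file, the installation can be checked with:
$ java -version
$ hadoop version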
Step 3: Install Eclipse
Step 4: Copy a Hadoop Eclipse plug-in such as
• hadoop-eclipse-kepler-plugin-2.2.0.jar
• hadoop-eclipse-kepler-plugin-2.4.1.jar
• hadoop-eclipse-plugin-2.6.0.jar
from the release folder of hadoop2x-eclipse-plugin-master to the Eclipse plugins directory.
File -> New -> Other -> MapReduce Project
Step 7: Create the Mapper, Reducer, and Driver:
Inside a project -> src -> File -> New -> Other -> Mapper/Reducer/Driver
Hadoop is powerful because it is extensible and easy to integrate with any component. Its popularity is due in part to its ability to store, analyze and access large amounts of data, quickly and cost-effectively, across clusters of commodity hardware. Apache Hadoop is not actually a single product but instead a collection of several components. When all these components are combined, they make Hadoop very user friendly.
Practical-2
AIM :- Implement the following file management tasks in Hadoop
• Adding files and directories
• Retrieving files
• Deleting files
ALGORITHM:-
SYNTAX AND COMMANDS TO ADD, RETRIEVE AND DELETE DATA FROM HDFS
Before you can run Hadoop programs on data stored in HDFS, you'll need to put the data into HDFS first. Let's create a directory and put a file in it. HDFS has a default working directory of /user/$USER, where $USER is your login user name. This directory isn't automatically created for you, though, so let's create it with the mkdir command. For the purpose of illustration, we use chuck. You should substitute your user name in the example commands.
The Hadoop command get copies files from HDFS back to the local filesystem. To retrieve
example.txt, we can run the following command:
/home/lendi/Desktop/shakes/glossary /lendicse/
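For reference, the basic add, retrieve and delete commands look like the following. The file example.txt and the directory /user/chuck are only illustrations; substitute your own names:

$ hadoop fs -mkdir /user/chuck                # create the working directory in HDFS
$ hadoop fs -put example.txt /user/chuck      # add: copy a local file into HDFS
$ hadoop fs -ls /user/chuck                   # list the directory contents
$ hadoop fs -get /user/chuck/example.txt .    # retrieve: copy the file back to the local filesystem
$ hadoop fs -rm /user/chuck/example.txt       # delete the file from HDFS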
SAMPLE INPUT: Input can be any data in a structured, unstructured or semi-structured format.
EXPECTED OUTPUT:
Practical-3
DESCRIPTION:
We can represent a matrix as a relation (table) in an RDBMS, where each cell in the matrix is represented as a record (i, j, value). As an example, consider such a matrix and its relational representation. It is important to understand that this relation is a very inefficient one if the matrix is dense. Let us say we have 5 rows and 6 columns; then we need to store only 30 values. But in the relation above we are storing 30 row ids, 30 column ids and 30 values, in other words we are tripling the data. So a natural question arises: why do we need to store data in this format? In practice most matrices are sparse. In a sparse matrix not all cells hold values, so we do not have to store those cells in the database, and the relational format turns out to be very efficient for storing such matrices.
MapReduce Logic
The logic is to send the calculation of each output cell of the result matrix to a reducer. In matrix multiplication, the first cell of the output, (0,0), is the multiplication and summation of the elements from row 0 of matrix A and the elements from column 0 of matrix B. To do the computation of the value in output cell (0,0) of the resultant matrix in a separate reducer, we need to use (0,0) as the output key of the map phase, and the value should hold the array of values from row 0 of matrix A and column 0 of matrix B. So in this algorithm the output of the map phase should be a <key, value> pair, where the key represents the output cell location, (0,0), (0,1), etc., and the value is the list of all values required for the reducer to do the computation. Let us take the example of calculating the value at output cell (0,0). Here we need to collect the values from row 0 of matrix A and column 0 of matrix B in the map phase and pass (0,0) as the key, so that a single reducer can do the calculation.
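A simplified, non-blocked sketch of this logic in Java is given below. It is an illustration only and is not the blocked algorithm described in the steps that follow: it assumes each input line carries a matrix tag and indices such as "A,i,k,value" or "B,k,j,value", and that the dimensions are passed in the job configuration under the keys "I", "K" and "J" (all of these conventions are assumptions).

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: each element of A is sent to every output cell of its row i,
// each element of B is sent to every output cell of its column j.
class MatrixMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        int I = conf.getInt("I", 0);   // number of rows of A
        int J = conf.getInt("J", 0);   // number of columns of B
        String[] t = line.toString().split(",");   // e.g. "A,0,3,4.5"
        if (t[0].equals("A")) {
            int i = Integer.parseInt(t[1]), k = Integer.parseInt(t[2]);
            for (int j = 0; j < J; j++)
                context.write(new Text(i + "," + j), new Text("A," + k + "," + t[3]));
        } else {
            int k = Integer.parseInt(t[1]), j = Integer.parseInt(t[2]);
            for (int i = 0; i < I; i++)
                context.write(new Text(i + "," + j), new Text("B," + k + "," + t[3]));
        }
    }
}

// Reducer: for one output cell (i,j) it receives row i of A and column j of B,
// matches them on k and sums the products.
class MatrixReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text cell, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        int K = context.getConfiguration().getInt("K", 0);   // shared dimension
        double[] a = new double[K], b = new double[K];
        for (Text v : values) {
            String[] t = v.toString().split(",");
            int k = Integer.parseInt(t[1]);
            if (t[0].equals("A")) a[k] = Double.parseDouble(t[2]);
            else                  b[k] = Double.parseDouble(t[2]);
        }
        double sum = 0;
        for (int k = 0; k < K; k++) sum += a[k] * b[k];
        context.write(cell, new Text(Double.toString(sum)));
    }
}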
ALGORITHM
We assume that the input files for A and B are streams of (key,value) pairs in sparse matrix
format, where each key is a pair of indices (i,j) and each value is the corresponding matrix
element value. The output files for matrix C=A*B are in the same format.
The path of the directory for the output files for matrix C.
strategy = 1, 2, 3 or 4.
In the pseudo-code for the individual strategies below, we have intentionally avoided factoring
common code for the purposes of clarity.
Note that in all the strategies the memory footprint of both the mappers and the reducers is flat at scale.
Note that the strategies all work reasonably well with both dense and sparse matrices. For sparse matrices we do not emit zero elements.
Steps
1. setup ()
8. emit (i/IB, k/KB, jb, 0), (i mod IB, k mod KB, a(i,k))
emit (ib, k/KB, j/JB, 1), (k mod KB, j mod JB, b(k,j))
Intermediate keys (ib, kb, jb, m) sort in increasing order first by ib, then by kb, then by jb, then
by m. Note that m = 0 for A data and m = 1 for B data.
The partitioner maps intermediate key (ib, kb, jb, m) to a reducer r as follows:
11. r = ((ib*JB + jb)*KB + kb) mod R
12. These definitions for the sorting order and partitioner guarantee that each reducer R[ib,kb,jb]
receives the data it needs for blocks A[ib,kb] and B[kb,jb], with the data for the A block
immediately preceding the data for the B block.
19. sib = ib
20. skb = kb
33. sum = 0
34. for 0 <= k < column dimension of A = row dimension of B
    a. sum += A(i,k)*B(k,j)
INPUT:-
Set of Data sets over different Clusters are taken as Rows and Columns
OUTPUT:
Practical-4
DESCRIPTION:
Climate change has been attracting a lot of attention for a long time. The adverse effects of the changing climate are being felt in every part of the earth. There are many examples of this, such as rising sea levels, less rainfall and increasing humidity. The proposed system overcomes some of the issues faced by other techniques. In this project we use the concept of Big Data and Hadoop. In the proposed architecture we are able to process offline data stored by the National Climatic Data Centre (NCDC). Through this we are able to find the maximum and minimum temperature of a year and to predict the future weather. Finally, we plot a graph of the obtained MAX and MIN temperature for each month of the particular year to visualize the temperature. Based on the data of previous years, the weather of the coming year is predicted.
ALGORITHM:-
The program follows the same structure as Word Count, a simple program which counts the number of occurrences of each word in a given text input dataset and which fits very well with the MapReduce programming model, making it a great example for understanding the Hadoop Map/Reduce programming style. Our implementation consists of three main parts:
• Mapper
• Reducer
• Main program
Step-1. Write a Mapper
• Pseudo-code
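The mapper pseudo-code is sketched below. It is an illustration only: purely for the example it assumes that each input line has the simple form "YYYY-MM-DD,temperature" (real NCDC records need their own parsing), and it emits each reading under both a max_temp and a min_temp key for the month so that the reducer can aggregate them.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TemperatureMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // assumed record layout, e.g. "2013-07-14,36"
        String[] parts = line.toString().split(",");
        String month = parts[0].substring(5, 7);
        int temperature = Integer.parseInt(parts[1].trim());
        // emit the reading under both keys; the reducer keeps the maximum
        // for max_temp_* keys and the minimum for min_temp_* keys
        context.write(new Text("max_temp_" + month), new IntWritable(temperature));
        context.write(new Text("min_temp_" + month), new IntWritable(temperature));
    }
}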
Step-2. Write a Reducer
A Reducer collects the intermediate <key, value> output from multiple map tasks and assembles a single result. Here, the program aggregates the values for each key and emits pairs such as <max_temp, value> and <min_temp, value>.
Pseudo-code
sum += x; final_output.collect(max_temp, sum);
sum += x; final_output.collect(min_temp, sum);
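The fragments above accumulate with sum += x. A fuller sketch of a reducer that instead keeps the largest value for max_temp keys and the smallest for min_temp keys (matching the mapper sketch in Step-1 above) could look like this; it is only one possible realisation, not the manual's reference code.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class TemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        boolean wantMax = key.toString().startsWith("max_temp");
        Integer best = null;
        for (IntWritable x : values) {
            int t = x.get();
            if (best == null || (wantMax ? t > best : t < best)) {
                best = t;   // keep the running maximum or minimum
            }
        }
        context.write(key, new IntWritable(best));
    }
}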
• Write a Driver
The Driver program configures and runs the MapReduce job. We use the main program to perform basic configurations such as:
• Executable (Jar) Class: the main executable class. Here, Word Count.
• Mapper Class: the class which overrides the "map" function. Here, Map.
• Reducer Class: the class which overrides the "reduce" function. Here, Reduce.
INPUT:-
OUTPUT:
Practical-5
AIM:- Run a basic Word Count Map Reduce Program to understand Map Reduce Paradigm
DESCRIPTION:
Map Reduce is the heart of Hadoop. It is this programming paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster. The MapReduce concept is fairly simple to understand for those who are familiar with clustered scale-out data processing
solutions. The term Map Reduce actually refers to two separate and distinct tasks that Hadoop
programs perform. The first is the map job, which takes a set of data and converts it into another
set of data, where individual elements are broken down into tuples (key/value pairs). The reduce
job takes the output from a map as input and combines those data tuples into a smaller set of
tuples. As the sequence of the name Map Reduce implies, the reduce job is always performed
after the map job.
ALGORITHM
Word Count is a simple program which counts the number of occurrences of each word in a
given text input dataset. Word Count fits very well with the Map Reduce programming model
making it a great example to understand the Hadoop Map/Reduce programming style. Our
implementation consists of three main parts:
• Mapper
• Reducer
• Driver
Step-1. Write a Mapper
Pseudo-code
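A sketch of the mapper in Java is given below; the class name and the whitespace tokenization are illustrative choices, not the manual's reference code.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // split the line into words and emit <word, 1> for each one
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}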
Step-2. Write a Reducer
A Reducer collects the intermediate <key, value> output from multiple map tasks and assembles a single result. Here, the Word Count program will sum up the occurrences of each word into pairs of the form <word, occurrence>.
Pseudo-code
sum+=x;
final_output.collect(keyword, sum);
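A sketch of the corresponding reducer, expanding the two lines above, might look like this (the class name is illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text keyword, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable x : values) {
            sum += x.get();                             // sum += x;
        }
        context.write(keyword, new IntWritable(sum));   // final_output.collect(keyword, sum);
    }
}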
Step-3. Write a Driver
The Driver program configures and runs the MapReduce job. We use the main program to perform basic configurations such as:
• Executable (Jar) Class: the main executable class. Here, Word Count.
• Mapper Class: the class which overrides the "map" function. Here, Map.
• Reducer Class: the class which overrides the "reduce" function. Here, Reduce.
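A sketch of such a driver, assuming the mapper and reducer classes sketched above and taking the input and output paths as command-line arguments, might look like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);                 // Executable (Jar) Class
        job.setMapperClass(WordCountMapper.class);                // Mapper Class
        job.setReducerClass(WordCountReducer.class);              // Reducer Class
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));     // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}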
INPUT:-
Output:
Practical-6
AIM:-
Install and Run Hive then use Hive to Create, alter and drop databases, tables, views, functions
and Indexes.
DESCRIPTION
Hive allows SQL developers to write Hive Query Language (HQL) statements that are similar to
standard SQL statements; now you should be aware that HQL is limited in the commands it
understands, but it is still pretty useful. HQL statements are broken down by the Hive service
into Map Reduce jobs and executed across a Hadoop cluster. Hive looks very much like
traditional database code with SQL access. However, because Hive is based on Hadoop and Map
Reduce operations, there are several key differences. The first is that Hadoop is intended for long
sequential scans, and because Hive is based on Hadoop, you can expect queries to have a very
high latency (many minutes). This means that Hive would not be appropriate for applications that
need very fast response times, as you would expect with a database such as DB2. Finally, Hive is
read-based and therefore not appropriate for transaction processing that typically involves a high
percentage of write operations.
ALGORITHM:
• Install MySQL-Server
• Configure the Hive metastore to use MySQL by adding the following properties to hive-site.xml:
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value> jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true
</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hadoop</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hadoop</value>
</property>
• Copy mysql-java-connector.jar to the hive/lib directory.
SYNTAX for HIVE Database Operations
DATABASE Creation:
CREATE DATABASE [IF NOT EXISTS] database_name;
Dropping a Database:
DROP DATABASE [IF EXISTS] database_name [RESTRICT|CASCADE];
TABLE Creation:
CREATE TABLE [IF NOT EXISTS] table_name
[(col_name data_type [COMMENT col_comment], ...)]
Syntax:
DROP VIEW view_name;
Functions in HIVE
INDEXES
[PARTITIONED BY (col_name, ...)]
[[ROW FORMAT ...] STORED AS ... | STORED BY ...]
[LOCATION hdfs_path] [TBLPROPERTIES (...)]
Creating Index
CREATE INDEX index_ip ON TABLE log_data(ip_address) AS
'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' WITH DEFERRED REBUILD;
SET hive.index.compact.file=/home/administrator/Desktop/big/metastore_db/tmp/index_ipaddress_result;
SET hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;
Dropping Index
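Putting the pieces above together, a minimal HQL session covering create, alter and drop for databases, tables, views and indexes might look like the following. The database, view and column names are illustrative; the index handler and the log_data(ip_address) example come from the commands above.

CREATE DATABASE IF NOT EXISTS weblogs;
USE weblogs;
CREATE TABLE log_data (ip_address STRING, request STRING, status INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
ALTER TABLE log_data ADD COLUMNS (referrer STRING);
CREATE VIEW error_logs AS SELECT ip_address, request FROM log_data WHERE status >= 500;
CREATE INDEX index_ip ON TABLE log_data (ip_address)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' WITH DEFERRED REBUILD;
ALTER INDEX index_ip ON log_data REBUILD;
DROP INDEX IF EXISTS index_ip ON log_data;
DROP VIEW IF EXISTS error_logs;
DROP TABLE IF EXISTS log_data;
DROP DATABASE IF EXISTS weblogs CASCADE;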
INPUT
OUTPUT
Practical-7
DESCRIPTION
Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. The
language for this platform is called Pig Latin. Pig can execute its Hadoop jobs in
MapReduce, Apache Tez, or Apache Spark. Pig Latin abstracts the programming from the Java MapReduce idiom into a notation which makes MapReduce programming high level, similar to that of SQL for RDBMSs. Pig Latin can be extended using User Defined Functions (UDFs) which the
user can write in Java, Python, JavaScript, Ruby or Groovy and then call directly from the
language.
Pig Latin is procedural and fits very naturally in the pipeline paradigm while SQL is instead
declarative. In SQL users can specify that data from two tables must be joined, but not what join
implementation to use (You can specify the implementation of JOIN in SQL, thus "... for many
SQL applications the query writer may not have enough knowledge of the data or enough
expertise to specify an appropriate join algorithm."). Pig Latin allows users to specify an
implementation or aspects of an implementation to be used in executing a script in several ways. In effect,
Pig Latin programming is similar to specifying a query execution plan, making it easier for
programmers to explicitly control the flow of their data processing task.
SQL is oriented around queries that produce a single result. SQL handles trees naturally, but has no built-in mechanism for splitting a data processing stream and applying different operators to each sub-stream. A Pig Latin script describes a directed acyclic graph (DAG) rather than a pipeline.
Pig Latin's ability to include user code at any point in the pipeline is useful for pipeline
development. If SQL is used, data must first be imported into the database, and then the
cleansing and transformation process can begin.
ALGORITHM
STEPS FOR INSTALLING APACHE PIG
• Grunt Shell
Grunt>
DATA = LOAD <CLASSPATH> USING PigStorage(DELIMITER) AS (ATTRIBUTE: DataType1, ATTRIBUTE: DataType2, ...)
• Describe Data
Describe DATA;
• DUMP Data
Dump DATA;
• FILTER Data
• GROUP Data
• Iterating Data
FOR_DATA = FOREACH DATA GENERATE GROUP AS GROUP_FUN, ATTRIBUTE = <VALUE>
• Sorting Data
• LIMIT Data
• JOIN Data
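A short Grunt shell session exercising the operations listed above might look like the following. The file paths, relation names and field names are illustrative only:

students = LOAD '/pig_data/students.txt' USING PigStorage(',')
           AS (id:int, name:chararray, city:chararray, marks:int);
DESCRIBE students;
DUMP students;
passed   = FILTER students BY marks >= 40;               -- FILTER
by_city  = GROUP passed BY city;                         -- GROUP
avgmarks = FOREACH by_city GENERATE group AS city,       -- FOREACH (iterating)
                                    AVG(passed.marks) AS avg_marks;
ranked   = ORDER avgmarks BY avg_marks DESC;             -- sorting
top3     = LIMIT ranked 3;                               -- LIMIT
cities   = LOAD '/pig_data/cities.txt' USING PigStorage(',')
           AS (city:chararray, state:chararray);
joined   = JOIN avgmarks BY city, cities BY city;        -- JOIN
DUMP top3;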
INPUT:
OUTPUT:
Practical-8
MongoDB
Step 1 — Update the system
Step 2 — Installing and Verifying MongoDB
Now we can install the MongoDB package itself.
sudo apt-get install -y mongodb
This command will install several packages containing the latest stable version of MongoDB, along with helpful management tools for the MongoDB server.
After package installation, MongoDB will be started automatically. You can check this by running the following command:
service mongod status
If MongoDB is running, you'll see an output like this (with a different process ID).
Output
mongod start/running, process 1611
You can also stop, start, and restart MongoDB using the service command (e.g. service mongod stop, service mongod start).
Commands
In MongoDB, use DATABASE_NAME is used to create a database. The command will create a new database if it does not exist; otherwise it will return the existing database.
Syntax:
use DATABASE_NAME
The dropDatabase() Method
Basic syntax of the dropDatabase() command is as follows:
db.dropDatabase()
This will delete the selected database. If you have not selected any database, it will delete the default 'test' database.
MongoDB's db.createCollection(name, options) method is used to create a collection.
Syntax:
db.createCollection(name, options)
In the command, name is the name of the collection to be created. options is a document used to specify the configuration of the collection.
The find() Method
To query data from a MongoDB collection, you need to use MongoDB's find() method.
Syntax
>db.COLLECTION_NAME.find()
MongoDB's update() and save() methods are used to update documents in a collection. The update() method updates values in the existing document, while the save() method replaces the existing document with the document passed to save().
MongoDB update() Method
Syntax
>db.COLLECTION_NAME.update(SELECTION_CRITERIA, UPDATED_DATA)
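A short mongo shell session that ties these commands together might look like the following. The database, collection and field names are illustrative only:

> use studentdb
> db.createCollection("students")
> db.students.insert({ name: "Ravi", marks: 78 })
> db.students.find()
> db.students.update({ name: "Ravi" }, { $set: { marks: 82 } })   // update one field
> db.students.save({ _id: ObjectId("507f191e810c19729de860ea"), name: "Ravi", marks: 85 })  // replace whole document
> db.students.find().pretty()
> db.dropDatabase()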