Configure Hadoop Cluster in Pseudo Distributed Mode. Try Hadoop Basic Commands
Practical 1
Aim:- Configure Hadoop cluster in pseudo distributed mode. Try Hadoop basic commands.
To start all the Hadoop daemons, run:
sbin/start-all.sh
To check that the Hadoop services are up and running, use the following command:
jps
rahul@rahul:~/hadoop-3.3.6-cdh5.3.2$ jps
2546 SecondaryNameNode
2404 DataNode
2295 NameNode
2760 ResourceManager
2874 NodeManager
4251 Jps
ls: This command is used to list all the files. Use -ls -R for a recursive listing; it is useful when we want the
hierarchy of a folder.
Example:
bin/hdfs dfs -ls /
It will print all the directories present in HDFS. The bin directory contains executables, so bin/hdfs means we
want the hdfs executable, particularly the dfs (Distributed File System) commands.
mkdir: To create a directory. In Hadoop dfs there is no home directory by default. So let’s first create it.
Example:
bin/hdfs dfs -mkdir /geeks => '/' means absolute path
bin/hdfs dfs -mkdir geeks2 => Relative path -> the folder will be created relative to the home directory.
touchz: It creates an empty file in HDFS.
Example:
bin/hdfs dfs -touchz /geeks/myfile.txt
copyFromLocal (or) put: To copy files/folders from local file system to hdfs store. This is the most
important command. Local filesystem means the files present on the OS.
Example:
bin/hdfs dfs -copyFromLocal <local source path> <HDFS destination>
copyToLocal (or) get: To copy files/folders from hdfs store to local file system.
Example:
bin/hdfs dfs -copyToLocal <HDFS source path> <local destination>
moveFromLocal: To cut-paste (move) a file/folder from the local file system to HDFS.
Example:
bin/hdfs dfs -moveFromLocal ../Desktop/cutAndPaste.txt /Rahul
cp: This command is used to copy files within hdfs. Let’s copy folder rahul to rahul_copied.
Example:
bin/hdfs dfs -cp /rahul /rahul_copied
mv: This command is used to move files within hdfs. Let's cut-paste the file myfile.txt from the rahul folder to
rahul_copied.
Example:
bin/hdfs dfs -mv /rahul/myfile.txt /rahul_copied
rmr: This command deletes a file from HDFS recursively. It is a very useful command when you want to
delete a non-empty directory.
Example:
bin/hdfs dfs -rmr /rahul_copied -> It will delete all the content inside the directory then the directory itself.
du: It shows the disk usage (size) of each file/directory under the given path.
Example:
bin/hdfs dfs -du /Rahul
dus: It shows the total (summarized) size of the given file/directory.
Example:
bin/hdfs dfs -dus /Rahul
stat: It will give the last modified time of directory or path. In short it will give stats of the directory or file.
Example:
bin/hdfs dfs -stat /Rahul
setrep: This command is used to change the replication factor of a file/directory in HDFS. By default it is 3
for anything which is stored in HDFS (as set by the dfs.replication property in hdfs-site.xml).
Example: To change the replication factor to 4 for the directory /Rahul stored in HDFS.
bin/hdfs dfs -setrep -R 4 /Rahul
Note: The -w means wait till the replication is completed. And -R means recursively, we use it for
directories as they may also contain many files and folders inside them.
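For example, to change the replication factor of a single file and wait until re-replication completes (the file path is only illustrative):
bin/hdfs dfs -setrep -w 3 /geeks/myfile.txt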
Practical 2
Aim:- Write MapReduce programs to (a) count words in a text file, (b) find the year-wise maximum temperature from a weather dataset, and (c) count the sub-patents associated with each patent.
Map Function – Takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (Key, Value).
Output:
(Bus,1), (Car,1), (bus,1), (car,1), (train,1), (car,1), (bus,1), (car,1), (train,1), (bus,1),
(TRAIN,1), (BUS,1), (buS,1), (caR,1), (CAR,1), (car,1), (BUS,1), (TRAIN,1)
Reduce Function – Takes the output from Map as an input and combines those data tuples into a smaller set of
tuples.
Input
(output of Map function)
Set of Tuples
(Bus,1), (Car,1), (bus,1), (car,1), (train,1), (car,1), (bus,1), (car,1), (train,1), (bus,1), (TRAIN,1),(BUS,1),
(buS,1), (caR,1), (CAR,1), (car,1), (BUS,1), (TRAIN,1)
Output
Converts into smaller set of tuples
(BUS,7), (CAR,7), (TRAIN,4)
Workflow of the Program
Workflow of MapReduce consists of 5 steps:
Splitting – The splitting parameter can be anything, e.g. splitting by space, comma, semicolon, or even by a new line ('\n').
Mapping – As soon as splitting is done, each split is passed to the mapping function, which produces the (Key, Value) pairs shown above.
Intermediate splitting – the entire process runs in parallel on different clusters. In order to group them in the "Reduce Phase", data with the same KEY should be on the same cluster.
Reduce – It accepts the grouped (Key, Value) pairs and combines them into the final counts.
Combining – The last phase where all the data (individual result set from each cluster) is combined together to form a result.
Steps
1. Open Eclipse > File > New > Java Project > (Name it – MRProgramsDemo) > Finish.
2. Right Click > New > Package (Name it – PackageDemo) > Finish.
3. Add the following reference libraries (Right Click on Project > Build Path > Add External Archives):
i. /usr/lib/hadoop-0.20/hadoop-core.jar
ii. /usr/lib/hadoop-0.20/lib/commons-cli-1.2.jar
package PackageDemo;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class WordCount {
public static void main(String [] args) throws Exception
{
Configuration c=new Configuration();
String[] files=new GenericOptionsParser(c,args).getRemainingArgs();
Path input=new Path(files[0]);
Path output=new Path(files[1]);
Job j=new Job(c,"wordcount");
j.setJarByClass(WordCount.class);
j.setMapperClass(MapForWordCount.class);
j.setReducerClass(ReduceForWordCount.class);
j.setOutputKeyClass(Text.class);
j.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(j, input);
FileOutputFormat.setOutputPath(j, output);
System.exit(j.waitForCompletion(true)?0:1);
}
public static class MapForWordCount extends Mapper<LongWritable, Text, Text, IntWritable>{
public void map(LongWritable key, Text value, Context con) throws IOException, InterruptedException
{
String line = value.toString();
String[] words=line.split(",");
for(String word: words )
{
Text outputKey = new Text(word.toUpperCase().trim());
IntWritable outputValue = new IntWritable(1);
con.write(outputKey, outputValue);
}
}
}
public static class ReduceForWordCount extends Reducer<Text, IntWritable, Text, IntWritable>
{
public void reduce(Text word, Iterable<IntWritable> values, Context con) throws IOException, InterruptedException
{
int sum = 0;
for(IntWritable value : values)
{
sum += value.get();
}
con.write(word, new IntWritable(sum));
}
}
}
The above program consists of three classes:
Driver class (the public static void main method; this is the entry point).
The Map class which extends the public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> and
implements the Map function.
The Reduce class which extends the public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>and
implements the Reduce function.
Right Click on Project> Export> Select export destination as Jar File > next> Finish.
To move this into Hadoop directly, open the terminal and enter the following commands:
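The exact commands depend on your paths; a typical sequence (assuming the project was exported as MRProgramsDemo.jar and the input file is named WCFile.txt, both names being illustrative) is:
bin/hdfs dfs -put WCFile.txt /
bin/hadoop jar MRProgramsDemo.jar PackageDemo.WordCount /WCFile.txt /WordCountOutput
bin/hdfs dfs -ls /WordCountOutput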
Found 3 items
b. Find year wise maximum temperature using the weather data set which consists of
year, month, and temperature.
Step 2:
Below is an example of our dataset, where columns 6 and 7 show the maximum and minimum
temperature, respectively.
Step 3:
Make a project in Eclipse with below steps:
First open Eclipse -> then select File -> New -> Java Project -> name it MyProject -> then select 'Use
an execution environment' -> choose JavaSE-1.8 -> then Next -> Finish.
In this project, create a Java class with the name MyMaxMin -> then click Finish.
MyMaxMin.java
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.conf.Configuration;
public class MyMaxMin {
// Mapper
/*MaxTemperatureMapper class is static and extends Mapper abstract class having four Hadoop generics
type LongWritable, Text, Text, Text. */
/**
* @method map
* This method takes the input as a text data type.
* Leaving the first five tokens aside, the
* 6th token is taken as temp_max and the
* 7th token is taken as temp_min. Days with
* temp_max > 30 or temp_min < 15 are
* passed to the reducer.
*/
public static class MaxTemperatureMapper extends Mapper<LongWritable, Text, Text, Text> {
@Override
public void map(LongWritable arg0, Text Value, Context context) throws IOException, InterruptedException {
// Converting the record (single line) to String
String line = Value.toString();
if (!line.isEmpty()) {
// Tokenizing on whitespace; token 2 is the date, tokens 6 and 7 are the max/min temperature
String[] tokens = line.trim().split("\\s+");
String date = tokens[1];
float temp_Max = Float.parseFloat(tokens[5]);
float temp_Min = Float.parseFloat(tokens[6]);
if (temp_Max > 30.0) {
// Hot day
context.write(new Text("The Day is Hot Day :" + date),
new Text(String.valueOf(temp_Max)));
}
if (temp_Min < 15) {
// Cold day
context.write(new Text("The Day is Cold Day :" + date),
new Text(String.valueOf(temp_Min)));
}
}
}
}
// Reducer
/*MaxTemperatureReducer class is static and extends Reducer abstract class having four Hadoop
generics type Text, Text, Text, Text. */
/**
* @method reduce
* This method takes the input as key and
* list of values pair from the mapper,
* it does aggregation based on keys and
* produces the final context.
*/
public static class MaxTemperatureReducer extends Reducer<Text, Text, Text, Text> {
public void reduce(Text Key, Iterator<Text> Values, Context context) throws IOException, InterruptedException {
// Emit the key together with the first temperature value reported for it
String temperature = Values.next().toString();
context.write(Key, new Text(temperature));
}
}
/**
* @method main
* This method is used for setting
* all the configuration properties.
* It acts as a driver for map-reduce
* code.
*/
public static void main(String[] args) throws Exception {
// Reads the default configuration of the cluster from the configuration xml files
Configuration conf = new Configuration();
Job job = new Job(conf, "weather example");
job.setJarByClass(MyMaxMin.class);
// Setting the mapper, reducer and output key/value classes
job.setMapperClass(MaxTemperatureMapper.class);
job.setReducerClass(MaxTemperatureReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
// Defining input Format class which is responsible to parse the dataset into a key value pair
job.setInputFormatClass(TextInputFormat.class);
// Defining output Format class which is responsible to parse the dataset into a key value pair
job.setOutputFormatClass(TextOutputFormat.class);
// Configuring the input path from the filesystem into the job
FileInputFormat.addInputPath(job, new Path(args[0]));
// Configuring the output path from the filesystem into the job
Path OutputPath = new Path(args[1]);
FileOutputFormat.setOutputPath(job, OutputPath);
// Deleting the output path automatically from hdfs so that we don't have to delete it explicitly
OutputPath.getFileSystem(conf).delete(OutputPath);
// Exiting the job only if the flag value becomes false
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Now we need to add external jars for the packages that we have imported. Download the jar packages Hadoop
Common and Hadoop MapReduce Core according to your Hadoop version.
You can check Hadoop Version:
hadoop version
Now we add these external jars to our MyProject. Right Click on MyProject -> then select Build Path ->
Click on Configure Build Path, select Add External JARs…, add the jars from their download location
and then click Apply and Close.
Now export the project as a jar file. Right-click on MyProject, choose Export, go to Java -> JAR file,
click Next and choose your export destination, then click Next. Choose the Main Class as MyMaxMin by
clicking Browse and then click Finish -> Ok.
Start the Hadoop daemons:
start-dfs.sh
start-yarn.sh
Step 6: Now run your jar file with the below command and produce the output in the MyOutput file.
Syntax:
hadoop jar /jar_file_location /dataset_location_in_HDFS /output-file_name
Command:
hadoop jar /home/dikshant/Documents/Project.jar /CRND0103-2020-AK_Fairbanks_11_NE.txt /MyOutput
Step 7: Now browse to localhost:50070/, under Utilities select 'Browse the file system' and download part-r-00000
from the /MyOutput directory to see the result.
In the above image, you can see the top 10 results showing the cold days. The second column is the day in
yyyy/mm/dd format; for example, year = 2020, month = 01, date = 01.
c. Patent data files consist of patent id and sub-patent id. One patent is associated with
multiple sub-patents. Write a MapReduce code to find the total number of sub-patents associated
with each patent.
/* The Patent program finds the number of sub-patents associated with each id in the provided input file. We
write a map reduce code to achieve this, where the mapper makes key-value pairs from the input file and the reducer
does aggregation on these key-value pairs. */
public class Patent {
//Mapper
/* Map class is static and extends the Mapper abstract class having four Hadoop generics type LongWritable, Text, Text, Text. */
public static class Map extends Mapper<LongWritable, Text, Text, Text> {
Text k = new Text();
Text v = new Text();
/* This method takes the input as a text data type and tokenizes the input by taking whitespace as the delimiter. A
key-value pair is then made and passed to the reducer. @method_arguments key, value, context
@return void */
@Override
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
//Converting the record (single line) to String and storing it in a String variable line
String line = value.toString();
//Tokenizing the line on whitespace
StringTokenizer tokenizer = new StringTokenizer(line, " ");
//Iterating through all the tokens and forming the key value pair
while (tokenizer.hasMoreTokens()) {
/* The first token goes into jiten, the second token into jiten1, the third into jiten, the fourth into jiten1, and so on. */
String jiten = tokenizer.nextToken();
k.set(jiten);
String jiten1 = tokenizer.nextToken();
v.set(jiten1);
//Emitting the (patent id, sub-patent id) pair to the reducer
context.write(k, v);
}
}
}
//Reducer
/* Reduce class is static and extends the Reducer abstract class having four Hadoop generics type Text, Text, Text, IntWritable. */
public static class Reduce extends Reducer<Text, Text, Text, IntWritable> {
@Override
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
/* Iterates through all the values available with a key, counts them and gives the final result as the key
and the count of its values */
int sum = 0;
for (Text x : values)
{
sum++;
}
//Emitting the patent id and the number of sub-patents associated with it
context.write(key, new IntWritable(sum));
}
}
/* Driver
* This method is used for setting all the configuration properties. It acts as a driver for the map reduce code.
* @return void, @method_arguments args, @throws Exception */
public static void main(String[] args) throws Exception {
//Reads the default configuration of cluster from the configuration xml files
Configuration conf = new Configuration();
//Initializing the job with the default configuration of the cluster
Job job = new Job(conf, "patent");
job.setJarByClass(Patent.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
//Explicitly setting the output key/value type from the mapper since it is not the same as that of the reducer
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
//Defining the output key class for the final output i.e. from reducer
job.setOutputKeyClass(Text.class);
//Defining the output value class for the final output i.e. from reducer
job.setOutputValueClass(IntWritable.class);
//Defining input Format class which is responsible to parse the dataset into a key value pair
job.setInputFormatClass(TextInputFormat.class);
//Defining output Format class which is responsible to parse the final key-value output from MR framework to
a text file into the hard disk
job.setOutputFormatClass(TextOutputFormat.class);
//Configuring the input/output path from the filesystem into the job
FileInputFormat.addInputPath(job, new Path(args[0]));
Path outputPath = new Path(args[1]);
FileOutputFormat.setOutputPath(job, outputPath);
//Deleting the output path automatically from hdfs so that we don't have to delete it explicitly
outputPath.getFileSystem(conf).delete(outputPath);
//Exiting the job only if the flag value becomes false
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Output:
Patent    Number of Associated Sub-patents
1         13
2         10
3         4
Practical 3
Aim:- Implement MapReduce word count using a custom Partitioner and a Combiner.
Using Partitioner
Input: aa bb cc dd ee aa ff bb cc dd ee ff
Program:
package org.example.wordcount;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
jobex.setInputFormatClass(TextInputFormat.class);
jobex.setOutputFormatClass(TextOutputFormat.class);
FileSystem fs = FileSystem.newInstance(getConf());
if(fs.exists(outputFilePath)) {
fs.delete(outputFilePath, true);
}
return jobex.waitForCompletion(true) ? 0: 1;
}
}
return 1;
} else if (numPartitions == 1) return 0;
}
}
}
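Only a fragment of the custom partitioner survives above. A minimal sketch of what such a class could look like (the class name and the hash-based rule are assumptions, not the original code; it also needs import org.apache.hadoop.mapreduce.Partitioner):
public static class WordPartitioner extends Partitioner<Text, IntWritable> {
@Override
public int getPartition(Text key, IntWritable value, int numPartitions) {
// With a single reducer there is only one partition to send data to.
if (numPartitions == 1) {
return 0;
}
// Otherwise spread the keys across the available partitions by their hash.
return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
}
}
It would be registered on the job with jobex.setPartitionerClass(WordPartitioner.class) together with jobex.setNumReduceTasks(2).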
Output:
aa 2
bb 2
cc 2
dd 2
ee 2
ff 2
Using Combiner
Input: aa bb cc dd ee aa ff bb cc dd ee ff
Program:
package org.example.wordcount;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
FileSystem fs = FileSystem.newInstance(getConf());
if(fs.exists(outputFilePath)) {
fs.delete(outputFilePath, true);
}
return jobex.waitForCompletion(true) ? 0: 1;
}
}
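The line that distinguishes this version is registering a combiner on the job. Assuming the reducer class in the full program is named WordCountReducer (the name is not visible in the surviving fragment), the registration would look like:
// Reuse the reducer as a combiner so that partial sums are computed on the map side
jobex.setCombinerClass(WordCountReducer.class);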
Output:
aa 2
bb 2
cc 2
dd 2
ee 2
ff 2
Practical 4
Aim:- Install Hadoop in single-node mode and then configure a multi-node Hadoop cluster.
Pre-requisite:
OS: UBUNTU 14.04 LTS
FRAMEWORK: Hadoop 2.7.3
JAVA VERSION: 1.7.0_131
Steps:
3. Download Hadoop from the hadoop.apache.org site and perform the following steps to
install it:
Gcet@gfl1-5:~$ tar -xvf hadoop-2.7.3.tar.gz
hadoop-2.7.3/share/hadoop/tools/lib/hadoop-extras-2.7.3.jar
hadoop-2.7.3/share/hadoop/tools/lib/asm-3.2.jar
hadoop-2.7.3/include/
hadoop-2.7.3/include/hdfs.h
hadoop-2.7.3/include/Pipes.hh
hadoop-2.7.3/include/TemplateFactory.hh
hadoop-2.7.3/include/StringUtils.hh
hadoop-2.7.3/include/SerialUtils.hh
hadoop-2.7.3/LICENSE.txt
hadoop-2.7.3/NOTICE.txt
hadoop-2.7.3/README.txt
Gcet@gfl1-5:~$ export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-i386
17/07/26 15:15:51 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your
platform... using builtin-java classes where applicable
17/07/26 15:15:52 INFO Configuration.deprecation: session.id is deprecated. Instead, use
dfs.metrics.session-id
17/07/26 15:15:52 INFO jvm.JvmMetrics: Initializing JVM Metrics with
processName=JobTracker, sessionId=
17/07/26 15:15:52 INFO input.FileInputFormat: Total input paths to process : 2
17/07/26 15:15:52 INFO mapreduce.JobSubmitter: number of splits:2
17/07/26 15:15:53 INFO mapreduce.JobSubmitter: Submitting tokens for job:
job_local1790612813_0001
17/07/26 15:15:53 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
17/07/26 15:15:53 INFO mapreduce.Job: Running job: job_local1790612813_0001
17/07/26 15:15:53 INFO mapred.LocalJobRunner: OutputCommitter set in config null
17/07/26 15:15:53 INFO output.FileOutputCommitter: File Output Committer Algorithm version
is 1
17/07/26 15:15:53 INFO mapred.LocalJobRunner: OutputCommitter is
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
17/07/26 15:15:53 INFO mapred.LocalJobRunner: Waiting for map tasks
17/07/26 15:15:53 INFO mapred.LocalJobRunner: Starting task:
attempt_local1790612813_0001_m_000000_0
17/07/26 15:15:53 INFO output.FileOutputCommitter: File Output Committer Algorithm version
is 1
17/07/26 15:15:53 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
.
.
.
.
17/07/26 15:15:55 INFO mapreduce.Job: Job job_local192240145_0002 running in uber mode :
false
17/07/26 15:15:55 INFO mapreduce.Job: map 100% reduce 100%
17/07/26 15:15:55 INFO mapreduce.Job: Job job_local192240145_0002 completed successfully
17/07/26 15:15:55 INFO mapreduce.Job: Counters: 30
File System Counters
FILE: Number of bytes read=1195494
FILE: Number of bytes written=2315812
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
Map-Reduce Framework
Map input records=0
Spilled Records=0
Shuffled Maps =1
GC time elapsed (ms)=10
Total committed heap usage (bytes)=854065152
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=98
File Output Format Counters
Bytes Written=8
Multi node cluster:
Steps:
STEP 1: Check the IP address of the master and slave machines.
Command: ip addr show (you can use the ifconfig command as well)
STEP 2: Disable the firewall restrictions.
Command: service iptables stop
STEP 3: Open hosts file to add master and data node with their respective IP addresses.
Same properties will be displayed in the master and slave hosts files.
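For example, the entries added on each machine might look like this (the IP addresses are placeholders):
192.168.56.101 master
192.168.56.102 slave1
192.168.56.103 slave2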
STEP 5: Create the SSH Key in the master node. (Press the enter button when it asks you to enter a filename to save the key.)
STEP 6: Copy the generated ssh key to master node’s authorized keys.
STEP 7: Copy the master node’s ssh key to slave’s authorized keys.
STEP 8: Download the Java 8 package and save the file in your home directory.
STEP 12: Add the Hadoop and Java paths in the bash file (.bashrc) on all
nodes. Open the .bashrc file. Now, add the Hadoop and Java paths as shown below:
For applying all these changes to the current terminal, execute the source command.
To make sure that Java and Hadoop have been properly installed on your system and can
be accessed through the terminal, execute the java -version and hadoop version commands.
STEP 13: Create the masters file and edit it as follows on both master and slave machines:
STEP 16: Edit core-site.xml on both master and slave machines as follows:
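The file contents are not reproduced here; a typical minimal core-site.xml for such a setup (the hostname 'master' and the port are assumptions, not the original values) would be:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://master:9000</value>
</property>
</configuration>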
STEP 17: Edit hdfs-site.xml on both master and slave machines; the end of the file, which sets the datanode directory, looks like this:
<value>/home/edureka/hadoop-2.7.3/datanode</value>
</property>
</configuration>
STEP 19: Copy mapred-site.xml from the template in the configuration folder and then edit mapred-site.xml on both master and slave machines as follows:
STEP 20: Edit yarn-site.xml on both master and slave machines as follows:
Command: sudo gedit /home/edureka/hadoop-2.7.3/etc/hadoop/yarn-site.xml
STEP 22: Start all the daemons from the master machine.
Command: ./sbin/start-all.sh
STEP 23: Check all the daemons running on both master and slave machines.
Command: jps
On master
On slave
At last, open the browser and go to master:50070/dfshealth.html on your master machine; this will
give you the NameNode interface. Scroll down and check the number of live nodes: if it is 2, you
have successfully set up a multi-node Hadoop cluster. If it is not 2, you might have missed one of
the steps mentioned above; go back, verify all the configurations again to find the issue, and then correct it.
Practical 5
Aim:- Install Sqoop and perform import and export operations between MySQL and HDFS.
You need to have Java installed on your system before installing Sqoop.
$ java -version
If Java is already installed on your system, you get to see the following response:-
If Java is not installed on your system, then follow the steps given below.
Download Java (JDK <latest version> - X64.tar.gz) by visiting the following link.
$ hadoop version
If Hadoop is already installed on your system, then you will get the following response:-
Hadoop 2.4.1
--
Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768
Compiled by hortonmu on 2013-10-07T06:28Z
Compiled with protoc 2.5.0
From source with checksum 79e53ce7994d1628b240f09af91e1af4
If Hadoop is not installed on your system, then proceed with the following steps −
Downloading Hadoop
Download and extract Hadoop 2.4.1 from Apache Software Foundation using the following commands.
We can download the latest version of Sqoop from the following link. For this tutorial, we are using version
1.4.5, that is, sqoop-1.4.5.bin__hadoop-2.0.4-alpha.tar.gz.
The following commands are used to extract the Sqoop tar ball and move it to the /usr/lib/sqoop directory.
$ tar -xvf sqoop-1.4.5.bin__hadoop-2.0.4-alpha.tar.gz
$ su
password:
# mv sqoop-1.4.5.bin__hadoop-2.0.4-alpha /usr/lib/sqoop
# exit
You have to set up the Sqoop environment by appending the following lines to ~/.bashrc file −
#Sqoop
export SQOOP_HOME=/usr/lib/sqoop export PATH=$PATH:$SQOOP_HOME/bin
$ source ~/.bashrc
To configure Sqoop with Hadoop, you need to edit the sqoop-env.sh file, which is placed in the
$SQOOP_HOME/conf directory.
First of all, redirect to the Sqoop config directory and copy the template file using the following commands:
$ cd $SQOOP_HOME/conf
$ mv sqoop-env-template.sh sqoop-env.sh
export HADOOP_COMMON_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=/usr/local/hadoop
The following commands are used to extract mysql-connector-java tarball and move mysql-connector-
java-5.1.30-bin.jar to /usr/lib/sqoop/lib directory.
# cd mysql-connector-java-5.1.30
# mv mysql-connector-java-5.1.30-bin.jar /usr/lib/sqoop/lib
$ cd $SQOOP_HOME/bin
$ sqoop-version
Expected output −
Sqoop is basically used to transfer data from relational databases such as MySQL and Oracle to the
Hadoop Distributed File System (HDFS). Thus, when data is transferred from a relational
database to HDFS, we say we are importing data. Conversely, when we transfer data from HDFS to
relational databases, we say we are exporting data.
Sqoop Import
Importing a Table
The Sqoop tool 'import' is used to import table data from the table to the Hadoop file system as a text file
or a binary file.
The following command is used to import the emp table from MySQL database server to HDFS.
$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp --m 1
30/06/23 10:28:22 INFO mapreduce.ImportJobBase: Transferred 145 bytes in 177.5849 seconds
(0.8165 bytes/sec)
30/06/23 10:25:23 INFO mapreduce.ImportJobBase: Retrieved 5 records.
It shows you the emp table data and fields are separated with comma (,).
We can specify the target directory while importing table data into HDFS using the Sqoop import tool.
Following is the syntax to specify the target directory as option to the Sqoop import command.
The following command is used to import emp_add table data into the '/queryresult' directory.
$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp_add \
--m 1 \
--target-dir /queryresult
The following command is used to verify the imported data in the /queryresult directory from the emp_add table.
It will show you the emp_add table data with comma (,) separated fields.
We can import a subset of a table using the 'where' clause in the Sqoop import tool. It executes the
corresponding SQL query in the respective database server and stores the result in a target directory in
HDFS.
The syntax for where clause is as follows.
--where <condition>
The following command is used to import a subset of emp_add table data. The subset query is to retrieve the
employee id and address, who lives in Secunderabad city.
$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp_add \
--m 1 \
--where "city ='sec-bad'" \
--target-dir /wherequery
The following command is used to verify the imported data in /wherequery directory from the emp_add
table.
It will show you the emp_add table data with comma (,) separated fields.
Incremental Import
Incremental import is a technique that imports only the newly added rows in a table. It is required to add the
'incremental', 'check-column', and 'last-value' options to perform the incremental import.
The following syntax is used for the incremental option in the Sqoop import command.
--incremental <mode>
--check-column <column name>
--last-value <last check column value>
Let us assume the newly added data into emp table is as follows:-
The following command is used to perform the incremental import in the emp table.
$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp \
--m 1 \
--incremental append \
--check-column id \
--last-value 1205
The following command is used to verify the imported data from emp table to HDFS emp/ directory.
The following command is used to see the modified or newly added rows from the emp table.
It shows you the newly added rows to the emp table with comma (,) separated fields.
Syntax:
$ sqoop import-all-tables (generic-args) (import-args)
Example: Let us take an example of importing all tables from the userdb database. The list of tables that the
database userdb contains is as follows.
The following command is used to import all the tables from the userdb database.
$ sqoop import-all-tables \
--connect jdbc:mysql://localhost/userdb \
--username root
Note − If you are using the import-all-tables, it is mandatory that every table in that database must have a
primary key field.
The following command is used to verify all the table data to the userdb database in HDFS.
$ $HADOOP_HOME/bin/hadoop fs –ls
It will show you the list of table names in userdb database as directories.
Output
Example
Let us take an example of the employee data in a file in HDFS. The employee data is available in the emp_data
file in the 'emp/' directory in HDFS.
The following query is used to create the table 'employee' in the mysql command line.
$ mysql
mysql> USE db;
mysql> CREATE TABLE employee (
id INT NOT NULL PRIMARY KEY,
name VARCHAR(20),
deg VARCHAR(20),
salary INT,
dept VARCHAR(10));
The following command is used to export the table data (which is in emp_data file on HDFS) to the
employee table in db database of Mysql database server.
$ sqoop export \
--connect jdbc:mysql://localhost/db \
--username root \
--table employee \
--export-dir /emp/emp_data
The following command is used to verify the table in mysql command line.
If the given data is stored successfully, then you can find the following table of given employee data.
+------+----------+-------------+--------+------+
| Id   | Name     | Designation | Salary | Dept |
+------+----------+-------------+--------+------+
| 1201 | gopal    | manager     | 50000  | TP   |
| 1202 | manisha  | preader     | 50000  | TP   |
| 1203 | kalil    | php dev     | 30000  | AC   |
| 1204 | prasanth | php dev     | 30000  | AC   |
| 1205 | kranthi  | admin       | 20000  | TP   |
| 1206 | satish p | grp des     | 20000  | GR   |
+------+----------+-------------+--------+------+
Sqoop job
Sqoop job creates and saves the import and export commands. It specifies parameters to identify and recall
the saved job. This re-calling or re-executing is used in the incremental import, which can import the
updated rows from RDBMS table to HDFS.
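For example, a saved import job for the employee table used above might be created as follows (the job name myjob is only an illustration):
$ sqoop job --create myjob \
-- import \
--connect jdbc:mysql://localhost/db \
--username root \
--table employee --m 1
The property listing shown below (direct.import, db.table, incremental.last.value, ...) is the kind of output a subsequent 'sqoop job --show myjob' prints.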
direct.import = true
codegen.input.delimiters.record = 0
hdfs.append.dir = false
db.table = employee
...
incremental.last.value = 1206...
Execute Job (--exec)
The '--exec' option is used to execute a saved job. The following command is used to execute a saved job called
myjob.
$ sqoop job --exec myjob
Practical 6
Aim:- Install Apache Hive, configure the metastore, and run basic Hive queries.
$ java -version
If Java is already installed on your system, you get to see the following response:
If java is not installed in your system, then follow the steps given below for installing java.
$ hadoop version
If Hadoop is already installed on your system, then you will get the following response:
If Hadoop is not installed on your system, then install and configure all steps.
Step 3: Downloading Hive
Let us assume it gets downloaded onto the /Downloads directory. Here, we download Hive archive named
“apache-hive-0.14.0-bin.tar.gz” for this tutorial. The following command is used to verify the download:
$ cd Downloads
$ ls
apache-hive-0.14.0-bin.tar.gz
The following steps are required for installing Hive on your system. Let us assume the Hive archive is
downloaded onto the /Downloads directory.
The following commands are used to extract and verify the archive:
$ tar zxvf apache-hive-0.14.0-bin.tar.gz
$ ls
apache-hive-0.14.0-bin apache-hive-0.14.0-bin.tar.gz
$ su -
passwd:
# cd /home/user/Download
# mv apache-hive-0.14.0-bin /usr/local/hive
# exit
You can set up the Hive environment by appending the following lines to the ~/.bashrc file:
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
export CLASSPATH=$CLASSPATH:/usr/local/Hadoop/lib/*:.
export CLASSPATH=$CLASSPATH:/usr/local/hive/lib/*:.
$ source ~/.bashrc
To configure Hive with Hadoop, you need to edit the hive-env.sh file, which is placed in the
$HIVE_HOME/conf directory. The following commands redirect to Hive config folder and copy the
template file:
$ cd $HIVE_HOME/conf
$ cp hive-env.sh.template hive-env.sh
export HADOOP_HOME=/usr/local/hadoop
Now you require an external database server to configure Metastore. We use Apache Derby database.
Follow the steps given below to download and install Apache Derby:
Downloading Apache Derby
The following command is used to download Apache Derby. It takes some time to download.
$ cd ~
$ wget http://archive.apache.org/dist/db/derby/db-derby-10.4.2.0/db-derby-10.4.2.0-bin.tar.gz
$ ls
db-derby-10.4.2.0-bin.tar.gz
The following commands are used for extracting and verifying the Derby archive:
$ tar zxvf db-derby-10.4.2.0-bin.tar.gz
$ ls
db-derby-10.4.2.0-bin db-derby-10.4.2.0-bin.tar.gz
We need to copy the files as the super user "su -". The following commands are used to copy the files from the
extracted directory to the /usr/local/derby directory:
$ su -
passwd:
# cd /home/user
# mv db-derby-10.4.2.0-bin /usr/local/derby
# exit
You can set up the Derby environment by appending the following lines to the ~/.bashrc file:
export DERBY_HOME=/usr/local/derby
export PATH=$PATH:$DERBY_HOME/bin
export CLASSPATH=$CLASSPATH:$DERBY_HOME/lib/derby.jar:$DERBY_HOME/lib/derbytools.jar
$ source ~/.bashrc
Create a directory to store Metastore
$ mkdir $DERBY_HOME/data
Configuring Metastore means specifying to Hive where the database is stored. You can do this by editing the
hive-site.xml file, which is in the $HIVE_HOME/conf directory.
First of all, copy the template file using the following command:
$ cd $HIVE_HOME/conf
$ cp hive-default.xml.template hive-site.xml
Edit hive-site.xml and append the following lines between the <configuration> and </configuration> tags:
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby://localhost:1527/metastore_db;create=true </value>
<description>JDBC connect string for a JDBC metastore </description>
</property>
Create a file named jpox.properties and add the following lines into it:
javax.jdo.PersistenceManagerFactoryClass =
org.jpox.PersistenceManagerFactoryImpl
org.jpox.autoCreateSchema = false
org.jpox.validateTables = false
org.jpox.validateColumns = false
org.jpox.validateConstraints = false
org.jpox.storeManagerType = rdbms
org.jpox.autoCreateSchema = true
org.jpox.autoStartMechanismMode = checked
org.jpox.transactionIsolation = read_committed
javax.jdo.option.DetachAllOnCommit = true
javax.jdo.option.NontransactionalRead = true
javax.jdo.option.ConnectionDriverName = org.apache.derby.jdbc.ClientDriver
javax.jdo.option.ConnectionURL = jdbc:derby://hadoop1:1527/metastore_db;create = true
javax.jdo.option.ConnectionUserName = APP
javax.jdo.option.ConnectionPassword = mine
Before running Hive, you need to create the /tmp folder and a separate Hive folder (here we use
/user/hive/warehouse) in HDFS, and set write permission (chmod g+w) on these newly created folders.
Set them in HDFS before verifying Hive using the following commands:
$ $HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse
$ cd $HIVE_HOME
$ bin/hive
hive> show tables;
OK
Time taken: 2.798 seconds
hive>
Query in HIVE:
1) Retrieving information:
SELECT from_columns FROM table WHERE conditions;
2) All values:
SELECT * FROM tables;
3) Some Values:
SELECT * FROM table WHERE rec_name = "value";
4) Multiple Criteria:-
SELECT * FROM TABLE WHERE rec1 = "value1" AND rec2 = "value2";
7) Sorting
SELECT col1, col2 FROM table ORDER BY col2;
8) Sorting backward
SELECT col1, col2 FROM table ORDER BY col2 DESC;
9) Counting rows
SELECT COUNT(*) FROM table;
10) Grouping with counting
SELECT owner, COUNT(*) FROM table GROUP BY owner;
11) Maximum value
SELECT MAX(col_name) AS label FROM table;
16) Selecting from multiple tables (join tables using an alias with "AS")
SELECT pet.name, comment FROM pet JOIN event ON (pet.name = event.name);
Practical 7
Aim:- Write Hive Query for the following task for movie dataset. Movie dataset consists
of movie id, movie name, release year, rating, and runtime in seconds. A sample of the
dataset is as follows:
a. The Nightmare Before Christmas,1993,3.9,4568
b. The Mummy,1932,3.5,4388
c. Orphans of the Storm,1921,3.2,9062
d. The Object of Beauty,1991,2.8,6150
e. Night Tide,1963,2.8,5126
Write a hive query for the following
a. Load the data
b. List the movies that are having a rating greater than 4
c. Store the result of previous query into file
d. List the movies that were released between 1950 and 1960
e. List the movies that have duration greater than 2 hours
f. List the movies that have rating between 3 and 4
g. List the movie names and its duration in minutes
h. List all the movies in the ascending order of year.
i. List all the movies in the descending order of year.
j. list the distinct records.
k. Use the LIMIT keyword to get only a limited number for results from relation.
l. Use the sample keyword to get a sample set from your data.
m. View the step-by-step execution of a sequence of statements using the ILLUSTRATE
command.
Assuming the table name is movies and the columns are movie_name, release_year, rating, and runtime.
SELECT * FROM movies WHERE release_year BETWEEN 1950 AND 1960;
l. Use the TABLESAMPLE clause to get a sample set from the data:
SELECT * FROM movies TABLESAMPLE(10 PERCENT) s;
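The remaining tasks follow the same pattern. A few illustrative queries, assuming the same table and column names (the load path is a placeholder):
a. CREATE TABLE movies (movie_name STRING, release_year INT, rating FLOAT, runtime INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA LOCAL INPATH '/home/user/movies.txt' INTO TABLE movies;
b. SELECT * FROM movies WHERE rating > 4;
e. SELECT * FROM movies WHERE runtime > 7200;
g. SELECT movie_name, runtime/60 AS duration_minutes FROM movies;
h. SELECT * FROM movies ORDER BY release_year ASC;
k. SELECT * FROM movies LIMIT 10;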
Practical 8
Aim:- Install Apache Pig and execute basic Pig Latin commands.
Prerequisites
It is essential that you have Hadoop and Java installed on your system before you go for Apache Pig.
Downloading Pig
First of all, download the latest version of Apache Pig from the following website − https://pig.apache.org/
Step 1
Create a directory with the name Pig in the same directory where the installation directories of Hadoop,
Java, and other software were installed. (We have created the Pig directory in the user named Hadoop).
$ mkdir Pig
Step 2
Extract the downloaded tar files as shown below.
$ cd Downloads/
$ tar zxvf pig-0.15.0-src.tar.gz
$ tar zxvf pig-0.15.0.tar.gz
Step 3
Move the content of pig-0.15.0-src.tar.gz file to the Pig directory created earlier as shown below.
$ mv pig-0.15.0-src.tar.gz/* /home/Hadoop/Pig/
.bashrc file
In the .bashrc file, set the following variables:
export PIG_HOME=/home/Hadoop/Pig
export PATH=$PATH:/home/Hadoop/Pig/bin
pig.properties file
In the conf folder of Pig, we have a file named pig.properties. In the pig.properties file, you can set various
parameters as given below.
pig -h properties
Miscellaneous: exectype = mapreduce|tez|local; default is mapreduce. This property is the same as -x switch
pig.additional.jars.uris=<comma separated list of jars>. Used in place of the register command.
udf.import.list=<comma separated list of imports>. Used to avoid package names in UDF.
stop.on.failure = true|false; default is false. Set to true to terminate on the first error.
pig.datetime.default.tz=<UTC time offset>. e.g. +08:00. Default is the default timezone of the host.
Determines the timezone used to handle datetime datatype and UDFs.
Additionally, any Hadoop property can be specified.
$ pig -version
You can run Apache Pig in two modes, namely, Local Mode and HDFS mode.
Local Mode
In this mode, all the files are installed and run from your local host and local file system. There is no need of
Hadoop or HDFS. This mode is generally used for testing purpose.
MapReduce Mode
MapReduce mode is where we load or process the data that exists in the Hadoop File System (HDFS) using
Apache Pig. In this mode, whenever we execute the Pig Latin statements to process the data, a MapReduce
job is invoked in the back-end to perform a particular operation on the data that exists in the HDFS.
Interactive Mode (Grunt shell) − You can run Apache Pig in interactive mode using the Grunt shell. In
this shell, you can enter the Pig Latin statements and get the output (using Dump operator).
Batch Mode (Script) − You can run Apache Pig in Batch mode by writing the Pig Latin script in a single
file with .pig extension.
Embedded Mode (UDF) − Apache Pig provides the provision of defining our own functions (User Defined
Functions) in programming languages such as Java, and using them in our script.
You can invoke the Grunt shell in a desired mode (local/MapReduce) using the -x option as shown below.
Local mode: $ ./pig -x local
MapReduce mode: $ ./pig -x mapreduce
Either of these commands gives you the Grunt shell prompt as shown below.
Grunt Shell
grunt>
Sample_script.pig
student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING PigStorage(',') as (id:int, name:chararray, city:chararray);
Local mode: $ pig -x local Sample_script.pig
MapReduce mode: $ pig -x mapreduce Sample_script.pig
Step 4: Describe Data
grunt> describe student;
Step 5: Dump Data
grunt> dump student;
grunt> fs -ls
grunt> clear
4. Reading Data: Assuming the data resides in HDFS, we need to read data to Pig.
PigStorage() is the function that loads and stores data as structured text files.
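For example, a load statement could look like the following (the HDFS path and the schema are assumed for illustration):
grunt> college_students = LOAD 'hdfs://localhost:9000/pig_data/college_data.txt' USING PigStorage(',') AS (id:int, name:chararray, age:int, city:chararray);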
6. Dump Operator: This command displays the results on the screen. It usually helps in debugging.
8. Explain: This command helps to review the logical, physical, and map-reduce execution plans.
grunt> explain college_students;
10. Group: This command works towards grouping data with the same key.
11. COGROUP: It works similarly to the group operator. The main difference between the Group &
Cogroup operators is that the group operator usually uses one relation, while cogroup uses more than one
relation.
Example: To perform self-join, the relation "customer" is loaded from HDFS into Pig as two
relations, customers1 & customers2.
13. Cross: This pig command calculates the cross product of two or more relations.
14. Union: It merges two relations. The condition for merging is that the relation’s columns and
domainsmust be identical.
15. Filter: This helps filter out the tuples out of relation based on certain conditions.
16. Distinct: This helps in the removal of redundant tuples from the relation.
17. Foreach: This helps generate data transformation based on column data.
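For example, using the relation and schema assumed above:
grunt> filter_data = FILTER college_students BY city == 'Chennai';
grunt> distinct_data = DISTINCT college_students;
grunt> foreach_data = FOREACH college_students GENERATE id, age;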
4. Order by: This command displays the result in a sorted order based on one or more fields.
grunt> order_by_data = ORDER college_students BY age DESC;
Practical 9
Aim:- Install HBase and perform table operations using the HBase shell.
Prerequisites
It is essential that you have Hadoop and Java installed on your system before you go for Hbase.
Installing HBase
We can install HBase in any of the three modes: Standalone mode, Pseudo Distributed mode, and
FullyDistributed mode.
$ cd /usr/local/
$wget http://www.interior-dsgn.com/apache/hbase/stable/hbase-0.98.8-hadoop2-bin.tar.gz
$tar -zxvf hbase-0.98.8-hadoop2-bin.tar.gz
Shift to super user mode and move the HBase folder to /usr/local as shown below.
$ su
$ password: enter your password here
mv hbase-0.99.1/* Hbase/
Before proceeding with HBase, you have to edit the following files and configure HBase.
hbase-env.sh
Set the java Home for HBase and open hbase-env.sh file from the conf folder. Edit JAVA_HOME
environment variable and change the existing path to your current JAVA_HOME variable as shown below.
cd /usr/local/Hbase/conf
gedit hbase-env.sh
This will open the env.sh file of HBase. Now replace the existing JAVA_HOME value with your current
value as shown below.
export JAVA_HOME=/usr/lib/jvm/java-1.7.0
hbase-site.xml
This is the main configuration file of HBase. Set the data directory to an appropriate location by opening the
HBase home folder in /usr/local/HBase. Inside the conf folder, you will find several files, open the hbase-
site.xml file as shown below.
#cd /usr/local/HBase/
#cd conf
# gedit hbase-site.xml
Inside the hbase-site.xml file, you will find the <configuration> and </configuration> tags. Within them, set
the HBase directory under the property key with the name “hbase.rootdir” as shown below.
<configuration>
//Here you have to set the path where you want HBase to store its files.
<property>
<name>hbase.rootdir</name>
<value>file:/home/hadoop/HBase/HFiles</value>
</property>
//Here you have to set the path where you want HBase to store its built in zookeeper files.
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/home/hadoop/zookeeper</value>
</property>
</configuration>
With this, the HBase installation and configuration part is successfully complete. We can start HBase by
using start-hbase.sh script provided in the bin folder of HBase. For that, open HBase Home Folder and run
HBase start script as shown below.
$cd /usr/local/HBase/bin
$./start-hbase.sh
If everything goes well, when you try to run HBase start script, it will prompt you a message saying that
HBase has started.
Installing HBase in Pseudo-Distributed mode
Before proceeding with HBase, configure Hadoop and HDFS on your local system or on a remote system
and make sure they are running. Stop HBase if it is running.
hbase-site.xml
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
It will mention in which mode HBase should be run. In the same file from the local file system, change the
hbase.rootdir, your HDFS instance address, using the hdfs:// URI syntax. We are running HDFS on the
localhost at port 8030.
<property>
<name>hbase.rootdir</name>
<value>hdfs://localhost:8030/hbase</value>
</property>
Starting HBase
After configuration is over, browse to HBase home folder and start HBase using the following command.
$cd /usr/local/HBase
$bin/start-hbase.sh
HBase creates its directory in HDFS. To see the created directory, browse to Hadoop bin and type the
following command.
Found 7 items
drwxr-xr-x - hbase users 0 2014-06-25 18:58 /hbase/.tmp
drwxr-xr-x - hbase users 0 2014-06-25 21:49 /hbase/WALs
drwxr-xr-x - hbase users 0 2014-06-25 18:48 /hbase/corrupt
drwxr-xr-x - hbase users 0 2014-06-25 18:58 /hbase/data
-rw-r--r-- 3 hbase users 42 2014-06-25 18:41 /hbase/hbase.id
-rw-r--r-- 3 hbase users 7 2014-06-25 18:41 /hbase/hbase.version
drwxr-xr-x - hbase users 0 2014-06-25 21:49
Using the "local-master-backup.sh" script you can start up to 10 backup master servers. Open the HBase home folder
and execute the following command to start it.
$ ./bin/local-master-backup.sh 2 4
To kill a backup master, you need its process id, which will be stored in a file named “/tmp/hbase-USER-X-
master.pid.” you can kill the backup master using the following command.
You can run multiple region servers from a single system using the following command.
$ .bin/local-regionservers.sh start 2 3
$ .bin/local-regionservers.sh stop 3
Starting HBaseShell
After Installing HBase successfully, you can start HBase Shell. Below given are the sequences of steps that
are to be followed to start the HBase shell. Open the terminal, and login as super user.
Start Hadoop File System: Browse through Hadoop home sbin folder and start Hadoop file system as
shown below.
$cd $HADOOP_HOME/sbin
$start-all.sh
Start HBase: Browse through the HBase root directory bin folder and start HBase.
$cd /usr/local/HBase
$./bin/start-hbase.sh
Start HBase Master Server: This will be the same directory. Start it as shown below.
$./bin/./local-regionservers.sh start 3
Start HBase Shell: You can start HBase shell using the following command.
$cd bin
$./hbase shell
This will give you the HBase Shell Prompt as shown below.
r6cfc8d064754251365e070a10a82eb169956d5fe
hbase(main):001:0>
HBase Web Interface: To access the web interface of HBase, type the following url in the browser.
http://localhost:60010
This interface lists your currently running Region servers, backup masters and HBase tables.
HBase Tables
We can also communicate with HBase using Java libraries, but before accessing HBase using Java API you
need to set classpath for those libraries.
Before proceeding with programming, set the classpath to the HBase libraries in the .bashrc file. Open .bashrc
in any of the editors as shown below.
$ gedit ~/.bashrc
Set classpath for HBase libraries (lib folder in HBase) in it as shown below.
This is to prevent the “class not found” exception while accessing the HBase using java API.
In HBase, the interactive shell mode is used to interact with HBase for table operations, table management, and
data modeling. By using the Java API model, we can perform all types of table and data operations in HBase.
We can interact with HBase using both of these methods.
The only difference between the two is that the Java API uses Java code to connect with HBase, while the shell mode
uses shell commands to connect with HBase.
To enter the HBase shell prompt, first of all we have to execute the command as mentioned below:
$ hbase shell
General Commands
• status - This command will give details about the system status like a number of servers present in
the cluster, active server count, and average load value. You can also pass any particular
parametersdepending on how detailed status you want to know about the system. The parameters
can be
'summary', 'simple', or 'detailed'; the default parameter provided is "summary".
Syntax: status
• whoami - It is used to return the current HBase user information from the HBase cluster.
Syntax: whoami
• describe - Provides the description of the named table and its column families.
Syntax: describe <'tablename'>
Example: hbase(main):010:0> describe 'education'
• disable - Disables a table. If table needs to be deleted or dropped, it has to disable first
Syntax: disable <tablename>
Example: hbase(main):011:0> disable 'education'
• show_filters: It displays all the filters present in HBase like ColumnPrefix Filter,
TimestampsFilter,PageFilter, FamilyFilter, etc.
Syntax: show_filters
• is_enabled - This command will verify whether the named table is enabled or not. The difference between the
enable and is_enabled commands is:
• Suppose a table is disabled; to use that table we have to enable it by using the enable command
• The is_enabled command will check whether the table is enabled or not
• alter - Alters a table.
Syntax: alter <tablename>, NAME=><column familyname>, VERSIONS=>5
Example 1: To change or add the 'rahul_1' column family in table 'education' from the current value
to keep a maximum of 5 cell VERSIONS
Example 2: You can also operate the alter command on several column families as well.
For example, we will define two new column families for our existing table "education".
hbase> alter 'edu', 'rahul_1', {NAME => 'rahul_2', IN_MEMORY => true}, {NAME =>
'rahul_3',VERSIONS => 5}
Example 3: How to delete a column family from the table. To delete the 'f1' column family in table
'education'.
Example 4: How to change the table scope attribute and how to remove the table scope attribute.
hbase> alter <'tablename'>, MAX_FILESIZE=>'132545224'
• drop_all - Drops the tables matching the 'regex' given in the command.
Syntax: drop_all <"regex">
• Java Admin API - Prior to all the above commands, Java provides an Admin API to achieve DDL
functionalities through programming. Under org.apache.hadoop.hbase.client package, HBaseAdmin
and HTableDescriptor are the two important classes in this package that provide DDL
functionalities.
Data Manipulation Language
• put - Puts a cell value at a specified column in a specified row in a particular table.
Syntax: put <'tablename'>, <'rowname'>, <'columnname'>, <'value'>
• get - Fetches the contents of a row or a cell.
Syntax: get <'tablename'>, <'rowname'>
Example: hbase> get 'rahul', 'r1', {COLUMN => 'c1'} (row r1 and column c1 values will be displayed)
hbase> get 'rahul', 'r1', {TIMERANGE => [ts1, ts2]} (row r1 values in the time range ts1 and ts2 will be displayed)
hbase> get 'rahul', 'r1', {COLUMN => ['c1', 'c2', 'c3']} (row r1 and column families c1, c2, c3 values will be displayed)
• count - Counts and returns the number of rows in a table.
Syntax: count <'tablename'>, CACHE => 1000
We can run the count command on a table reference also, like below:
hbase> g.count INTERVAL => 100000
hbase> g.count INTERVAL => 10, CACHE => 1000
• Java client API - Prior to all the above commands, Java provides a client API to achieve DML
functionalities, CRUD (Create Retrieve Update Delete) operations and more through programming,
under the org.apache.hadoop.hbase.client package. HTable, Put and Get are the important classes in this
package.
Practical 10
Aim:- Write a java program to insert, update and delete records from HBase.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.*;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;
import java.io.IOException;
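The program fragment below begins directly with the insert; the connection setup that would precede it is not shown. A minimal sketch, using the placeholder table name from the note at the end:
// Establish a connection and obtain a handle to the table
Configuration config = HBaseConfiguration.create();
Connection connection = ConnectionFactory.createConnection(config);
// "your_table_name" is a placeholder; replace it with your actual table
Table table = connection.getTable(TableName.valueOf("your_table_name"));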
// Insert a record
Put put = new Put(Bytes.toBytes("row1"));
put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col1"), Bytes.toBytes("value1"));
table.put(put);
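The aim also asks for an update; in HBase an update is simply another Put on the same row and column, and a Get can read the value back. A sketch using the same placeholder names:
// Update the record by writing a new value to the same row/column
Put update = new Put(Bytes.toBytes("row1"));
update.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col1"), Bytes.toBytes("value2"));
table.put(update);
// Read the row back to verify the update
Get get = new Get(Bytes.toBytes("row1"));
Result result = table.get(get);
System.out.println("Updated value: " + Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col1"))));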
Delete delete = new Delete(Bytes.toBytes("row1"));
table.delete(delete);
// Verify deletion
Result deletedResult = table.get(get);
if (deletedResult.isEmpty()) {
System.out.println("Record deleted successfully.");
} else {
System.out.println("Record deletion failed.");
}
Note:
• Before you proceed, make sure you have the HBase client libraries in your classpath.
• Remember to replace "your_table_name" with the actual name of the table you want to interact
with. Also, make sure that your HBase server is running and properly configured.
Practical 11
Aim:- Install Scala and Apache Spark and perform basic RDD operations using the Spark shell.
Verify the Scala installation with the following command:
$ scala -version
If Scala is already installed on your system, you get to see the following response −
Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL
In case you don't have Scala installed on your system, then proceed to the next step for Scala installation.
Download the latest version of Scala by visiting the following link: https://www.scala-lang.org/download/.
After downloading, you will find the Scala tar file in the download folder.
$ su -
Password:
# cd /home/Hadoop/Downloads/
# mv scala-2.11.6 /usr/local/scala
# exit
Add the following line to the ~/.bashrc file:
export PATH=$PATH:/usr/local/scala/bin
Verifying Scala Installation
$ scala -version
Step 6: Installing Spark
Extracting the Spark tar file:
$ tar xvf spark-1.3.1-bin-hadoop2.6.tgz
The following commands are used for moving the Spark software files to the respective directory (/usr/local/spark).
$ su -
Password:
# cd /home/Hadoop/Downloads/
# mv spark-1.3.1-bin-hadoop2.6 /usr/local/spark
# exit
export PATH=$PATH:/usr/local/spark/bin
$ source ~/.bashrc
$spark-shell
If spark is installed successfully then you will find the following output.
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/09/23 10:50:14 INFO SecurityManager: Changing view acls to: hadoop
16/09/23 10:50:14 INFO SecurityManager: Changing modify acls to: hadoop
16/09/23 10:50:14 INFO SecurityManager: SecurityManager: authentication disabled;
ui acls disabled; users with view permissions: Set(hadoop); users with modify
permissions:Set(hadoop)
16/09/23 10:50:15 INFO HttpServer: Starting HTTP Server
16/09/23 10:50:17 INFO Utils: Successfully started service 'HTTP class server' on port 43292.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.4.0
      /_/
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
Spark context available as sc
scala>
Apache Spark is a powerful open-source framework for big data processing and analytics. It provides
aninterface for programming entire clusters with implicit data parallelism and fault tolerance.
Starting Spark Shell:
$ spark-shell
This launches the Spark shell in Scala, where you can execute Spark code interactively.
Creating RDD (Resilient Distributed Dataset):
RDD is the fundamental data structure in Spark. You can create an RDD from various sources, such as
textfiles, HDFS, or by transforming an existing RDD.
scala> val data = sc.textFile("data.txt")
These are three methods to create the RDD. We can use the first method, when data is already available
with the external systems like a local filesystem, HDFS, HBase, Cassandra, S3, etc. One can create an RDD
by calling a textFile method of Spark Context with a path / URL as the argument. The second approach can
be used with the existing collections and the third one is a way to create a new RDD from an existing one.
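For reference, the other two approaches mentioned above look roughly like this (the collection contents are arbitrary):
scala> val nums = sc.parallelize(Seq(1, 2, 3, 4, 5)) // RDD from an existing collection
scala> val doubled = nums.map(_ * 2) // new RDD created from an existing one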
Count Operation: to count the number of items in the RDD.
scala> data.count()
Filter Operation
Filter the RDD and create new RDD of items which contain word “DataFlair”. To filter, we need to call
transformation filter, which will return a new RDD with subset of items.
scala> val DFData = data.filter(line => line.contains("DataFlair"))
To read the first five items of an RDD:
scala> data.take(5)
RDD Partitions
An RDD is made up of multiple partitions, to count the number of partitions:
scala> data.partitions.length
Note: Minimum no. of partitions in the RDD is 2 (by default). When we create RDD from HDFS file then a
number of blocks will be equals to the number of partitions.
scala> data.cache()
The RDD will not be cached once you run the above operation; you can visit the web UI at
http://localhost:4040/storage, and it will be blank. RDDs are not cached as soon as we run cache();
rather, they are cached once we run an Action which actually needs data read from the disk.
Read data from an HDFS file:
scala> sc.textFile("hdfs://localhost:9000/inp")
Practical 12
Aim:- Write a scala program to process CSV, JSON and TXT File.
import org.apache.spark.sql.SparkSession
object FileProcessingExample {
def main(args: Array[String]): Unit = {
// Create a Spark session
val spark = SparkSession.builder()
.appName("FileProcessingExample")
.master("local[*]") // Run locally using all available cores
.getOrCreate()
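The rest of the program is not reproduced above; a minimal sketch of reading the three formats with this session (the file paths are placeholders) is:
// Read a CSV file that has a header row
val csvDF = spark.read.option("header", "true").csv("data/sample.csv")
// Read a JSON file (one JSON object per line)
val jsonDF = spark.read.json("data/sample.json")
// Read a plain text file, one line per row
val txtDF = spark.read.text("data/sample.txt")
csvDF.show() // display the CSV contents
jsonDF.printSchema() // inspect the inferred JSON schema
println(txtDF.count()) // number of lines in the text file
spark.stop() // stop the session when done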
In this example, a local Spark session is created and then used to load each file format.
Make sure you have the appropriate Spark dependencies and configuration set up to run the program.
Practical 13
Aim:- Write a Scala program to perform basic string operations.
object StringOperations {
def main(args: Array[String]): Unit = {
val str1 = "Hello, World!"
val str2 = "Scala Programming"
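The operations themselves were not captured above; a sketch of the kind of string operations such a program typically performs on the two strings (the exact set is an assumption) is:
println(str1.length)                    // length of the string
println(str1.toUpperCase)               // convert to upper case
println(str2.toLowerCase)               // convert to lower case
println(str1 + " " + str2)              // concatenation
println(str1.substring(7, 12))          // substring -> "World"
println(str1.contains("World"))         // containment check
println(str2.split(" ").mkString(", ")) // split into words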
}
}
Note:
Remember to replace the sample strings with the actual strings you want to use for testing.
Practical 14
OUTPUT: