CP5261 Data Analytics Laboratory

This document outlines the process for installing and configuring a single-node Hadoop cluster on Ubuntu, including steps for installing Java, setting up SSH, and configuring Hadoop's environment variables and configuration files. It provides detailed commands and procedures for creating necessary directories, modifying configuration files, and formatting the Hadoop file system. The document serves as a practical guide for students to complete their practical examination in a relevant subject.


NAME : …………………………………………………

REGISTER NO : ………………………………………………….
YEAR / SEM : …………………………………………………
DEPARTMENT : …………………………………………………
SUBJECT : ………………………………………………….
BONAFIDE CERTIFICATE

Certified that this is the bonafide record work done by

Mr./Ms.…….…………………………Register No…………………..

of .......……....…………………………………………… Department

....……. Year / ….… Semester during the academic year …….….…

in the Sub code/subject ……………………….…………………....……

Staff In-Charge Head of the Department

Submitted for Practical Examination held on……………………......………………….

Internal Examiner External Examiner


INDEX

EX.NO   DATE   NAME OF THE EXPERIMENT                                              PAGE NO   SIGN

1              INSTALL, CONFIGURE AND RUN HADOOP AND HDFS
2              IMPLEMENT WORD COUNT / FREQUENCY PROGRAMS USING MAPREDUCE
3              IMPLEMENT AN MR PROGRAM THAT PROCESSES A WEATHER DATASET
4              IMPLEMENT LINEAR AND LOGISTIC REGRESSION
5              IMPLEMENT SVM / DECISION TREE CLASSIFICATION TECHNIQUES
6              IMPLEMENT CLUSTERING TECHNIQUES
7              VISUALIZE DATA USING ANY PLOTTING FRAMEWORK
8              IMPLEMENT AN APPLICATION THAT STORES BIG DATA IN HBASE / MONGODB / PIG USING HADOOP / R

EX.NO: 1 INSTALL, CONFIGURE AND RUN HADOOP AND HDFS


DATE:

AIM:

To install a single-node Hadoop cluster backed by the Hadoop Distributed File System on Ubuntu.

PROCEDURE:

1. Installing Java

prince@prince-VirtualBox:~$ cd ~

Update the source list

prince@prince-VirtualBox:~$ sudo apt-get update

The OpenJDK project is the default version of Java that is provided from a supported Ubuntu repository.

prince@prince-VirtualBox:~$ sudo apt-get install default-jdk

prince@prince-VirtualBox:~$ java -version

java version "1.7.0_65"


OpenJDK Runtime Environment (IcedTea 2.5.3) (7u71-2.5.3-0ubuntu0.14.04.1)
OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)

2. Adding a dedicated Hadoop user


prince@prince-VirtualBox:~$ sudo addgroup hadoop

Adding group `hadoop' (GID 1002) ...


Done.

prince@prince-VirtualBox:~$ sudo adduser --ingroup hadoop hduser

Adding user `hduser' ...


Adding new user `hduser' (1001) with group `hadoop' ...
Creating home directory `/home/hduser' ...
Copying files from `/etc/skel' ...
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully
Changing the user information for hduser
Enter the new value, or press ENTER for the default
Full Name []:
Room Number []:
Work Phone []:
Home Phone []:
Other []:
Is the information correct? [Y/n] Y

3. Installing SSH

ssh has two main components:

1. ssh : The command we use to connect to remote machines - the client.


2. sshd : The daemon that is running on the server and allows clients to connect to the
server.

SSH is pre-enabled on Linux, but in order to start the sshd daemon we need to install the ssh
package first. Use this command to do that:

prince@prince-VirtualBox:~$ sudo apt-get install ssh

This will install ssh on our machine. If we get something similar to the following, we can
think it is setup properly:

prince@prince-VirtualBox:~$ which ssh

/usr/bin/ssh

prince@prince-VirtualBox:~$ which sshd


/usr/sbin/sshd

4. Create and Setup SSH Certificates

Hadoop requires SSH access to manage its nodes, i.e. remote machines plus our local
machine. For our single-node setup of Hadoop, we therefore need to configure SSH access to
localhost.

So, we need to have SSH up and running on our machine and configure it to allow SSH
public-key authentication.

Hadoop uses SSH (to access its nodes) which would normally require the user to enter a
password. However, this requirement can be eliminated by creating and setting up SSH
certificates using the following commands. If asked for a filename just leave it blank and
press the enter key to continue.

prince@prince-VirtualBox:~$ su hduser
Password:
prince@prince-VirtualBox:~$ ssh-keygen -t rsa -P ""

Generating public/private rsa key pair.


Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
50:6b:f3:fc:0f:32:bf:30:79:c2:41:71:26:cc:7d:e3 hduser@prince-VirtualBox
The key's randomart image is:
+--[ RSA 2048]----+
| .oo.o |
| . .o=. o |
| .+. o.|
| o= E |
| S+ |
| .+ |
| O+ |
| Oo |
| o.. |
+-----------------+

hduser@prince-VirtualBox:/home/k$ cat $HOME/.ssh/id_rsa.pub >>


$HOME/.ssh/authorized_keys

The second command adds the newly created key to the list of authorized keys so that
Hadoop can use ssh without prompting for a password.

We can check if ssh works:

hduser@prince-VirtualBox:/home/k$ ssh localhost

The authenticity of host 'localhost (127.0.0.1)' can't be established.


ECDSA key fingerprint is e1:8b:a0:a5:75:ef:f4:b4:5e:a9:ed:be:64:be:5c:2f.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 14.04.1 LTS (GNU/Linux 3.13.0-40-generic x86_64)
...

5. Install Hadoop
hduser@prince-VirtualBox:~$ wget
http://mirrors.sonic.net/apache/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
hduser@prince-VirtualBox:~$ tar xvzf hadoop-2.6.0.tar.gz

We want to move the Hadoop installation to the /usr/local/hadoop directory using the
following command:

hduser@prince-VirtualBox:~/hadoop-2.6.0$ sudo mv * /usr/local/hadoop

[sudo] password for hduser:


hduser is not in the sudoers file. This incident will be reported.

Oops!... We got:

"hduser is not in the sudoers file. This incident will be reported."

This error can be resolved by switching to a user with sudo rights and then adding hduser to the sudo group:

hduser@prince-VirtualBox:~/hadoop-2.6.0$ su prince
Password:

prince@prince-VirtualBox:/home/hduser$ sudo adduser hduser sudo

[sudo] password for prince:


Adding user `hduser' to group `sudo' ...
Adding user hduser to group sudo
Done.

Now that hduser has root privileges, we can move the Hadoop installation to the
/usr/local/hadoop directory without any problem:

prince@prince-VirtualBox:/home/hduser$ sudo su hduser

hduser@prince-VirtualBox:~/hadoop-2.6.0$ sudo mv * /usr/local/hadoop

hduser@prince-VirtualBox:~/hadoop-2.6.0$ sudo chown -R hduser:hadoop /usr/local/hadoop

6. Setup Configuration Files

The following files will have to be modified to complete the Hadoop setup:

i. ~/.bashrc

ii. /usr/local/hadoop/etc/hadoop/hadoop-env.sh

iii. /usr/local/hadoop/etc/hadoop/core-site.xml

iv. /usr/local/hadoop/etc/hadoop/mapred-site.xml.template

v. /usr/local/hadoop/etc/hadoop/hdfs-site.xml

i. ~/.bashrc:

Before editing the .bashrc file in our home directory, we need to find the path where Java has
been installed to set the JAVA_HOME environment variable using the following command:

hduser@prince-VirtualBox:~$ update-alternatives --config java

There is only one alternative in link group java (providing /usr/bin/java): /usr/lib/jvm/java-7-
openjdk-amd64/jre/bin/java
Nothing to configure.

Now we can append the following to the end of ~/.bashrc:

hduser@prince-VirtualBox:~$ nano ~/.bashrc

#HADOOP VARIABLES START


export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
#HADOOP VARIABLES END

hduser@prince-VirtualBox:~$ source ~/.bashrc


Note that JAVA_HOME should be set to the path just before the '.../bin/' part of the javac location:

hduser@ubuntu-VirtualBox:~$ javac -version


javac 1.7.0_75

hduser@ubuntu-VirtualBox:~$ which javac


/usr/bin/javac

hduser@ubuntu-VirtualBox:~$ readlink -f /usr/bin/javac


/usr/lib/jvm/java-7-openjdk-amd64/bin/javac

ii. /usr/local/hadoop/etc/hadoop/hadoop-env.sh

We need to set JAVA_HOME by modifying hadoop-env.sh file.

hduser@prince-VirtualBox:~$ nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

Adding the above statement in the hadoop-env.sh file ensures that the value of
JAVA_HOME variable will be available to Hadoop whenever it is started up.

iii. /usr/local/hadoop/etc/hadoop/core-site.xml:

The /usr/local/hadoop/etc/hadoop/core-site.xml file contains configuration properties that


Hadoop uses when starting up.
This file can be used to override the default settings that Hadoop starts with.

hduser@prince-VirtualBox:~$ sudo mkdir -p /app/hadoop/tmp

hduser@prince-VirtualBox:~$ sudo chown hduser:hadoop /app/hadoop/tmp

Open the file and enter the following in between the <configuration></configuration> tag:

hduser@prince-VirtualBox:~$ nano /usr/local/hadoop/etc/hadoop/core-site.xml

<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>

<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
</configuration>

iv. /usr/local/hadoop/etc/hadoop/mapred-site.xml

By default, the /usr/local/hadoop/etc/hadoop/ folder contains


/usr/local/hadoop/etc/hadoop/mapred-site.xml.template
file which has to be renamed/copied with the name mapred-site.xml:

hduser@prince-VirtualBox:~$ cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml

The mapred-site.xml file is used to specify which framework is being used for MapReduce.
We need to enter the following content in between the <configuration></configuration> tag:

<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
</configuration>

v. /usr/local/hadoop/etc/hadoop/hdfs-site.xml

The /usr/local/hadoop/etc/hadoop/hdfs-site.xml file needs to be configured for each host in


the cluster that is being used.
It is used to specify the directories which will be used as the namenode and the datanode on
that host.

Before editing this file, we need to create two directories which will contain the namenode
and the datanode for this Hadoop installation.
This can be done using the following commands:

hduser@prince-VirtualBox:~$ sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode
hduser@prince-VirtualBox:~$ sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
hduser@prince-VirtualBox:~$ sudo chown -R hduser:hadoop /usr/local/hadoop_store

Open the file and enter the following content in between the <configuration></configuration>
tag:

hduser@prince-VirtualBox:~$ nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/datanode</value>
</property>
</configuration>

7. Format the New Hadoop File System

Now the Hadoop file system needs to be formatted so that we can start to use it. The format
command should be issued with write permission, since it creates a current directory
under the /usr/local/hadoop_store/hdfs/namenode folder:

hduser@prince-VirtualBox:~$ hadoop namenode -format

DEPRECATED: Use of this script to execute hdfs command is deprecated.


Instead use the hdfs command for it.

15/04/18 14:43:03 INFO namenode.NameNode: STARTUP_MSG:


/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = laptop/192.168.1.1
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.6.0
STARTUP_MSG: classpath = /usr/local/hadoop/etc/hadoop
...
STARTUP_MSG: java = 1.7.0_65
************************************************************/
15/04/18 14:43:03 INFO namenode.NameNode: registered UNIX signal handlers for
[TERM, HUP, INT]
15/04/18 14:43:03 INFO namenode.NameNode: createNameNode [-format]
15/04/18 14:43:07 WARN util.NativeCodeLoader: Unable to load native-hadoop library for
your platform... using builtin-java classes where applicable
Formatting using clusterid: CID-e2f515ac-33da-45bc-8466-5b1100a2bf7f
15/04/18 14:43:09 INFO namenode.FSNamesystem: No KeyProvider found.
15/04/18 14:43:09 INFO namenode.FSNamesystem: fsLock is fair:true
15/04/18 14:43:10 INFO blockmanagement.DatanodeManager:
dfs.block.invalidate.limit=1000
15/04/18 14:43:10 INFO blockmanagement.DatanodeManager:
dfs.namenode.datanode.registration.ip-hostname-check=true
15/04/18 14:43:10 INFO blockmanagement.BlockManager:
dfs.namenode.startup.delay.block.deletion.sec is set to 000:00:00:00.000
15/04/18 14:43:10 INFO blockmanagement.BlockManager: The block deletion will start
around 2015 Apr 18 14:43:10
15/04/18 14:43:10 INFO util.GSet: Computing capacity for map BlocksMap
15/04/18 14:43:10 INFO util.GSet: VM type = 64-bit
15/04/18 14:43:10 INFO util.GSet: 2.0% max memory 889 MB = 17.8 MB
15/04/18 14:43:10 INFO util.GSet: capacity = 2^21 = 2097152 entries
15/04/18 14:43:10 INFO blockmanagement.BlockManager:
dfs.block.access.token.enable=false
15/04/18 14:43:10 INFO blockmanagement.BlockManager: defaultReplication =1
15/04/18 14:43:10 INFO blockmanagement.BlockManager: maxReplication = 512
15/04/18 14:43:10 INFO blockmanagement.BlockManager: minReplication =1
15/04/18 14:43:10 INFO blockmanagement.BlockManager: maxReplicationStreams =2
15/04/18 14:43:10 INFO blockmanagement.BlockManager: shouldCheckForEnoughRacks =
false
15/04/18 14:43:10 INFO blockmanagement.BlockManager: replicationRecheckInterval =
3000
15/04/18 14:43:10 INFO blockmanagement.BlockManager: encryptDataTransfer = false
15/04/18 14:43:10 INFO blockmanagement.BlockManager: maxNumBlocksToLog =
1000
15/04/18 14:43:10 INFO namenode.FSNamesystem: fsOwner = hduser
(auth:SIMPLE)
15/04/18 14:43:10 INFO namenode.FSNamesystem: supergroup = supergroup
15/04/18 14:43:10 INFO namenode.FSNamesystem: isPermissionEnabled = true
15/04/18 14:43:10 INFO namenode.FSNamesystem: HA Enabled: false
15/04/18 14:43:10 INFO namenode.FSNamesystem: Append Enabled: true
15/04/18 14:43:11 INFO util.GSet: Computing capacity for map INodeMap
15/04/18 14:43:11 INFO util.GSet: VM type = 64-bit
15/04/18 14:43:11 INFO util.GSet: 1.0% max memory 889 MB = 8.9 MB
15/04/18 14:43:11 INFO util.GSet: capacity = 2^20 = 1048576 entries
15/04/18 14:43:11 INFO namenode.NameNode: Caching file names occuring more than 10
times
15/04/18 14:43:11 INFO util.GSet: Computing capacity for map cachedBlocks
15/04/18 14:43:11 INFO util.GSet: VM type = 64-bit
15/04/18 14:43:11 INFO util.GSet: 0.25% max memory 889 MB = 2.2 MB
15/04/18 14:43:11 INFO util.GSet: capacity = 2^18 = 262144 entries
15/04/18 14:43:11 INFO namenode.FSNamesystem: dfs.namenode.safemode.threshold-pct =
0.9990000128746033
15/04/18 14:43:11 INFO namenode.FSNamesystem: dfs.namenode.safemode.min.datanodes
=0
15/04/18 14:43:11 INFO namenode.FSNamesystem: dfs.namenode.safemode.extension =
30000
15/04/18 14:43:11 INFO namenode.FSNamesystem: Retry cache on namenode is enabled
15/04/18 14:43:11 INFO namenode.FSNamesystem: Retry cache will use 0.03 of total heap
and retry cache entry expiry time is 600000 millis
15/04/18 14:43:11 INFO util.GSet: Computing capacity for map NameNodeRetryCache
15/04/18 14:43:11 INFO util.GSet: VM type = 64-bit
15/04/18 14:43:11 INFO util.GSet: 0.029999999329447746% max memory 889 MB = 273.1
KB
15/04/18 14:43:11 INFO util.GSet: capacity = 2^15 = 32768 entries
15/04/18 14:43:11 INFO namenode.NNConf: ACLs enabled? false
15/04/18 14:43:11 INFO namenode.NNConf: XAttrs enabled? true
15/04/18 14:43:11 INFO namenode.NNConf: Maximum size of an xattr: 16384
15/04/18 14:43:12 INFO namenode.FSImage: Allocated new BlockPoolId: BP-130729900-
192.168.1.1-1429393391595
15/04/18 14:43:12 INFO common.Storage: Storage directory
/usr/local/hadoop_store/hdfs/namenode has been successfully formatted.
15/04/18 14:43:12 INFO namenode.NNStorageRetentionManager: Going to retain 1 images
with txid>= 0
15/04/18 14:43:12 INFO util.ExitUtil: Exiting with status 0
15/04/18 14:43:12 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at laptop/192.168.1.1
************************************************************/

Note that the hadoop namenode -format command should be executed only once, before we start
using Hadoop.
If this command is executed again after Hadoop has been used, it will destroy all the data on the
Hadoop file system.

8. Starting Hadoop

Now it's time to start the newly installed single node cluster.
We can use start-all.sh or (start-dfs.sh and start-yarn.sh)

prince@prince-VirtualBox:~$ cd /usr/local/hadoop/sbin

prince@prince-VirtualBox:/usr/local/hadoop/sbin$ ls

distribute-exclude.sh start-all.cmd stop-balancer.sh


hadoop-daemon.sh start-all.sh stop-dfs.cmd
hadoop-daemons.sh start-balancer.sh stop-dfs.sh
hdfs-config.cmd start-dfs.cmd stop-secure-dns.sh
hdfs-config.sh start-dfs.sh stop-yarn.cmd
httpfs.sh start-secure-dns.sh stop-yarn.sh
kms.sh start-yarn.cmd yarn-daemon.sh
mr-jobhistory-daemon.sh start-yarn.sh yarn-daemons.sh
refresh-namenodes.sh stop-all.cmd
slaves.sh stop-all.sh

prince@prince-VirtualBox:/usr/local/hadoop/sbin$ sudo su hduser

hduser@prince-VirtualBox:/usr/local/hadoop/sbin$ start-all.sh
hduser@prince-VirtualBox:~$ start-all.sh

This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh


15/04/18 16:43:13 WARN util.NativeCodeLoader: Unable to load native-hadoop library for
your platform... using builtin-java classes where applicable
Starting namenodes on [localhost]
localhost: starting namenode, logging to /usr/local/hadoop/logs/hadoop-hduser-namenode-
laptop.out
localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-hduser-datanode-
laptop.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-hduser-
secondarynamenode-laptop.out
15/04/18 16:43:58 WARN util.NativeCodeLoader: Unable to load native-hadoop library for
your platform... using builtin-java classes where applicable
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/logs/yarn-hduser-resourcemanager-
laptop.out
localhost: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-hduser-nodemanager-
laptop.out

We can check if it's really up and running:

hduser@prince-VirtualBox:/usr/local/hadoop/sbin$ jps

9026 NodeManager
7348 NameNode
9766 Jps
8887 ResourceManager
7507 DataNode

The output means that we now have a functional instance of Hadoop running on our virtual machine.

Another way to check is using netstat:

hduser@prince-VirtualBox:~$ netstat -plten | grep java

(Not all processes could be identified, non-owned process info


will not be shown, you would have to be root to see it all.)
tcp 0 0 0.0.0.0:50020 0.0.0.0:* LISTEN 1001 1843372
10605/java
tcp 0 0 127.0.0.1:54310 0.0.0.0:* LISTEN 1001 1841277
10447/java
tcp 0 0 0.0.0.0:50090 0.0.0.0:* LISTEN 1001 1841130
10895/java
tcp 0 0 0.0.0.0:50070 0.0.0.0:* LISTEN 1001 1840196
10447/java
tcp 0 0 0.0.0.0:50010 0.0.0.0:* LISTEN 1001 1841320
10605/java
tcp 0 0 0.0.0.0:50075 0.0.0.0:* LISTEN 1001 1841646
10605/java
tcp6 0 0 :::8040 :::* LISTEN 1001 1845543 11383/java
tcp6 0 0 :::8042 :::* LISTEN 1001 1845551 11383/java
tcp6 0 0 :::8088 :::* LISTEN 1001 1842110 11252/java
tcp6 0 0 :::49630 :::* LISTEN 1001 1845534
11383/java
tcp6 0 0 :::8030 :::* LISTEN 1001 1842036 11252/java
tcp6 0 0 :::8031 :::* LISTEN 1001 1842005 11252/java
tcp6 0 0 :::8032 :::* LISTEN 1001 1842100 11252/java
tcp6 0 0 :::8033 :::* LISTEN 1001 1842162 11252/java
9. Stopping Hadoop

We run stop-all.sh or (stop-dfs.sh and stop-yarn.sh) to stop all the daemons running on our
machine:

hduser@prince-VirtualBox:/usr/local/hadoop/sbin$ pwd

/usr/local/hadoop/sbin

hduser@prince-VirtualBox:/usr/local/hadoop/sbin$ ls

distribute-exclude.sh httpfs.sh start-all.cmd start-secure-dns.sh stop-balancer.sh


stop-yarn.sh
hadoop-daemon.sh kms.sh start-all.sh start-yarn.cmd stop-dfs.cmd
yarn-daemon.sh
hadoop-daemons.sh mr-jobhistory-daemon.sh start-balancer.sh start-yarn.sh stop-
dfs.sh yarn-daemons.sh
hdfs-config.cmd refresh-namenodes.sh start-dfs.cmd stop-all.cmd stop-secure-
dns.sh
hdfs-config.sh slaves.sh start-dfs.sh stop-all.sh stop-yarn.cmd

hduser@prince-VirtualBox:/usr/local/hadoop/sbin$ stop-all.sh

This script is Deprecated. Instead use stop-dfs.sh and stop-yarn.sh


15/04/18 15:46:31 WARN util.NativeCodeLoader: Unable to load native-hadoop library for
your platform... using builtin-java classes where applicable
Stopping namenodes on [localhost]
localhost: stopping namenode
localhost: stopping datanode
Stopping secondary namenodes [0.0.0.0]
0.0.0.0: no secondarynamenode to stop
15/04/18 15:46:59 WARN util.NativeCodeLoader: Unable to load native-hadoop library for
your platform... using builtin-java classes where applicable
stopping yarn daemons
stopping resourcemanager
localhost: stopping nodemanager
no proxyserver to stop
10. Hadoop Web Interfaces

Let's start Hadoop again and see its web UI:

hduser@prince-VirtualBox:/usr/local/hadoop/sbin$ start-all.sh

http://localhost:50070/ - web UI of the NameNode daemon


OUTPUT:
(Screenshots of the NameNode, Secondary NameNode and DataNode web pages.)

Note: Hadoop had to be restarted before the Secondary NameNode page appeared.
RESULT:

EX.NO: 2 IMPLEMENT WORD COUNT / FREQUENCY PROGRAMS USING MAPREDUCE

DATE:

AIM:

To write a Java program for counting the number of occurrences of each word in a
text file using the MapReduce concept.

PROCEDURE:

1. Install Hadoop.
2. Start all the services using the commands below.

hduser@prince-VirtualBox:/usr/local/hadoop/bin$ jps
3242 Jps

hduser@prince-VirtualBox:/usr/local/hadoop/bin$ start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
16/09/15 15:38:49 WARN util.NativeCodeLoader: Unable to load native-hadoop library for
your platform... using builtin-java classes where applicable
Starting namenodes on [localhost]
localhost: starting namenode, logging to /usr/local/hadoop/logs/hadoop-hduser-namenode-
prince-VirtualBox.out
localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-hduser-datanode-
prince-VirtualBox.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-hduser-
secondarynamenode-prince-VirtualBox.out
16/09/15 15:39:26 WARN util.NativeCodeLoader: Unable to load native-hadoop library for
your platform... using builtin-java classes where applicable
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/logs/yarn-hduser-resourcemanager-
prince-VirtualBox.out
localhost: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-hduser-nodemanager-
prince-VirtualBox.out

hduser@prince-VirtualBox:/usr/local/hadoop/bin$ jps
16098 NameNode
16214 DataNode
16761 NodeManager
16636 ResourceManager
16429 SecondaryNameNode
19231 Jps

PROGRAM CODING:
hduser@prince-VirtualBox:/usr/local/hadoop/bin$ nano wordcount7.java

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class wordcount7 {

public static class TokenizerMapper


extends Mapper<Object, Text, Text, IntWritable>{

private final static IntWritable one = new IntWritable(1);


private Text word = new Text();

public void map(Object key, Text value, Context context


) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}

public static class IntSumReducer


extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();

public void reduce(Text key, Iterable<IntWritable> values,


Context context
) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}

public static void main(String[] args) throws Exception {


Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(wordcount7.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}

TO COMPILE:
hduser@prince-VirtualBox:/usr/local/hadoop/bin$ hadoop com.sun.tools.javac.Main wordcount7.java

TO CREATE A JAR FILE:


hduser@prince-VirtualBox:/usr/local/hadoop/bin$ jar cf wc2.jar wordcount7*.class

TO CREATE A DIRECTORY IN HDFS:


hduser@prince-VirtualBox:/usr/local/hadoop/bin$ hadoop dfs -mkdir /deepika

TO LOAD INPUT FILE:


hduser@prince-VirtualBox:/usr/local/hadoop/bin$ hdfs dfs -put
/home/prince/Downloads/wc.txt /deepika/wc1

TO EXECUTE:
hduser@prince-VirtualBox:/usr/local/hadoop/bin$ hadoop jar wc2.jar wordcount7
/deepika/wc1 /deepika/out2

16/09/16 14:34:16 WARN util.NativeCodeLoader: Unable to load native-hadoop library for


your platform... using builtin-java classes where applicable
16/09/16 14:34:17 INFO Configuration.deprecation: session.id is deprecated. Instead, use
dfs.metrics.session-id
16/09/16 14:34:17 INFO jvm.JvmMetrics: Initializing JVM Metrics with
processName=JobTracker, sessionId=
16/09/16 14:34:17 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing
not performed. Implement the Tool interface and execute your application with ToolRunner
to remedy this.
16/09/16 14:34:17 INFO input.FileInputFormat: Total input paths to process : 1
16/09/16 14:34:17 INFO mapreduce.JobSubmitter: number of splits:1
16/09/16 14:34:18 INFO mapreduce.JobSubmitter: Submitting tokens for job:
job_local364071501_0001
16/09/16 14:34:18 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
16/09/16 14:34:18 INFO mapreduce.Job: Running job: job_local364071501_0001
16/09/16 14:34:18 INFO mapred.LocalJobRunner: OutputCommitter set in config null
16/09/16 14:34:19 INFO mapred.LocalJobRunner: OutputCommitter is
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
16/09/16 14:34:19 INFO mapred.LocalJobRunner: Waiting for map tasks
16/09/16 14:34:19 INFO mapred.LocalJobRunner: Starting task:
attempt_local364071501_0001_m_000000_0
16/09/16 14:34:19 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
16/09/16 14:34:19 INFO mapred.MapTask: Processing split:
hdfs://localhost:54310/deepika/wc1:0+712
16/09/16 14:34:19 INFO mapreduce.Job: Job job_local364071501_0001 running in
uber mode : false
16/09/16 14:34:23 INFO mapreduce.Job: map 0% reduce 0%
16/09/16 14:34:24 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
16/09/16 14:34:24 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
16/09/16 14:34:24 INFO mapred.MapTask: soft limit at 83886080
16/09/16 14:34:24 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
16/09/16 14:34:24 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
16/09/16 14:34:24 INFO mapred.MapTask: Map output collector class =
org.apache.hadoop.mapred.MapTask$MapOutputBuffer
16/09/16 14:34:26 INFO mapred.LocalJobRunner:
16/09/16 14:34:26 INFO mapred.MapTask: Starting flush of map output
16/09/16 14:34:26 INFO mapred.MapTask: Spilling map output
16/09/16 14:34:26 INFO mapred.MapTask: bufstart = 0; bufend = 1079; bufvoid =
104857600
16/09/16 14:34:26 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend =
26214032(104856128); length = 365/6553600
16/09/16 14:34:26 INFO mapred.MapTask: Finished spill 0
16/09/16 14:34:26 INFO mapred.Task: Task:attempt_local364071501_0001_m_000000_0 is
done. And is in the process of committing
16/09/16 14:34:26 INFO mapred.LocalJobRunner: map
16/09/16 14:34:26 INFO mapred.Task: Task 'attempt_local364071501_0001_m_000000_0'
done.
16/09/16 14:34:26 INFO mapred.LocalJobRunner: Finishing task:
attempt_local364071501_0001_m_000000_0
16/09/16 14:34:26 INFO mapred.LocalJobRunner: map task executor complete.
16/09/16 14:34:26 INFO mapred.LocalJobRunner: Waiting for reduce tasks
16/09/16 14:34:26 INFO mapred.LocalJobRunner: Starting task:
attempt_local364071501_0001_r_000000_0
16/09/16 14:34:26 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
16/09/16 14:34:26 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin:
org.apache.hadoop.mapreduce.task.reduce.Shuffle@2ee9ab75
16/09/16 14:34:26 INFO mapreduce.Job: map 100% reduce 0%
16/09/16 14:34:26 INFO reduce.MergeManagerImpl: MergerManager:
memoryLimit=363285696, maxSingleShuffleLimit=90821424, mergeThreshold=239768576,
ioSortFactor=10, memToMemMergeOutputsThreshold=10
16/09/16 14:34:26 INFO reduce.EventFetcher: attempt_local364071501_0001_r_000000_0
Thread started: EventFetcher for fetching Map Completion Events
16/09/16 14:34:26 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map
attempt_local364071501_0001_m_000000_0 decomp: 1014 len: 1018 to MEMORY
16/09/16 14:34:27 INFO reduce.InMemoryMapOutput: Read 1014 bytes from map-output
for attempt_local364071501_0001_m_000000_0
16/09/16 14:34:27 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of
size: 1014, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->1014
16/09/16 14:34:27 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
16/09/16 14:34:27 INFO mapred.LocalJobRunner: 1 / 1 copied.
16/09/16 14:34:27 INFO reduce.MergeManagerImpl: finalMerge called with 1 in-memory
map-outputs and 0 on-disk map-outputs
16/09/16 14:34:27 INFO mapred.Merger: Merging 1 sorted segments
16/09/16 14:34:27 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left
of total size: 991 bytes
16/09/16 14:34:27 INFO reduce.MergeManagerImpl: Merged 1 segments, 1014 bytes to disk
to satisfy reduce memory limit
16/09/16 14:34:27 INFO reduce.MergeManagerImpl: Merging 1 files, 1018 bytes from disk
16/09/16 14:34:27 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from
memory into reduce
16/09/16 14:34:27 INFO mapred.Merger: Merging 1 sorted segments
16/09/16 14:34:27 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left
of total size: 991 bytes
16/09/16 14:34:27 INFO mapred.LocalJobRunner: 1 / 1 copied.
16/09/16 14:34:27 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead,
use mapreduce.job.skiprecords
16/09/16 14:34:30 INFO mapred.Task: Task:attempt_local364071501_0001_r_000000_0 is
done. And is in the process of committing
16/09/16 14:34:30 INFO mapred.LocalJobRunner: 1 / 1 copied.
16/09/16 14:34:30 INFO mapred.Task: Task attempt_local364071501_0001_r_000000_0 is
allowed to commit now
16/09/16 14:34:30 INFO output.FileOutputCommitter: Saved output of task
'attempt_local364071501_0001_r_000000_0' to
hdfs://localhost:54310/deepika/out2/_temporary/0/task_local364071501_0001_r_000000
16/09/16 14:34:30 INFO mapred.LocalJobRunner: reduce > reduce
16/09/16 14:34:30 INFO mapred.Task: Task 'attempt_local364071501_0001_r_000000_0'
done.
16/09/16 14:34:30 INFO mapred.LocalJobRunner: Finishing task:
attempt_local364071501_0001_r_000000_0
16/09/16 14:34:30 INFO mapred.LocalJobRunner: reduce task executor complete.
16/09/16 14:34:30 INFO mapreduce.Job: map 100% reduce 100%
16/09/16 14:34:31 INFO mapreduce.Job: Job job_local364071501_0001 completed
successfully
16/09/16 14:34:31 INFO mapreduce.Job: Counters: 38
File System Counters
FILE: Number of bytes read=8552
FILE: Number of bytes written=507858
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=1424
HDFS: Number of bytes written=724
HDFS: Number of read operations=13
HDFS: Number of large read operations=0
HDFS: Number of write operations=4
Map-Reduce Framework
Map input records=10
Map output records=92
Map output bytes=1079
Map output materialized bytes=1018
Input split bytes=99
Combine input records=92
Combine output records=72
Reduce input groups=72
Reduce shuffle bytes=1018
Reduce input records=72
Reduce output records=72
Spilled Records=144
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=111
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
Total committed heap usage (bytes)=242360320
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=712
File Output Format Counters
Bytes Written=724

INPUT FILE:
wc1.txt
STEPS:
1. Open an editor and type WordCount program and save as WordCount.java
2. Set the path as export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
3. To compile the program, bin/hadoop com.sun.tools.javac.Main WordCount.java
4. Create a jar file, jar cf wc.jar WordCount*.class
5. Create input files input.txt,input1.txt and input2.txt and create a directory in hdfs,
/mit/wordcount/input
6. Move these i/p files to hdfs system, bin/hadoop fs –put /opt/hadoop-2.7.0/input.txt
/mit/wordcount/input/input.txt repeat this step for other two i/p files.
7. To execute, bin/hadoop jar wc.jar WordCount /mit/wordcount/input
/mit/wordcount/output.
8. The mapreduce result will be available in the output directory.

OUTPUT:
/mit/wordcount/input 2
/mit/wordcount/input/input.txt 1
/mit/wordcount/output. 1
/opt/hadoop-2.7.0/input.txt 1
1. 1
2. 1
3. 1
4. 1
5. 1
6. 1
7. 1
8. 1
Create 2
HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar 1
Move 1
Open 1
STEPS: 1
Set 1
The 1
To 2
WordCount 2
WordCount*.class 1
WordCount.java 2
a 2
an 1
and 4
as 2
available 1
be 1
bin/hadoop 3
cf 1
com.sun.tools.javac.Main 1
compile 1
create 1
directory 1
directory. 1
editor 1
execute, 1
export 1
file, 1
files 2
files. 1
for 1
fs 1
hdfs 1
hdfs, 1
i/p 2
in 2
input 1
input.txt,input1.txt 1
input2.txt 1
jar 3
mapreduce 1
other 1
output 1
path 1
program 1
program, 1
repeat 1
result 1
save 1
step 1
system, 1
the 3
these 1
this 1
to 1
two 1
type 1
wc.jar 2
will 1
–put 1
RESULT:

EX.NO: 3 IMPLEMENT AN MR PROGRAM THAT PROCESSES A WEATHER DATASET

DATE:

AIM:

To write a java program for processing a weather dataset.

ALGORITHM:

Step 1: Start the Hadoop daemons and place the weather input file in HDFS.

Step 2: In the mapper, read each day's report and split it into tokens using the tab delimiter.

Step 3: Take the first token as the date; the remaining tokens alternate between a time and a temperature reading.

Step 4: Track the minimum and maximum temperatures together with the times at which they occurred, and write both against the date key.

Step 5: In the reducer, compare the two values received for each date to decide which is the minimum and which is the maximum.

Step 6: Select the output file name (California, Newyork, Newjersy, Austin, Boston, Baltimore) from the station prefix of the key and write the result using MultipleOutputs.

Step 7: Submit the job and wait for its completion.

Step 8: Verify the per-city output files in the HDFS output directory.

PROGRAM CODING:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

/**
* @author devinline
*/
public class CalculateMaxAndMinTemeratureWithTime {
public static String calOutputName = "California";
public static String nyOutputName = "Newyork";
public static String njOutputName = "Newjersy";
public static String ausOutputName = "Austin";
public static String bosOutputName = "Boston";
public static String balOutputName = "Baltimore";

public static class WhetherForcastMapper extends


Mapper<Object, Text, Text, Text> {

public void map(Object keyOffset, Text dayReport, Context con)


throws IOException, InterruptedException {
StringTokenizer strTokens = new StringTokenizer(
dayReport.toString(), "\t");
int counter = 0;
Float currnetTemp = null;
Float minTemp = Float.MAX_VALUE;
Float maxTemp = -Float.MAX_VALUE; // most negative float, so the first reading becomes the max
String date = null;
String currentTime = null;
String minTempANDTime = null;
String maxTempANDTime = null;

while (strTokens.hasMoreElements()) {
if (counter == 0) {
date = strTokens.nextToken();
} else {
if (counter % 2 == 1) {
currentTime = strTokens.nextToken();
} else {
currnetTemp = Float.parseFloat(strTokens.nextToken());
if (minTemp > currnetTemp) {
minTemp = currnetTemp;
minTempANDTime = minTemp + "AND" + currentTime;
}
if (maxTemp < currnetTemp) {
maxTemp = currnetTemp;
maxTempANDTime = maxTemp + "AND" + currentTime;
}
}
}
counter++;
}
// Write to context - MinTemp, MaxTemp and corresponding time
Text temp = new Text();
temp.set(maxTempANDTime);
Text dateText = new Text();
dateText.set(date);
try {
con.write(dateText, temp);
} catch (Exception e) {
e.printStackTrace();
}

temp.set(minTempANDTime);
dateText.set(date);
con.write(dateText, temp);

}
}
public static class WhetherForcastReducer extends
Reducer<Text, Text, Text, Text> {
MultipleOutputs<Text, Text> mos;

public void setup(Context context) {


mos = new MultipleOutputs<Text, Text>(context);
}

public void reduce(Text key, Iterable<Text> values, Context context)


throws IOException, InterruptedException {
int counter = 0;
String reducerInputStr[] = null;
String f1Time = "";
String f2Time = "";
String f1 = "", f2 = "";
Text result = new Text();
for (Text value : values) {

if (counter == 0) {
reducerInputStr = value.toString().split("AND");
f1 = reducerInputStr[0];
f1Time = reducerInputStr[1];
}

else {
reducerInputStr = value.toString().split("AND");
f2 = reducerInputStr[0];
f2Time = reducerInputStr[1];
}

counter = counter + 1;
}
if (Float.parseFloat(f1) > Float.parseFloat(f2)) {

result = new Text("Time: " + f2Time + " MinTemp: " + f2 + "\t"


+ "Time: " + f1Time + " MaxTemp: " + f1);
} else {

result = new Text("Time: " + f1Time + " MinTemp: " + f1 + "\t"


+ "Time: " + f2Time + " MaxTemp: " + f2);
}
String fileName = "";
if (key.toString().substring(0, 2).equals("CA")) {
fileName = CalculateMaxAndMinTemeratureWithTime.calOutputName;
} else if (key.toString().substring(0, 2).equals("NY")) {
fileName = CalculateMaxAndMinTemeratureWithTime.nyOutputName;
} else if (key.toString().substring(0, 2).equals("NJ")) {
fileName = CalculateMaxAndMinTemeratureWithTime.njOutputName;
} else if (key.toString().substring(0, 3).equals("AUS")) {
fileName = CalculateMaxAndMinTemeratureWithTime.ausOutputName;
} else if (key.toString().substring(0, 3).equals("BOS")) {
fileName = CalculateMaxAndMinTemeratureWithTime.bosOutputName;
} else if (key.toString().substring(0, 3).equals("BAL")) {
fileName = CalculateMaxAndMinTemeratureWithTime.balOutputName;
}
String strArr[] = key.toString().split("_");
key.set(strArr[1]); //Key is date value
mos.write(fileName, key, result);
}

@Override
public void cleanup(Context context) throws IOException,
InterruptedException {
mos.close();
}
}

public static void main(String[] args) throws IOException,


ClassNotFoundException, InterruptedException {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "Wheather Statistics of USA");
job.setJarByClass(CalculateMaxAndMinTemeratureWithTime.class);

job.setMapperClass(WhetherForcastMapper.class);
job.setReducerClass(WhetherForcastReducer.class);

job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);

job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);

MultipleOutputs.addNamedOutput(job, calOutputName,
TextOutputFormat.class, Text.class, Text.class);
MultipleOutputs.addNamedOutput(job, nyOutputName,
TextOutputFormat.class, Text.class, Text.class);
MultipleOutputs.addNamedOutput(job, njOutputName,
TextOutputFormat.class, Text.class, Text.class);
MultipleOutputs.addNamedOutput(job, bosOutputName,
TextOutputFormat.class, Text.class, Text.class);
MultipleOutputs.addNamedOutput(job, ausOutputName,
TextOutputFormat.class, Text.class, Text.class);
MultipleOutputs.addNamedOutput(job, balOutputName,
TextOutputFormat.class, Text.class, Text.class);
// FileInputFormat.addInputPath(job, new Path(args[0]));
// FileOutputFormat.setOutputPath(job, new Path(args[1]));
Path pathInput = new Path(
"hdfs://192.168.213.133:54310/weatherInputData/input_temp.txt");
Path pathOutputDir = new Path(
"hdfs://192.168.213.133:54310/user/hduser1/testfs/output_mapred3");
FileInputFormat.addInputPath(job, pathInput);
FileOutputFormat.setOutputPath(job, pathOutputDir);
try {
System.exit(job.waitForCompletion(true) ? 0 : 1);
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}}}

OUTPUT:
Check whether the output directory is in place on HDFS. Execute the following command to verify the
same.

hduser1@ubuntu:/usr/local/hadoop2.6.1/bin$ ./hadoop fs -ls


/user/hduser1/testfs/output_mapred3
Found 8 items
-rw-r--r-- 3 zytham supergroup 438 2015-12-11 19:21
/user/hduser1/testfs/output_mapred3/Austin-r-00000
-rw-r--r-- 3 zytham supergroup 219 2015-12-11 19:21
/user/hduser1/testfs/output_mapred3/Baltimore-r-00000
-rw-r--r-- 3 zytham supergroup 219 2015-12-11 19:21
/user/hduser1/testfs/output_mapred3/Boston-r-00000
-rw-r--r-- 3 zytham supergroup 511 2015-12-11 19:21
/user/hduser1/testfs/output_mapred3/California-r-00000
-rw-r--r-- 3 zytham supergroup 146 2015-12-11 19:21
/user/hduser1/testfs/output_mapred3/Newjersy-r-00000
-rw-r--r-- 3 zytham supergroup 219 2015-12-11 19:21
/user/hduser1/testfs/output_mapred3/Newyork-r-00000
-rw-r--r-- 3 zytham supergroup 0 2015-12-11 19:21
/user/hduser1/testfs/output_mapred3/_SUCCESS
-rw-r--r-- 3 zytham supergroup 0 2015-12-11 19:21
/user/hduser1/testfs/output_mapred3/part-r-00000

hduser1@ubuntu:/usr/local/hadoop2.6.1/bin$ ./hadoop fs -cat


/user/hduser1/testfs/output_mapred3/Austin-r-00000
25-Jan-2018 Time: 12:34:542 MinTemp: -22.3 Time: 05:12:345 MaxTemp: 35.7
26-Jan-2018 Time: 22:00:093 MinTemp: -27.0 Time: 05:12:345 MaxTemp: 55.7
27-Jan-2018 Time: 02:34:542 MinTemp: -22.3 Time: 05:12:345 MaxTemp: 55.7
29-Jan-2018 Time: 14:00:093 MinTemp: -17.0 Time: 02:34:542 MaxTemp: 62.9
30-Jan-2018 Time: 22:00:093 MinTemp: -27.0 Time: 05:12:345 MaxTemp: 49.2
31-Jan-2018 Time: 14:00:093 MinTemp: -17.0 Time: 03:12:187 MaxTemp: 56.0

RESULT:
EX.NO: 4 IMPLEMENT LINEAR AND LOGISTIC REGRESSION

DATE:

AIM:

To implement linear and logistic regression in R studio

PROGRAM CODING: (LINEAR REGRESSION)

The aim of linear regression is to model a continuous variable Y as a mathematical


function of one or more X variable(s), so that we can use this regression model to predict the
Y when only the X is known. This mathematical equation can be generalized as follows:
Y = β1 + β2X + ϵ
For this analysis, we will use the cars dataset that comes with R by default. cars is a
standard built-in dataset, that makes it convenient to demonstrate linear regression in a simple
and easy to understand fashion. You can access this dataset simply by typing in cars in your
R console.
# display the first 6 observations
head(cars)
Scatter Plot:
# scatterplot
scatter.smooth(x=cars$speed, y=cars$dist, main="Dist ~ Speed")

Build Linear Model:


# build linear regression model on full data
linearMod <- lm(dist ~ speed, data=cars)
print(linearMod)
Linear Regression Diagnostics
summary(linearMod)
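
The text above notes that the fitted model can be used to predict Y (dist) when only X (speed) is known; a short illustrative example of doing so with predict() is given below. The new speed values are assumptions chosen only for demonstration.

# Predict stopping distance for a few new speed values (illustrative data)
new_speeds <- data.frame(speed = c(10, 15, 20))
predict(linearMod, newdata = new_speeds)

# The same predictions with confidence intervals
predict(linearMod, newdata = new_speeds, interval = "confidence")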
OUTPUT:
PROGRAM CODING: (LOGISTIC REGRESSION)

The general mathematical equation for logistic regression is


y = 1/(1+e^-(a+b1x1+b2x2+b3x3+...))

The basic syntax for glm() function in logistic regression is


glm(formula,data,family)

The in-built data set "mtcars" describes different models of a car with their various engine
specifications. In "mtcars" data set, the transmission mode (automatic or manual) is described
by the column am which is a binary value (0 or 1). We can create a logistic regression model
between the columns "am" and 3 other columns - hp, wt and cyl.

# Select some columns from mtcars.


input <- mtcars[,c("am","cyl","hp","wt")]
print(head(input))
create regression model:
input <- mtcars[,c("am","cyl","hp","wt")]
am.data = glm(formula = am ~ cyl + hp + wt, data = input, family = binomial)
print(summary(am.data))
OUTPUT:

RESULT:
EX.NO: 5 IMPLEMENT SVM / DECISION TREE CLASSIFICATION TECHNIQUES
DATE:

AIM:

To implement SVM/Decision Tree Classification Techniques

IMPLEMENTATION: (SVM)
To use SVM in R, we have the package e1071. The package is not preinstalled, hence
one needs to run install.packages("e1071") to install the package and then import
the package contents using the library command: library(e1071).

R CODE:
x=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20)
y=c(3,4,5,4,8,10,10,11,14,20,23,24,32,34,35,37,42,48,53,60)

#Create a data frame of the data


train=data.frame(x,y)

#Plot the dataset


plot(train,pch=16)

#Linear regression
model <- lm(y ~ x, train)

#Plot the model using abline


abline(model)

#SVM
library(e1071)

#Fit a model. The function syntax is very similar to lm function


model_svm <- svm(y ~ x , train)

#Use the predictions on the data


pred <- predict(model_svm, train)

#Plot the predictions and the plot to see our model fit
points(train$x, pred, col = "blue", pch=4)

#Linear model has a residuals part which we can extract and directly calculate rmse
error <- model$residuals
lm_error <- sqrt(mean(error^2)) # 3.832974

#For svm, we have to manually calculate the difference between the actual values (train$y)
#and our predictions (pred)
error_2 <- train$y - pred
svm_error <- sqrt(mean(error_2^2)) # 2.696281

# perform a grid search


svm_tune <- tune(svm, y ~ x, data = train,
ranges = list(epsilon = seq(0,1,0.01), cost = 2^(2:9))
)
print(svm_tune)

#Parameter tuning of ‘svm’:

# - sampling method: 10-fold cross validation

#- best parameters:
# epsilon cost
#0 8

#- best performance: 2.872047

#The best model


best_mod <- svm_tune$best.model
best_mod_pred <- predict(best_mod, train)

error_best_mod <- train$y - best_mod_pred

# this value can be different on your computer


# because the tune method randomly shuffles the data
best_mod_RMSE <- sqrt(mean(error_best_mod^2)) # 1.290738

plot(svm_tune)

plot(train,pch=16)
points(train$x, best_mod_pred, col = "blue", pch=4)
OUTPUT :
IMPLEMENTATION:(DECISION TREE)
Use the below command in R console to install the package. You also have to install the
dependent packages if any.
install.packages("party")
The basic syntax for creating a decision tree in R is
ctree(formula, data)
input data:
We will use the R in-built data set named readingSkills to create a decision tree. It contains
the variables "age", "shoeSize", "score" and whether the person is a native speaker; here we
predict nativeSpeaker from the other three variables.
# Load the party package. It will automatically load other dependent packages.
library(party)

# Print some records from data set readingSkills.


print(head(readingSkills))
We will use the ctree() function to create the decision tree and see its graph.
# Load the party package. It will automatically load other dependent packages.
library(party)
# Create the input data frame.
input.dat <- readingSkills[c(1:105),]
# Give the chart file a name.
png(file = "decision_tree.png")
# Create the tree.
output.tree <- ctree(nativeSpeaker ~ age + shoeSize + score, data = input.dat)
# Plot the tree.
plot(output.tree)
# Save the file.
dev.off()
OUTPUT:

RESULT:
EX.NO:6 IMPLEMENT CLUSTERING TECHNIQUES
DATE:

AIM:
To implement clustering Techniques

PROGRAM CODING:

Installing and loading required R packages

install.packages("factoextra")

install.packages("cluster")

install.packages("magrittr")

library("cluster")

library("factoextra")

library("magrittr")
Data preparation

# Load and prepare the data

data("USArrests")

my_data <- USArrests %>%
  na.omit() %>%   # Remove missing values (NA)
  scale()         # Scale variables

# View the first 3 rows

head(my_data, n = 3)
Distance measures

res.dist <- get_dist(USArrests, stand = TRUE, method = "pearson")
fviz_dist(res.dist,
          gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))
PARTITION CLUSTERING:

Determining the optimal number of clusters: use factoextra::fviz_nbclust()


library("factoextra")
fviz_nbclust(my_data, kmeans, method = "gap_stat")

Compute and visualize k-means clustering

set.seed(123)
km.res <- kmeans(my_data, 3, nstart = 25)
# Visualize
library("factoextra")
fviz_cluster(km.res, data = my_data, ellipse.type = "convex", palette = "jco",
             ggtheme = theme_minimal())
MODEL BASED CLUSTERING:

# Load the data

library("MASS") data("geyser")

# Scatter plot
library("ggpubr")
ggscatter(geyser, x = "duration", y = "waiting") +
  geom_density2d()  # Add 2D density

library("mclust")

data("diabetes")

head(diabetes, 3)
Model-based clustering can be computed using the function Mclust() as follow:
library(mclust)
df <- scale(diabetes[, -1]) # Standardize the data
mc <- Mclust(df) # Model-based-clustering
summary(mc) # Print a summary

mc$modelName # Optimal selected model ==> "VVV"

mc$G                        # Optimal number of clusters => 3
head(mc$z, 30)              # Probability to belong to a given cluster
head(mc$classification, 30) # Cluster assignment of each observation
VISUALIZING MODEL-BASED CLUSTERING

library(factoextra)

# BIC values used for choosing the number of clusters

fviz_mclust(mc, "BIC", palette = "jco")

# Classification: plot showing the clustering

fviz_mclust(mc, "classification", geom = "point", pointsize = 1.5, palette = "jco")

# Classification uncertainty

fviz_mclust(mc, "uncertainty", palette = "jco")


RESULT:
EX.NO:7 VISUALIZE DATA USING ANY PLOTTING FRAMEWORK
DATE:

AIM:
To visualize data using a plotting framework in R.

IMPLEMENTATION:

Step 1: Define two vectors, cars and trucks.

Step 2: Using the plot function, specify the y-axis range directly so it will be large enough to fit
the truck data.
Step 3: Plot cars using a y-axis that ranges from 0 to 12.
Step 4: Plot trucks with a red dashed line and square points.
Step 5: Create a title with a red, bold/italic font.
Step 6: Create a legend at (1, g_range[2]) that is slightly smaller (cex) and uses the same line
colors and points used by the actual plots, as in the legend call below; a complete R sketch of
these steps follows it.

legend(1, g_range[2], c("cars","trucks"), cex=0.8, col=c("blue","red"), pch=21:22,
       lty=1:2);
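
The legend call above refers to vectors and a g_range object that are not listed here. The following is a minimal R sketch of Steps 1-6; the cars and trucks values, the title "Autos" and the axis labels are assumptions chosen so that the y-axis runs from 0 to 12 as described in Step 3.

# Step 1: define two vectors (illustrative values)
cars   <- c(1, 3, 6, 4, 9)
trucks <- c(2, 5, 4, 5, 12)

# Step 2: compute a y-axis range large enough to fit the truck data
g_range <- range(0, cars, trucks)

# Step 3: plot cars with a blue solid line using the computed y-axis range
plot(cars, type = "o", col = "blue", ylim = g_range, ann = FALSE)

# Step 4: plot trucks with a red dashed line and square points
lines(trucks, type = "o", pch = 22, lty = 2, col = "red")

# Step 5: create a title with a red, bold/italic font, and label the axes
title(main = "Autos", col.main = "red", font.main = 4)
title(xlab = "Month", ylab = "Total")

# Step 6: legend at (1, g_range[2]), slightly smaller, same colours and points
legend(1, g_range[2], c("cars", "trucks"), cex = 0.8,
       col = c("blue", "red"), pch = 21:22, lty = 1:2)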

RESULT:
EX.NO:8 IMPLEMENT AN APPLICATION THAT STORES BIG DATA IN HBASE /
MONGODB / PIG USING HADOOP / R.
DATE:

AIM:
To implement an application that stores big data in HBase / MongoDB / Pig using Hadoop / R.

PROGRAM CODING:
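
No program is recorded for this exercise. Below is a minimal sketch of one possible approach, storing an R data set in MongoDB with the mongolite package; the database and collection names are illustrative, and a MongoDB server is assumed to be running on localhost.

# Assumptions: the mongolite package is installed and MongoDB is running locally
library(mongolite)

# Connect to an illustrative collection in an illustrative database
m <- mongo(collection = "cars", db = "bigdata", url = "mongodb://localhost")

# Store a built-in R data set as documents in MongoDB
m$insert(mtcars)

# Verify the data was stored and run a simple query
m$count()
head(m$find('{"cyl": 6}'))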

OUTPUT:
RESULT:
