Hadoop
Hadoop is a framework for running applications on large clusters built of commodity hardware. The Hadoop framework transparently provides applications with both reliability and data motion. Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both Map/Reduce and the distributed file system are designed so that node failures are automatically handled by the framework. (Source: Hadoop wiki)
HDFS
Hadoop's Distributed File System is designed to reliably store very large files across machines in a large cluster. It is inspired by the Google File System. HDFS stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. Blocks belonging to a file are replicated for fault tolerance. The block size and replication factor are configurable per file. Files in HDFS are "write once" and have strictly one writer at any time.
Hadoop Distributed File System goals:
Store large data sets
Cope with hardware failure
Emphasize streaming data access
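As an illustration of per-file replication and block layout, the sketch below uses the old bin/hadoop dfs shell that this tutorial's Hadoop version ships with; the path /user/hadoop/big.txt is only a hypothetical example.
--------------------------------
# change the replication factor of one file to 3 and wait for it to take effect
bin/hadoop dfs -setrep -w 3 /user/hadoop/big.txt
# list the file's blocks and their locations to see how it is split and replicated
bin/hadoop fsck /user/hadoop/big.txt -files -blocks -locations
--------------------------------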
Map Reduce
The Hadoop Map/Reduce framework harnesses a cluster of machines and executes user-defined Map/Reduce jobs across the nodes in the cluster. A Map/Reduce computation has two phases, a map phase and a reduce phase. The input to the computation is a data set of key/value pairs. Tasks in each phase are executed in a fault-tolerant manner; if nodes fail in the middle of a computation, the tasks assigned to them are re-distributed among the remaining nodes. Having many map and reduce tasks enables good load balancing and allows failed tasks to be re-run with small runtime overhead.
Hadoop Map/Reduce goals:
Process large data sets
Cope with hardware failure
High throughput
http://labs.google.com/papers/mapreduce.html
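To make the two phases concrete, here is a local, single-machine analogy of the classic word-count job using ordinary Unix pipes. This is not Hadoop itself, just an illustration (input.txt is a hypothetical file): tr plays the role of the map phase (emit one key per word), sort plays the role of the shuffle that groups identical keys, and uniq -c plays the role of the reduce phase (sum the counts per key).
--------------------------------
# "map": split the input into words, one per line (key = word)
# "shuffle": sort brings identical keys together
# "reduce": uniq -c counts the occurrences of each key
tr -s '[:space:]' '\n' < input.txt | sort | uniq -c
--------------------------------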
Architecture
Like Hadoop Map/Reduce, HDFS follows a master/slave architecture. An HDFS installation consists of a single Namenode, a master server that manages the filesystem namespace and regulates access to files by clients. In addition, there are a number of Datanodes, one per node in the cluster, which manage storage attached to the nodes that they run on. The Namenode makes filesystem namespace operations such as opening, closing, and renaming files and directories available via an RPC interface. It also determines the mapping of blocks to Datanodes. The Datanodes are responsible for serving read and write requests from filesystem clients; they also perform block creation, deletion, and replication upon instruction from the Namenode.
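For example, everyday client commands like the ones sketched below translate into namespace operations handled by the Namenode, while the actual file bytes are streamed to and from the Datanodes (the paths are hypothetical):
--------------------------------
# namespace operations (create, rename, list) go through the Namenode
bin/hadoop dfs -mkdir /user/hadoop/demo
bin/hadoop dfs -mv /user/hadoop/demo /user/hadoop/demo-renamed
bin/hadoop dfs -ls /user/hadoop
# reading a file: block locations come from the Namenode, data from the Datanodes
bin/hadoop dfs -cat /user/hadoop/demo-renamed/somefile.txt
--------------------------------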
Configurations
Files to configure:
hadoop-env.sh
Open the file <HADOOP_INSTALL>/conf/hadoop-env.sh in the editor of your choice and set the JAVA_HOME environment variable to the Sun JDK/JRE 1.5.0 directory.
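For example, the line to add or uncomment looks like the following; the JDK path shown is only an assumption for a typical Ubuntu install with Sun Java 1.5, so use whatever path your JDK actually lives at.
--------------------------------
# <HADOOP_INSTALL>/conf/hadoop-env.sh
# The java implementation to use (example path; adjust to your system)
export JAVA_HOME=/usr/lib/jvm/java-1.5.0-sun
--------------------------------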
hadoop-site.xml
Any site-specific configuration of Hadoop goes in <HADOOP_INSTALL>/conf/hadoop-site.xml. Here we configure the directory where Hadoop will store its data files, the ports it listens on, and so on. You can leave the settings below as they are, with the exception of the hadoop.tmp.dir variable, which you have to change to a directory of your choice, for example /usr/local/hadoop-datastore/hadoop-${user.name}.
----------------------------------------------------------------------
<property>
  <name>hadoop.tmp.dir</name>
  <value>/your/path/to/hadoop/tmp/dir/hadoop-${user.name}</value>
  <description>A base for other temporary directories.</description>
</property>
----------------------------------------------------------------------
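A complete single-node hadoop-site.xml typically also sets the HDFS and JobTracker addresses and the default replication. The snippet below is only a sketch based on the single-node tutorial this material follows: localhost, the ports 54310/54311, and the replication factor of 1 are assumptions you can change.
--------------------------------
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system.</description>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs at.</description>
</property>

<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication (1 on a single-node setup).</description>
</property>
--------------------------------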
Starting cluster:
Run the command:
hadoop@ubuntu:~$ <HADOOP_INSTALL>/bin/start-all.sh
This will start up a Namenode, a Datanode, a Jobtracker and a Tasktracker on your machine.
Stopping cluster:
To stop all the daemons running on your machine, run the command:
hadoop@ubuntu:~$ <HADOOP_INSTALL>/bin/stop-all.sh
Configurations
Now we will modify the Hadoop configuration to make one Ubuntu box the master (which will also act as a slave) and the other Ubuntu box a slave. We will call the designated master machine just the master from now on and the slave-only machine the slave. Both machines must be able to reach each other over the network. Shut down each single-node cluster with <HADOOP_INSTALL>/bin/stop-all.sh before continuing if you haven't done so already.
Files to configure:
conf/masters (master only)
The conf/masters file defines the master nodes of our multi-node cluster. In our case, this is just the master machine. On master, update <HADOOP_INSTALL>/conf/masters so that it looks like this:
--------------------------------
master
--------------------------------
conf/slaves (master only)
The conf/slaves file lists the hosts, one per line, where the Hadoop slave daemons (datanodes and tasktrackers) will run. We want both the master box and the slave box to act as Hadoop slaves, because we want both of them to store and process data. On master, update <HADOOP_INSTALL>/conf/slaves so that it looks like this:
--------------------------------
master
slave
--------------------------------
If you have additional slave nodes, just add them to the conf/slaves file, one per line.
conf/hadoop-site.xml (all machines)
Assuming you configured conf/hadoop-site.xml on each machine as described in the single-node cluster tutorial, you will only have to change a few variables. Important: you have to change conf/hadoop-site.xml on ALL machines as follows.
First, we have to change the fs.default.name variable, which specifies the NameNode (the HDFS master) host and port. In our case, this is the master machine.
--------------------------------
<property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310</value>
  <description>The name of the default file system. . .</description>
</property>
--------------------------------
Second, we have to change the mapred.job.tracker variable which specifies the JobTracker (MapReduce master) host and port. Again, this is the master in our case.
--------------------------------
<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>
  <description>The host and port that the MapReduce job tracker runs at. . .</description>
</property>
--------------------------------
Third, we change the dfs.replication variable, which specifies the default block replication. It defines how many machines a single file should be replicated to before it becomes available. If you set this to a value higher than the number of slave nodes that you have available, you will start seeing a lot of errors in the log files.
--------------------------------
<property>
  <name>dfs.replication</name>
  <value>2</value>
  <description>Default block replication. . .</description>
</property>
--------------------------------
Starting the HDFS daemons:
Run the command <HADOOP_INSTALL>/bin/start-dfs.sh on the machine you want the namenode to run on. This will bring up HDFS with the namenode running on that machine, and datanodes on the machines listed in the conf/slaves file. In our case, we will run bin/start-dfs.sh on master:
--------------------------------
bin/start-dfs.sh
--------------------------------
On slave, you can examine the success or failure of this command by inspecting the log file <HADOOP_INSTALL>/logs/hadoop-hadoop-datanode-slave.log. At this point, the following Java processes should run on master:
--------------------------------
hadoop@master:/usr/local/hadoop$ jps
14799 NameNode
15314 Jps
14880 DataNode
14977 SecondaryNameNode
--------------------------------
Starting the MapReduce daemons:
Run the command <HADOOP_INSTALL>/bin/start-mapred.sh on the machine you want the jobtracker to run on. This will bring up the MapReduce cluster with the jobtracker running on the machine you ran the previous command on, and tasktrackers on the machines listed in the conf/slaves file.
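In our case, we run bin/start-mapred.sh on master. As a rough check (the list below is only what to expect, not captured output, and PIDs will differ), jps on master should now also show a JobTracker and a TaskTracker alongside the HDFS daemons:
--------------------------------
bin/start-mapred.sh
hadoop@master:/usr/local/hadoop$ jps
# expected process names now include: NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker, Jps
--------------------------------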
Stopping the MapReduce daemons:
Run the command <HADOOP_INSTALL>/bin/stop-mapred.sh on the jobtracker machine. This will shut down the MapReduce cluster by stopping the jobtracker daemon running on the machine you ran the previous command on, and tasktrackers on the machines listed in the conf/slaves file. In our case, we will run bin/stop-mapred.sh on master:
--------------------------------
bin/stop-mapred.sh
--------------------------------
At this point, the following Java processes should run on master:
--------------------------------
hadoop@master:/usr/local/hadoop$ jps
14799 NameNode
18386 Jps
14880 DataNode
14977 SecondaryNameNode
--------------------------------
Stopping the HDFS daemons:
Run the command <HADOOP_INSTALL>/bin/stop-dfs.sh on the namenode machine. This will shut down HDFS by stopping the namenode daemon running on the machine you ran the previous command on, and datanodes on the machines listed in the conf/slaves file. In our case, we will run bin/stop-dfs.sh on master:
--------------------------------
bin/stop-dfs.sh
--------------------------------
At this point, only the following Java processes should run on master:
--------------------------------
hadoop@master:/usr/local/hadoop$ jps
18670 Jps
--------------------------------
Running a MapReduce job:
Download an ebook from Project Gutenberg as a plain-text file in us-ascii encoding and store the uncompressed file in a temporary directory of your choice, for example /tmp/gutenberg.
Restart the Hadoop cluster if it is not running already:
hadoop@ubuntu:~$ <HADOOP_INSTALL>/bin/start-all.sh
Before we run the actual MapReduce job, we first have to copy the files from our local file system to Hadoop's HDFS, as sketched below.
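The exact commands depend on your Hadoop release; the sketch below assumes the old bin/hadoop dfs shell and the bundled WordCount example job, with hadoop-*-examples.jar standing in for whatever the examples jar is called in your release, and gutenberg / output as the HDFS input and output directory names used in this tutorial.
--------------------------------
# copy the downloaded ebook(s) from the local file system into HDFS
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -copyFromLocal /tmp/gutenberg gutenberg
# check that the files arrived
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls gutenberg
# run the WordCount example job; results go to the HDFS directory "output"
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop-*-examples.jar wordcount gutenberg output
--------------------------------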
To inspect the file, you can copy it from HDFS to the local file system:
--------------------------------
hadoop@ubuntu:/usr/local/hadoop$ mkdir /tmp/output
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -copyToLocal output/part-00000 /tmp/output
--------------------------------
Alternatively, you can read the file directly from HDFS without copying it to the local file system by using the command:
--------------------------------
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -cat output/part-00000
--------------------------------
Bibliography
http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(SingleNode_Cluster)#Running_a_MapReduce_job
http://wiki.apache.org/hadoop/