Hadoop Lab Notes
Nicola Tonellotto
November 15, 2010
Contents
1 Hadoop Setup
  1.1 Prerequisites
  1.2 Installation
  1.3 Verification
1 Hadoop Setup
1.1 Prerequisites
1. A GNU/Linux computer
2. Java 1.6 SDK installed
3. SSH must be installed and SSHD must be running
1.2 Installation
1. Create the hadoop user account and log in as the hadoop user.
2. Download hadoop-0.20.2.tar.gz into your home dir.
3. Unpack the downloaded Hadoop distribution in your home dir.
4. Check that you can ssh to localhost without a passphrase:
↑Code
hadoop@localhost$ ssh localhost
↓Code
If you cannot ssh to localhost without a passphrase, execute the following commands:
↑Code
hadoop@localhost$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
hadoop@localhost$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
↓Code
5. Move into the unpacked Hadoop distribution:
↑Code
hadoop@localhost$ cd $HOME/hadoop-0.20.2
↓Code
6. Set the HADOOP_HOME environment variable:
↑Code
hadoop@localhost$ export HADOOP_HOME=`pwd`
↓Code
7. Edit the file $HADOOP_HOME/conf/hadoop-env.sh to define at least JAVA_HOME to be the root of your Java installation.
8. Try the following command:
↑Code
hadoop@localhost$ bin/hadoop
↓Code
This will display the usage documentation for the hadoop script.
1.3 Verification
1. By default, Hadoop is configured to run in a non-distributed mode (standalone mode), as a single
Java process. This is useful for debugging.
2. The following example copies the unpacked conf directory to use as input and then finds and displays every match of the given regular expression. Output is written to the given output directory.
↑Code
hadoop@localhost$ mkdir input
hadoop@localhost$ cp conf/*.xml input
hadoop@localhost$ bin/hadoop jar hadoop-0.20.2-examples.jar grep input output 'dfs[a-z.]+'
hadoop@localhost$ cat output/*
↓Code
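The grep example scans every input file and emits each match of the regular expression dfs[a-z.]+. As a minimal plain-Java sketch of the same matching logic (no Hadoop involved; the class and method names here are illustrative, not part of any Hadoop API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GrepDemo {
    // Collect every substring of the text that matches the pattern,
    // mirroring what the Hadoop grep example emits per input line.
    static List<String> findMatches(String text, String regex) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find())
            matches.add(m.group());
        return matches;
    }

    public static void main(String[] args) {
        String line = "<name>dfs.replication</name> and dfs.data.dir";
        // prints [dfs.replication, dfs.data.dir]
        System.out.println(findMatches(line, "dfs[a-z.]+"));
    }
}
```

In the real job, the matches from each mapper are counted and the counts are summed in the reducer before being written to the output directory.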
3. Clean up:
↑Code
hadoop@localhost$ rm -r input output
↓Code
4. Hadoop can also be run on a single node in a pseudo-distributed mode, where each Hadoop daemon runs in a separate Java process.
5. Edit the conf/core-site.xml file:
↑Code
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
↓Code
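Hadoop configuration files are plain name/value XML like the fragment above. As an illustrative sketch of how such a property could be read back with the JDK's DOM parser (Hadoop itself uses its own Configuration class for this; ConfDemo and lookup are made-up names):

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class ConfDemo {
    // Look up the value of a named property in a Hadoop-style
    // <configuration> XML document; returns null if absent.
    static String lookup(String xml, String name) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        NodeList props = doc.getElementsByTagName("property");
        for (int i = 0; i < props.getLength(); i++) {
            Element p = (Element) props.item(i);
            String n = p.getElementsByTagName("name").item(0).getTextContent();
            if (n.equals(name))
                return p.getElementsByTagName("value").item(0).getTextContent();
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<configuration><property><name>fs.default.name</name>"
                   + "<value>hdfs://localhost:9000</value></property></configuration>";
        System.out.println(lookup(xml, "fs.default.name")); // prints hdfs://localhost:9000
    }
}
```

fs.default.name tells every Hadoop client and daemon where the NameNode listens; the hdfs:// scheme is what switches the default filesystem from the local disk to HDFS.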
6. Edit the conf/hdfs-site.xml file:
↑Code
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
↓Code
7. Edit the conf/mapred-site.xml file:
↑Code
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>
↓Code
8. Format a new distributed filesystem:
↑Code
hadoop@localhost$ bin/hadoop namenode -format
↓Code
9. Start the Hadoop daemons:
↑Code
hadoop@localhost$ bin/start-all.sh
↓Code
10. Browse the web interface for the NameNode and the JobTracker; by default they are available at:
• NameNode - http://localhost:50070
• JobTracker - http://localhost:50030
11. Copy the input files into the distributed filesystem:
↑Code
hadoop@localhost$ bin/hadoop fs -put conf input
↓Code
12. Run one of the provided examples:
↑Code
hadoop@localhost$ bin/hadoop jar hadoop-0.20.2-examples.jar grep input output 'dfs[a-z.]+'
↓Code
13. Copy the output files from the distributed filesystem to the local filesystem and examine them:
↑Code
hadoop@localhost$ bin/hadoop fs -get output output
hadoop@localhost$ cat output/*
↓Code
14. Clean up:
↑Code
hadoop@localhost$ rm -r output
hadoop@localhost$ bin/hadoop fs -rmr input output
↓Code
15. When you are done, stop the daemons:
↑Code
hadoop@localhost$ bin/stop-all.sh
↓Code
• Write the WordCount application in WordCount.java:
↑Code
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class NewMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Emit (word, 1) for every token in the input value
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class NewReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    // Sum the counts emitted for each word
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values)
        sum += val.get();
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setJarByClass(WordCount.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(NewMapper.class);
    job.setCombinerClass(NewReducer.class);
    job.setReducerClass(NewReducer.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
↓Code
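The NewMapper/NewReducer dataflow can be sketched in plain Java without a cluster: tokenize each line (map), then group and sum the ones per word (shuffle and reduce). LocalWordCount below is an illustrative toy, not part of the Hadoop API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

public class LocalWordCount {
    // Emulate the MapReduce dataflow in memory: tokenize every line
    // (the map step), then group by word and sum the emitted ones
    // (the shuffle and reduce steps collapsed into a HashMap merge).
    static Map<String, Integer> count(String[] lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                counts.merge(itr.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> c = count(new String[] {"hello world", "hello hadoop"});
        System.out.println(c.get("hello")); // prints 2
    }
}
```

The cluster version differs only in where the grouping happens: Hadoop sorts and routes the (word, 1) pairs between machines so each reducer sees all values for one key, which is what the Iterable<IntWritable> in the reduce method represents.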
• Compile the application and package it into a jar:
↑Code
hadoop@localhost$ cd ${HADOOP_HOME}
hadoop@localhost$ mkdir classes
hadoop@localhost$ javac -cp hadoop-0.20.2-core.jar -d classes WordCount.java
hadoop@localhost$ jar -cvf wordcount.jar -C classes/ .
↓Code
• Run the application:
↑Code
hadoop@localhost$ bin/hadoop jar wordcount.jar WordCount input output
↓Code
• Examine the output files:
↑Code
hadoop@localhost$ bin/hadoop fs -cat output/part-r-00000
↓Code
• Clean up:
↑Code
hadoop@localhost$ bin/hadoop fs -rmr input output
↓Code