Answers

The document provides a step-by-step guide for setting up a Hadoop environment using Docker on macOS, including pulling the Docker image, running the container, and accessing its shell. It outlines tasks such as HDFS operations, installing and configuring Hive and HBase, running MapReduce jobs, and executing Pig scripts for data analysis. Each task includes specific commands and sub-steps to ensure successful execution within the Docker container environment.

Alright, let's go step-by-step, from pulling and running the Docker container for Hadoop to executing all 8 questions. I'll guide you through every step, explaining which directory you need to be in and the exact commands to run.

1. Pull and Run the Docker Container for Hadoop

First, let's get your Hadoop environment running in Docker (since we’re doing this on macOS).

Step 1.1: Pull the Docker Image

You can build a Hadoop image yourself, but if you're new to Hadoop setup, I recommend using a pre-configured community image such as sequenceiq/hadoop-docker; it will save you time.

docker pull sequenceiq/hadoop-docker:2.7.1

Step 1.2: Run the Docker Container

Now, let's run the container, which will automatically set up a small Hadoop cluster with HDFS.

docker run -it sequenceiq/hadoop-docker:2.7.1 /etc/bootstrap.sh -d

Step 1.3: Access the Container’s Shell

Once the container is up, open a new terminal on your Mac and attach a shell to the running container:

docker exec -it <container_id> bash
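
If you don't have the container ID handy, list the running containers first; the value in the CONTAINER ID column is what you pass to docker exec:

docker ps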

2. Perform the Hadoop Tasks (Questions)

Let’s break down each task one by one now that your Docker container is up and running.

Question 1: HDFS Operations

Sub-steps:

1. Create directories in HDFS: Use the following command to create a directory in HDFS:

hdfs dfs -mkdir /user/<username>/mydir

2. Upload files from local to HDFS: To upload a file from your local system to HDFS:

hdfs dfs -put /localpath/filename /user/<username>/mydir

3. Display the file: To display the content of the file you uploaded:

hdfs dfs -cat /user/<username>/mydir/filename

4. Delete the file: To delete the file from HDFS:

hdfs dfs -rm /user/<username>/mydir/filename
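
As a quick sanity check, here is the same sequence with concrete made-up names; the user root and the local file /tmp/sample.txt are placeholders, so substitute whatever actually exists in your container:

hdfs dfs -mkdir -p /user/root/mydir      # -p also creates missing parent directories

echo "hello hdfs" > /tmp/sample.txt      # create a small local test file

hdfs dfs -put /tmp/sample.txt /user/root/mydir

hdfs dfs -cat /user/root/mydir/sample.txt

hdfs dfs -rm /user/root/mydir/sample.txt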

Question 2: Install Hive and Configure with Hadoop


Hive allows you to query HDFS data using SQL-like syntax. In your Docker container, you may need to
install Hive and configure it.

Sub-steps:

1. Install Hive (if not installed):

If Hive is not already installed in your Docker container, follow the installation steps:

# Download Hive

wget http://apache.mirror.digitalpacific.com.au/hive/hive-2.3.7/apache-hive-2.3.7-bin.tar.gz

tar -xvzf apache-hive-2.3.7-bin.tar.gz

mv apache-hive-2.3.7-bin /opt/hive

2. Configure Hive with Hadoop: Set HADOOP_HOME and HIVE_HOME (adjust HADOOP_HOME to wherever Hadoop actually lives in your container; run which hadoop or echo $HADOOP_HOME to check):

export HADOOP_HOME=/opt/hadoop

export HIVE_HOME=/opt/hive

export PATH=$HIVE_HOME/bin:$PATH
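
With Hive 2.x you also need to initialize the metastore schema once before the first start. Assuming the default embedded Derby metastore (no external database configured), run:

schematool -dbType derby -initSchema

Derby creates a metastore_db folder in the current working directory, so run this (and later the hive command) from a directory where that is acceptable.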

3. Start Hive:

To start Hive in the container:

hive

Then create a Hive table, query it, and calculate an average.

CREATE TABLE my_table (id INT, name STRING);

INSERT INTO my_table VALUES (1, 'Hemanth'), (2, 'John');

SELECT * FROM my_table;

SELECT AVG(id) FROM my_table;
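
If you also want to query a file that is already sitting in HDFS (usually the point of pairing Hive with Hadoop), a minimal sketch looks like this. The table name people, the path /user/root/mydir, and the comma-separated id,name layout are all assumptions to adjust to your own data:

hive -e "
CREATE EXTERNAL TABLE people (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/root/mydir';
SELECT AVG(id) FROM people;
"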

Question 3: Word Count Mapper Program

The code I gave you earlier in WordCountMapper.java, WordCountReducer.java, and WordCountDriver.java should be compiled and executed.

Sub-steps:

1. Compile Java files:

javac -classpath `hadoop classpath` -d /path/to/output/ WordCountMapper.java WordCountReducer.java WordCountDriver.java

2. Create JAR file:

jar -cvf wordcount.jar -C /path/to/output/ .

3. Run the MapReduce job (assuming your input file is in HDFS):

hadoop jar wordcount.jar WordCountDriver /input_path /output_path
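
Two practical details: MapReduce refuses to start if the output directory already exists, and if your driver class is declared inside a package you must pass its fully qualified name to hadoop jar. A rough end-to-end sketch, assuming the three .java files are in your current directory inside the container (the classes directory name is an arbitrary choice):

mkdir -p classes

javac -classpath `hadoop classpath` -d classes WordCountMapper.java WordCountReducer.java WordCountDriver.java

jar -cvf wordcount.jar -C classes .

hdfs dfs -rm -r -f /output_path      # clear any previous output so the job can start

hadoop jar wordcount.jar WordCountDriver /input_path /output_path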

Question 4: Run Word Count MapReduce (Steps)

This is more about the process than just the code. Follow these steps:

1. Format HDFS:

Format HDFS before running MapReduce. Note: in the pre-configured sequenceiq container, HDFS is already formatted and the daemons are already running, so you can normally skip steps 1 and 2 there; they apply when you have set Hadoop up yourself.

hdfs namenode -format

2. Start Hadoop Services:

Start the services necessary for Hadoop and HDFS:

start-dfs.sh

start-yarn.sh

3. Create an Input Folder in HDFS:

Create the directory where your input data will reside:

hdfs dfs -mkdir /input
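
The job needs something to count, so upload at least one text file into that folder before running it; the local path /tmp/sample.txt below is just a placeholder for any file in the container:

hdfs dfs -put /tmp/sample.txt /input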

4. Compile and Execute the Job:

The compilation and execution steps are the same as in Question 3:

hadoop jar wordcount.jar WordCountDriver /input /output

5. View Results:

After running the job, view the output:

hdfs dfs -cat /output/part-r-00000

6. Clean Up (Remove Directories):

Delete the input and output directories:

hdfs dfs -rm -r /input /output

Question 5: Upload File to HDFS and Run WordCount MapReduce

1. Upload File to HDFS:

Same as in Question 1:

hdfs dfs -put /localpath/filename /user/<username>/input

2. Compile and Execute WordCount Program:

Again, follow the steps in Question 4 to run the job, making sure the input path you pass to hadoop jar matches the directory you uploaded the file to.

3. Verify Output:

hdfs dfs -cat /output/part-r-00000

Question 6: Install HBase and Create Table

HBase is a distributed NoSQL database that runs on top of HDFS.

1. Install HBase:

If HBase is not installed, install it like this:

wget https://downloads.apache.org/hbase/2.4.9/hbase-2.4.9-bin.tar.gz

tar -xvzf hbase-2.4.9-bin.tar.gz

mv hbase-2.4.9 /opt/hbase

2. Configure HBase:

Set up HBase by configuring hbase-site.xml.
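
As a minimal sketch of that configuration, assuming a simple standalone setup that keeps HBase data on the container's local filesystem (the paths below are arbitrary choices, not requirements), you could write the file and start HBase like this:

cat > /opt/hbase/conf/hbase-site.xml <<'EOF'
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///opt/hbase/data</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/opt/hbase/zookeeper</value>
  </property>
</configuration>
EOF

/opt/hbase/bin/start-hbase.sh      # requires JAVA_HOME to be set (export it or set it in conf/hbase-env.sh)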

3. Create a Table and Insert Data:

hbase shell

create 'employee', 'info'

put 'employee', '1', 'info:name', 'John'

put 'employee', '2', 'info:name', 'Alice'

scan 'employee'

4. Update and Delete Data (deleteall removes the whole row; delete would require naming a specific column):

put 'employee', '1', 'info:name', 'Updated John'

deleteall 'employee', '2'

scan 'employee'

Question 7: Pig Latin Script for Max Temperature

Pig is a high-level platform for Hadoop. Here's a simple Pig script to find the maximum temperature
per year:

1. Create Pig Script (max_temp.pig):

data = LOAD '/data/temperature.csv' USING PigStorage(',') AS (year:int, temp:int);

grouped = GROUP data BY year;

max_temp = FOREACH grouped GENERATE group AS year, MAX(data.temp) AS max_temp;

DUMP max_temp;

2. Run the Script:

pig max_temp.pig
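
If the input file does not exist yet, create a tiny sample and upload it to HDFS before running the script; the values below are made up to match the (year, temp) schema in the LOAD statement:

printf "1950,38\n1950,41\n1951,36\n" > /tmp/temperature.csv

hdfs dfs -mkdir -p /data

hdfs dfs -put /tmp/temperature.csv /data/temperature.csv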

Question 8: Pig Latin Script for Filtering, Grouping, and Revenue Calculation

1. Create Pig Script (revenue_analysis.pig):

data = LOAD '/data/sales.csv' USING PigStorage(',') AS (product:chararray, revenue:int);

filtered = FILTER data BY revenue > 1000;

grouped = GROUP filtered BY product;

revenue = FOREACH grouped GENERATE group AS product, SUM(filtered.revenue) AS total_revenue;

DUMP revenue;

2. Run the Script:

pig revenue_analysis.pig
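
If you just want to test the script logic without touching HDFS, Pig's local mode reads the same paths from the local filesystem instead:

pig -x local revenue_analysis.pig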

Directory Information:

• You don't need to change directories for each of these commands in HDFS. You'll mostly be using absolute paths or relative paths under /user/<username>/.

• Just ensure you're in the right local directory, where your input files are located and where you want to store outputs.

Final Thoughts

This should give you step-by-step guidance for the 8 questions, starting from Docker setup to
Hadoop commands. You can execute all these tasks in the Docker container environment.
