Day 7

Path for Hadoop MapReduce

/usr/lib/hadoop-mapreduce

MR Framework
Initially, MapReduce jobs could only be written in Java.
Hadoop Streaming → Python, shell, C/C++, Ruby, etc.

Steps to execute a MapReduce job


hadoop jar <streaming jar path> -files <mapper.py>,<reducer.py> -mapper <mapper.py> -reducer <reducer.py> -input <hdfs path> -output <hdfs path>
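The same invocation can be assembled programmatically. This is a hypothetical helper, not an official Hadoop API — it only builds the argument list that mirrors the command above (all paths and URIs are caller-supplied placeholders):

```python
# Hypothetical helper that assembles the generic streaming invocation above.
# Nothing here is a Hadoop API; it only builds the argument list for the shell.
def streaming_command(jar, mapper, reducer, input_uri, output_uri):
    mapper_name = mapper.rsplit("/", 1)[-1]    # scripts run by basename on the nodes
    reducer_name = reducer.rsplit("/", 1)[-1]
    return [
        "hadoop", "jar", jar,
        "-files", f"{mapper},{reducer}",       # ship both scripts to the cluster
        "-mapper", mapper_name,
        "-reducer", reducer_name,
        "-input", input_uri,
        "-output", output_uri,
    ]
```

Passing the resulting list to `subprocess.run` would launch the job, assuming `hadoop` is on the PATH.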

Hadoop Streaming is a utility that comes with Hadoop and enables you to develop MapReduce executables in languages other than Java.
Streaming is implemented as a jar file, so you can run it from EMR or the command line just like a standard jar file.
Streaming lets you take advantage of the benefits of MapReduce while using any scripting language you like.

TO SEE WHERE THE JAR FILES ARE IN HADOOP


cd /usr/lib/hadoop-mapreduce
Map code
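The notes reference a `mapper.py` but do not include its source. As a sketch, a word-count mapper for Hadoop Streaming could look like this (the word-count task itself is an assumption, not from the notes; Streaming feeds raw input lines to the script's stdin and collects tab-separated key/value pairs from its stdout):

```python
#!/usr/bin/env python3
# Hypothetical mapper.py for Hadoop Streaming (word count assumed).
# Reads raw text lines from stdin and emits "word<TAB>1" pairs on stdout.
import sys

def map_line(line):
    """Yield (word, 1) pairs for one input line."""
    for word in line.strip().split():
        yield word.lower(), 1

if __name__ == "__main__":
    for line in sys.stdin:
        for word, count in map_line(line):
            print(f"{word}\t{count}")
```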
HOW TO RUN MAP REDUCE CODE USING HADOOP STREAMING LIBRARY
================================================
Make an mr-demo directory
ls
Copy the code files from the S3 bucket's code directory to EMR here
aws s3 cp <s3 code file uri of mapper> <destination> → AWS CLI used
aws s3 cp <s3 code file uri of reducer> <destination>
chmod +x <mapcode.py> → making the file executable
chmod +x <reducercode.py>
The input data is in a data.txt file in S3
Linux commands:
hadoop jar \
/usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-files /home/hadoop/mr-demo/mapper.py,/home/hadoop/mr-demo/reducer.py \
-mapper mapper.py \
-reducer reducer.py \
-input s3://vita-24-artefact/data/ \
-output s3://vita-24-artefact/output/output_class_demo

example
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -files /home/hadoop/mr-demo/mapper.py,/home/hadoop/mr-demo/reducer.py -mapper mapper.py -reducer reducer.py -input <data.txt s3 uri> -output <output folder s3 uri>
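The `reducer.py` referenced in the command is likewise not included in the notes. A sketch, again assuming the word-count example: Hadoop Streaming sorts the mapper output by key before it reaches the reducer, so all lines for one word arrive consecutively and can be summed with a single running counter:

```python
#!/usr/bin/env python3
# Hypothetical reducer.py for Hadoop Streaming (word count assumed).
# Streaming delivers mapper output sorted by key, so all "word<TAB>count"
# lines for one word arrive consecutively.
import sys

def reduce_sorted(lines):
    """Sum counts for consecutive lines sharing the same key."""
    current_word, current_count = None, 0
    for line in lines:
        word, _, count = line.strip().partition("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                yield current_word, current_count
            current_word, current_count = word, int(count)
    if current_word is not None:
        yield current_word, current_count

if __name__ == "__main__":
    for word, total in reduce_sorted(sys.stdin):
        print(f"{word}\t{total}")
```

Locally, the whole pipeline can be rehearsed without a cluster as `cat data.txt | ./mapper.py | sort | ./reducer.py`, which mimics what Streaming does on EMR.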

Cloud Storage as HDFS


S3 can act like HDFS.
Pros: availability → once we put data in S3, it is AWS's job to protect it (durable to 99.999999999%, eleven nines).
Even when the EMR cluster is terminated, the data is safe on S3.
Cons: slow compared to HDFS → data from S3 travels over the network, so transfer is slower than HDFS, which is local to the EMR cluster.
S3 → AWS
Bucket → GCP
Blob → Azure

To put data from Linux to an S3 bucket


hdfs dfs -put <linux file path> <s3 folder uri path>

To get data from S3 to HDFS


hdfs dfs -cp <s3 uri> <hdfs path>
Hive with S3
In EMR

In Hive commands:
CREATE EXTERNAL TABLE
Student ( …… )
LOCATION 's3://vita/data/student';
SELECT * FROM Student;
