
Big Data Analytics

Class Activity and Assignment 2

Objectives

• Access HDFS and learn how to transfer files


• Run MapReduce code after installing Python, MRJob, and nano

Introduction:

Accessing HDFS:

Let's go ahead and access HDFS on the Hortonworks Sandbox running on your PC. It's not that hard. There are two ways to do it:

1) Access Using Ambari UI


2) Access Using Putty

Access Using Ambari UI

Click on HDFS in the left menu bar. You can see that we're actually running a name node and a data node, both obviously on the same virtual machine, but on a real cluster those would of course be on different computers.

So we're going to be talking to our name node and our data node, in a virtual sense, while we're playing with HDFS. Working with it from the UI is very simple. Click on Files View from the icon circled at the top, and the following screen will appear. This is the entire HDFS file system running on our little Hadoop cluster; a cluster of one virtual machine, such as it is, but it would work the same way on a real cluster.

So we can, for example, click on user, and the following screen will appear.

Then click on maria_dev to see maria_dev's home directory.


Now let's go ahead and create a folder for our MovieLens data. We can create a new folder just by clicking on New Folder, and we'll call it ml-100k, just like that.

Remember, that's the one hundred thousand ratings data set from the MovieLens web site. Under the hood, this is like talking to the name node and saying we're going to create a new directory, and the name node saying OK, here you go, your directory exists.

So we can go into that folder and hit the Upload button to upload files from our hard drive. Click on the Upload button; we can just drag files in there, or click and navigate to them. Go ahead and navigate to wherever you downloaded and extracted the ml-100k data set, and upload the u.data file, for example.

What happened there is that we talked to the name node to create a file and asked where to put it. It said to put it on this data node, and since there's only one data node, that's where it went. If we had more than one machine, though, it would actually start replicating that data across multiple data nodes, which would then acknowledge back through the client, and finally back to the name node, that the data had been successfully stored.
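
Once you're comfortable with the command line (covered below), you can see for yourself how a file's blocks are placed and replicated. A minimal sketch, assuming u.data ends up at /user/maria_dev/ml-100k/u.data in HDFS:

# Report the blocks, replication factor, and data node locations for one file
hdfs fsck /user/maria_dev/ml-100k/u.data -files -blocks -locations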

Let's also upload the u.item file as well, and see what else you can do.
You can actually do pretty much anything from the UI that you can do from the command line. For example, I can select u.data and click Open to see what's in it. Here we have a little view of the contents of the data file: each line says that a given user ID rated a given movie ID with a rating from 1 to 5 at a given timestamp, and you can scroll down and look at all 100,000 ratings. There are a couple of cool features, though. You can download the file to your local hard drive if you want to; let's go ahead and do that and open it up. We got our data back from HDFS. So what we've done here is upload data from our hard drive into an HDFS cluster of one, download it back to our hard drive, and get the same file back.

If you want to, you can also rename the file; let's say we want to call it ratings.data. Basically, everything works as you would expect here.

You can also do something kind of cool: you can select more than one file using the Shift key and concatenate them together. That downloads a single file containing the contents of both files. If you open the resulting file up, you'll see something like this, where it starts off with all 100,000 movie ratings and ends with the u.item file, which maps movie IDs to movie names, genre information, and so on.
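
For reference, the command line has a similar trick: hadoop fs -getmerge downloads every file in an HDFS directory as one concatenated local file. A minimal sketch, assuming both files are still sitting in /user/maria_dev/ml-100k:

# Concatenate everything under ml-100k into a single local file
hadoop fs -getmerge /user/maria_dev/ml-100k merged.txt
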
Now let's clean up after ourselves. With those two files still selected, click Delete to get rid of them, and select the Delete permanently option as well.

Then I'll go up a level back to maria_dev's home directory and select ml-100k. Notice I'm clicking on the row rather than on the name itself, so instead of navigating into the folder I'm just selecting it; then I can hit Delete and Delete permanently, just to make sure we've cleaned up properly. We've been playing around with HDFS through a web interface: under the hood this uses an HTTP interface to HDFS that lets us view and manipulate all the files in our HDFS file system. It's really easy.
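
You can actually poke at that HTTP interface directly: it's called WebHDFS. A minimal sketch, assuming WebHDFS is enabled on the sandbox and the name node's web port (50070 on HDP 2.x) is forwarded to your PC:

# List maria_dev's home directory through the WebHDFS REST API
curl -i "http://127.0.0.1:50070/webhdfs/v1/user/maria_dev?op=LISTSTATUS&user.name=maria_dev"
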
Access Via Putty:

Now, if you're more of a hacker or a programmer type (I know you are 😊), you might prefer using a command line interface: actually logging into the master node, or client node, whatever you want to call it, and manipulating your HDFS file system from a command line. I will show you how to do that now. First, we need some sort of terminal that lets us connect to our instance. If you haven't done this already, make sure you've opened Oracle VM VirtualBox, selected the Hortonworks Sandbox image, and started it; I've already done that. On Windows we're going to use a program called PuTTY; on macOS you can just use the built-in terminal instead. If you're on Windows, go to putty.org to find where to download PuTTY, then download the executable and run it. Follow the previous assignment's steps for installation and access.
So let's actually copy some data into Hadoop. The way you manipulate HDFS from the command line on a Linux host is with the hadoop command, followed by fs, followed by a dash and whatever operation you want to perform.

So for example, if I want to do an ls to list what's in the file system, I can see the contents of maria_dev's home directory in HDFS. Command: hadoop fs -ls
Let's create a directory in here to hold my MovieLens ratings data. We'll type in the command hadoop fs -mkdir ml-100k. Most of the students here have a CS background, so being able to work with the server through a command line interface will be valuable.

To make sure it's actually in there, run hadoop fs -ls again, and we should see our ml-100k directory.
Now let's upload some data into it. We're basically doing exactly what we did through the Ambari UI, just from the command line, but in this case we're uploading data into HDFS from the local file system. Since we're not going through a web browser on our Windows machine, "local file system" here means the Linux host itself. If I run pwd (without hadoop), it shows my present working directory, /home/maria_dev, and running ls shows what's in maria_dev's local home directory, which at the moment contains nothing.
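
To keep the two worlds straight, compare the commands side by side: plain shell commands work on the Linux host's local disk, while anything prefixed with hadoop fs talks to HDFS.

pwd              # local: prints /home/maria_dev on this host
ls -la           # local: lists files on the Linux host's disk
hadoop fs -ls    # HDFS: lists maria_dev's home directory in HDFS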

The first thing to do is get some data on here that we can play around with. A copy of the MovieLens data set is available on another server, so let's go grab it. You can just type in wget, which is a Linux command that retrieves files from the web:
wget http://media.sundog-soft.com/hadoop/ml-100k/u.data
Just like that; make sure you've got the spelling right. That's just a server that hosts the file. This command retrieves a copy of u.data from that server. Now if I type ls -la to see more details, u.data is there and appears to be in one piece.
Now let's upload this into HDFS. Like every HDFS command, it starts with hadoop fs; then we give what we're copying, u.data from the current directory, and where it should go in HDFS, which is a path relative to our HDFS home directory, ml-100k/u.data. So the command is hadoop fs -copyFromLocal u.data ml-100k/u.data, and that copies the u.data file from the local file system on this CentOS host into HDFS.
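
As an aside, hadoop fs -put does the same job for local sources and is a bit shorter to type; a sketch of the equivalent upload:

# Equivalent to -copyFromLocal when the source is a local file
hadoop fs -put u.data ml-100k/u.data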

Let's go ahead and make sure it's actually in there. You can run hadoop fs -ls ml-100k to see what's inside the ml-100k folder. It's there, in one piece.
All right, let's clean up after ourselves again like we did before. To do that, run hadoop fs -rm ml-100k/u.data to remove the u.data file; that moves it to the trash.

We can also remove the ml-100k folder itself with hadoop fs -rmdir ml-100k. Things are now nice and clean again; we're back to the way we started.
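
Note that -rmdir only removes an empty directory, which is why we deleted u.data first. If a directory still has files in it, you can remove it along with its contents in one step:

# Remove a non-empty HDFS directory and everything inside it
hadoop fs -rm -r ml-100k
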
That's a brief overview. If you want to see what else you can do with hadoop fs, just type hadoop fs by itself and it will print a list of available commands. If you scroll up, you'll see you can do pretty much everything you can do from a local file system command line: append, concatenate, copy files, and copy to local if you want to retrieve data back out (we could have used copyToLocal just like we used copyFromLocal). You can also change permissions just as you can in Unix, if you want to manage that sort of thing. A lot of the time you assume that anyone with access to your cluster can probably be trusted, but more and more you have to think about security these days, so make sure you tie down those permissions as appropriate. You can make directories and move things around; we played with some of these commands already. So there you have it: using HDFS from the command line.
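
Here are hedged sketches of two of the commands mentioned above, assuming u.data were still sitting at ml-100k/u.data in HDFS:

# Retrieve a file from HDFS back onto the local Linux file system
hadoop fs -copyToLocal ml-100k/u.data u.data

# Lock a directory down so only its owner can read, write, and list it
hadoop fs -chmod 700 ml-100k
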
As you can see, it's a lot easier to use HDFS from a UI, so you'll probably gravitate toward that. But a command line interface is available, and it can be useful if you're writing scripts or anything else that needs to run on a periodic basis. If you're done for now, go back to the terminal and type exit to log out; that will also close PuTTY.

If you really want to shut down for now, go back to your virtual machine, open the Machine menu, and use ACPI Shutdown to shut it down cleanly. Once that window goes away, you're OK to close the VirtualBox window.

All right, that's HDFS. We've learned how it works and what's going on under the hood, and we've gotten our hands dirty using it both through the Ambari UI in a web browser and directly from a command line interface. You are now an HDFS quasi-expert. Congratulations!
MapReduce Coding Task:

Install Python, MRJob, and nano

In this activity, we'll walk you through installing the "mrjob" package for MapReduce on
the HDP Sandbox, which will be needed for subsequent activities in this course.
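
To give you a feel for what mrjob is for, here is a minimal sketch of an mrjob job, the classic word count. The file name wordcount.py is just an illustration and is not part of this assignment:

# wordcount.py - a minimal mrjob example (illustrative only)
from mrjob.job import MRJob

class MRWordCount(MRJob):

    def mapper(self, _, line):
        # Emit (word, 1) for every word on the input line
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        # Add up the counts emitted for each word
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordCount.run()

Once mrjob is installed, a job like this can be run locally with python wordcount.py <input file>, or against the cluster with the -r hadoop runner.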

Because Cloudera has now placed their repositories behind a paywall, and the underlying CentOS 7 operating system used in the sandbox image has reached end-of-life (EOL) status, we need to point the sandbox's repositories at the locations where they have been archived before we can install anything new. Follow the steps below:

Installing mrjob on HDP 2.6.5

First you need to connect to your running Hortonworks Sandbox HDP virtual machine
from a terminal.

On Windows, you can use PuTTY. Enter maria_dev@127.0.0.1 for the hostname,
and 2222 for the port.

On Mac or Linux, open a terminal window and run ssh maria_dev@127.0.0.1 -p 2222

To log in, the password for maria_dev is also maria_dev.

You will now need to escalate your privileges in order to set things up. Enter:

su root

The default root password is hadoop. You will be prompted to change this.
Once you are successfully at the root level (you should see a # prompt instead of $),
enter the following commands:
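
The original command listing is cut off at this point. For reference only, the following is a sketch of the kind of commands this step typically involves on a CentOS 7 sandbox; the exact repository edits, package names, and versions here are assumptions, so follow the listing provided in class if it differs:

# Point the CentOS 7 yum repositories at the vault.centos.org archive (CentOS 7 is EOL)
sed -i 's/mirrorlist/#mirrorlist/g' /etc/yum.repos.d/CentOS-*.repo
sed -i 's|#baseurl=http://mirror.centos.org|baseurl=http://vault.centos.org|g' /etc/yum.repos.d/CentOS-*.repo
# You may also need to disable or adjust the Cloudera/HDP repo files in /etc/yum.repos.d/

# Install pip and the nano editor, then install mrjob
# (a specific pinned mrjob version may be required for the sandbox's Python; use the one given in class)
yum install -y python-pip nano
pip install mrjob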
