
Big Data Analytics

Class Activity and Assignment 2

Objectives

• Access HDFS and learn how to transfer files


• Run MapReduce code after installing Python, MRJob, and nano

Introduction:

Accessing HDFS:

Let's go ahead and access HDFS on the Hortonworks Sandbox running on your PC. It's not that hard. There are two ways to do it:

1) Access Using Ambari UI


2) Access Using Putty

Access Using Ambari UI

Click on HDFS in the left menu bar. You can see that we're actually running a name node and a data node, both obviously on the same virtual machine, but on a real cluster those would of course be on different computers.

So we're going to be talking to our name node and our data node, in a virtual sense, while we're playing with HDFS. Working with it from the UI is very simple. Click on Files View from the icon circled at the top, and the following screen will appear. This is the entire HDFS file system running on our little Hadoop cluster; a cluster of one virtual machine, such as it is, but it would work the same way on a real cluster.

So we can, for example, click on user, and the following screen will appear.

Then click on maria_dev to see maria_dev's home directory.


Now let's go ahead and create a folder for our MovieLens data. We can create a new folder just by clicking on New Folder, and we'll call it ml-100k, just like that.

Remember, that's the one hundred thousand ratings data set from the MovieLens web site. Under the hood, this is like talking to the name node and saying we're going to create a new directory, and the name node saying OK, here you go, your directory exists.

So we can go into that folder and hit the Upload button to upload files from our hard drive. Click on the Upload button; we can just drag files in there, or click and navigate to them. Go ahead and navigate to wherever you downloaded and extracted the ml-100k data set, and upload the u.data file, for example.

What happened there is that we talked to the name node to create a file and asked where to put it. It said to put it on this data node, and since there's only one data node, that's where it went. If we had more than one machine, though, it would actually start replicating that data across multiple data nodes, which would then acknowledge back through the client, and finally back to the name node, that the data had been successfully stored.
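
Once you're comfortable with the command line (covered below), you can see for yourself how a file's blocks are placed and replicated. A minimal sketch, assuming u.data ends up at /user/maria_dev/ml-100k/u.data in HDFS:

# Report the blocks, replication factor, and data node locations for one file
hdfs fsck /user/maria_dev/ml-100k/u.data -files -blocks -locations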

Let's also upload the u.item file as well, and see what else you can do.
You can actually do pretty much anything from the UI that you can do from the command line. For example, I can select u.data and click Open to see what's in it. Here we have a little view of the contents of the data file: each line says that a given user ID rated a given movie ID with a rating from 1 to 5 at a given timestamp, and you can scroll down and look at all 100,000 ratings. There are a couple of cool features, though. You can download the file to your local hard drive if you want to; let's go ahead and do that and open it up. We got our data back from HDFS. So what we've done here is upload data from our hard drive into an HDFS cluster of one, download it back to our hard drive, and get the same file back.

If you want to, you can also rename the file; let's say we want to call it ratings.data. Basically, everything works as you would expect here.

You can also do something kind of cool: you can select more than one file using the Shift key and concatenate them together. That downloads a single file containing the contents of both files. If you open the resulting file up, you'll see something like this, where it starts off with all 100,000 movie ratings and ends with the u.item file, which maps movie IDs to movie names, genre information, and so on.
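
For reference, the command line has a similar trick: hadoop fs -getmerge downloads every file in an HDFS directory as one concatenated local file. A minimal sketch, assuming both files are still sitting in /user/maria_dev/ml-100k:

# Concatenate everything under ml-100k into a single local file
hadoop fs -getmerge /user/maria_dev/ml-100k merged.txt
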
Now let's clean up after ourselves. With those two files still selected, click Delete to get rid of them, and select the Delete permanently option as well.

Then I'll go up a level back to maria_dev's home directory and select ml-100k. Notice I'm clicking on the row rather than on the name itself, so instead of navigating into the folder I'm just selecting it; then I can hit Delete and Delete permanently, just to make sure we've cleaned up properly. We've been playing around with HDFS through a web interface: under the hood this uses an HTTP interface to HDFS that lets us view and manipulate all the files in our HDFS file system. It's really easy.
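
You can actually poke at that HTTP interface directly: it's called WebHDFS. A minimal sketch, assuming WebHDFS is enabled on the sandbox and the name node's web port (50070 on HDP 2.x) is forwarded to your PC:

# List maria_dev's home directory through the WebHDFS REST API
curl -i "http://127.0.0.1:50070/webhdfs/v1/user/maria_dev?op=LISTSTATUS&user.name=maria_dev"
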
Access Via Putty:

Now, if you're more of a hacker or a programmer type (I know you are 😊), you might prefer using a command line interface: actually logging into the master node, or client node, whatever you want to call it, and manipulating your HDFS file system from a command line. I will show you how to do that now. First, we need some sort of terminal that lets us connect to our instance. If you haven't done this already, make sure you've opened Oracle VM VirtualBox, selected the Hortonworks Sandbox image, and started it; I've already done that. On Windows we're going to use a program called PuTTY; on macOS you can just use the built-in terminal instead. If you're on Windows, go to putty.org to find where to download PuTTY, then download the executable and run it. Follow the previous assignment's steps for installation and access.
So let's actually copy some data into Hadoop. The way you manipulate HDFS from the command line on a Linux host is with the hadoop command, followed by fs, followed by a dash and whatever operation you want to perform.

So for example, if I want to do an ls to list what's in the file system, I can see the contents of maria_dev's home directory in HDFS. Command: hadoop fs -ls
Let's create a directory in here to hold my MovieLens ratings data. We'll type in the command hadoop fs -mkdir ml-100k. Most of the students here have a CS background, so being able to work with the server through a command line interface will be valuable.

To make sure it's actually in there, run hadoop fs -ls again, and we should see our ml-100k directory.
Now let's upload some data into it. We're basically doing exactly what we did through the Ambari UI, just from the command line, but in this case we're uploading data into HDFS from the local file system. Since we're not going through a web browser on our Windows machine, "local file system" here means the Linux host itself. If I run pwd (without hadoop), it shows my present working directory, /home/maria_dev, and running ls shows what's in maria_dev's local home directory, which at the moment contains nothing.
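
To keep the two worlds straight, compare the commands side by side: plain shell commands work on the Linux host's local disk, while anything prefixed with hadoop fs talks to HDFS.

pwd              # local: prints /home/maria_dev on this host
ls -la           # local: lists files on the Linux host's disk
hadoop fs -ls    # HDFS: lists maria_dev's home directory in HDFS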

The first thing to do is get some data on here that we can play around with. A copy of the MovieLens data set is available on another server, so let's go grab it. You can just type in wget, which is a Linux command that retrieves files from the web:
wget http://media.sundog-soft.com/hadoop/ml-100k/u.data
Just like that; make sure you've got the spelling right. That's just a server that hosts the file. This command retrieves a copy of u.data from that server. Now if I type ls -la to see more details, u.data is there and appears to be in one piece.
Now let's upload this into HDFS. Like every HDFS command, it starts with hadoop fs; then we give what we're copying, u.data from the current directory, and where it should go in HDFS, which is a path relative to our HDFS home directory, ml-100k/u.data. So the command is hadoop fs -copyFromLocal u.data ml-100k/u.data, and that copies the u.data file from the local file system on this CentOS host into HDFS.
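
As an aside, hadoop fs -put does the same job for local sources and is a bit shorter to type; a sketch of the equivalent upload:

# Equivalent to -copyFromLocal when the source is a local file
hadoop fs -put u.data ml-100k/u.data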

Let's go ahead and make sure it's actually in there. You can run hadoop fs -ls ml-100k to see what's inside the ml-100k folder. It's there, in one piece.
All right, let's clean up after ourselves again like we did before. To do that, run hadoop fs -rm ml-100k/u.data to remove the u.data file; that moves it to the trash.

We can also remove the ml-100k folder itself with hadoop fs -rmdir ml-100k. Things are now nice and clean again; we're back to the way we started.
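
Note that -rmdir only removes an empty directory, which is why we deleted u.data first. If a directory still has files in it, you can remove it along with its contents in one step:

# Remove a non-empty HDFS directory and everything inside it
hadoop fs -rm -r ml-100k
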
That's a brief overview. If you want to see what else you can do with hadoop fs, just type hadoop fs by itself and it will print a list of available commands. If you scroll up, you'll see you can do pretty much everything you can do from a local file system command line: append, concatenate, copy files, and copy to local if you want to retrieve data back out (we could have used copyToLocal just like we used copyFromLocal). You can also change permissions just as you can in Unix, if you want to manage that sort of thing. A lot of the time you assume that anyone with access to your cluster can probably be trusted, but more and more you have to think about security these days, so make sure you tie down those permissions as appropriate. You can make directories and move things around; we played with some of these commands already. So there you have it: using HDFS from the command line.
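
Here are hedged sketches of two of the commands mentioned above, assuming u.data were still sitting at ml-100k/u.data in HDFS:

# Retrieve a file from HDFS back onto the local Linux file system
hadoop fs -copyToLocal ml-100k/u.data u.data

# Lock a directory down so only its owner can read, write, and list it
hadoop fs -chmod 700 ml-100k
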
As you can see, it's a lot easier to use HDFS from a UI, so you'll probably gravitate toward that. But a command line interface is available, and it can be useful if you're writing scripts or anything else that needs to run on a periodic basis. If you're done for now, go back to the terminal and type exit to log out; that will also close PuTTY.

If you really want to shut down for now, go back to your virtual machine, open the Machine menu, and use ACPI Shutdown to shut it down cleanly. Once that window goes away, you're OK to close the VirtualBox window.

All right, that's HDFS. We've learned how it works and what's going on under the hood, and we've gotten our hands dirty using it both through the Ambari UI in a web browser and directly from a command line interface. You are now an HDFS quasi-expert. Congratulations!
MapReduce Coding Task:

Install Python, MRJob, and nano

In this activity, we'll walk you through installing the "mrjob" package for MapReduce on
the HDP Sandbox, which will be needed for subsequent activities in this course.
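
To give you a feel for what mrjob is for, here is a minimal sketch of an mrjob job, the classic word count. The file name wordcount.py is just an illustration and is not part of this assignment:

# wordcount.py - a minimal mrjob example (illustrative only)
from mrjob.job import MRJob

class MRWordCount(MRJob):

    def mapper(self, _, line):
        # Emit (word, 1) for every word on the input line
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        # Add up the counts emitted for each word
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordCount.run()

Once mrjob is installed, a job like this can be run locally with python wordcount.py <input file>, or against the cluster with the -r hadoop runner.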

Because Cloudera has now placed their repositories behind a paywall, and the underlying CentOS 7 operating system used in the sandbox image has reached end-of-life (EOL) status, we need to point the sandbox's repositories at the locations where they have been archived before we can install anything new. Follow the steps below:

Installing mrjob on HDP 2.6.5

First you need to connect to your running Hortonworks Sandbox HDP virtual machine
from a terminal.

On Windows, you can use PuTTY. Enter maria_dev@127.0.0.1 for the hostname,
and 2222 for the port.

On Mac or Linux, open a terminal window and run ssh maria_dev@127.0.0.1 -p 2222

To log in, the password for maria_dev is also maria_dev.

You will now need to escalate your privileges in order to set things up. Enter:

su root

The default root password is hadoop. You will be prompted to change this.
Once you are successfully at the root level (you should see a # prompt instead of $),
enter the following commands:
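
The original command listing is cut off at this point. For reference only, the following is a sketch of the kind of commands this step typically involves on a CentOS 7 sandbox; the exact repository edits, package names, and versions here are assumptions, so follow the listing provided in class if it differs:

# Point the CentOS 7 yum repositories at the vault.centos.org archive (CentOS 7 is EOL)
sed -i 's/mirrorlist/#mirrorlist/g' /etc/yum.repos.d/CentOS-*.repo
sed -i 's|#baseurl=http://mirror.centos.org|baseurl=http://vault.centos.org|g' /etc/yum.repos.d/CentOS-*.repo
# You may also need to disable or adjust the Cloudera/HDP repo files in /etc/yum.repos.d/

# Install pip and the nano editor, then install mrjob
# (a specific pinned mrjob version may be required for the sandbox's Python; use the one given in class)
yum install -y python-pip nano
pip install mrjob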
