
Big Data Analysis Workshop:
Accessing Hadoop Cluster on Wrangler

Drs. Weijia Xu, Ruizhu Huang and Amit Gupta


Data Mining & Statistics Group
Texas Advanced Computing Center
University of Texas at Austin

Sept. 28~29, 2017


Atlanta, GA
Hadoop
•  Hadoop is an open-source implementation of the
MapReduce programming model in Java, with
interfaces to other programming languages such as
C/C++ and Python.
•  The top 6 vendors offering Big Data Hadoop solutions
are:
•  Cloudera
•  Hortonworks
•  Amazon Web Services Elastic MapReduce Hadoop
Distribution
•  Microsoft
•  MapR
•  IBM InfoSphere Insights
Hadoop includes
– HDFS, a distributed file system based on the Google File System
(GFS), as its shared file system.
– YARN, a resource manager that assigns resources to
computational tasks.
– MapReduce, a library enabling efficient distributed data
processing.
– Mahout, a scalable machine learning and data mining library.
– Hadoop Streaming, which enables processing with other
languages (see the sketch below).
–…
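
As a quick, minimal sketch of Hadoop Streaming, the job below uses ordinary
Unix commands as mapper and reducer to count input lines. The jar path and
HDFS directories are assumptions that vary by installation:

# Hadoop Streaming sketch: count input lines with cat/wc.
# Assumption: streaming jar location and HDFS paths; adjust for your cluster.
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -input /user/$USER/input \
  -output /user/$USER/output \
  -mapper /bin/cat \
  -reducer /usr/bin/wc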
Wrangler
[Architecture diagram: replicated 10 PB mass storage subsystems at
TACC and Indiana; a 500+ TB high-speed NAND flash storage system with
1 TB/s throughput and 250M+ IOPS; analysis systems of 96 nodes (TACC)
and 24 nodes (Indiana), each with 128 GB+ memory and Haswell CPUs; a
120-lane (56 Gb/s) non-blocking IB interconnect; 40 Gb/s Ethernet and
100 Gbps public network access, including Globus.]

•  A direct attached PCI interface allows access to the NAND flash.
•  Not limited by a networking connection.
•  Flash storage is not tied to individual nodes.
•  The Hadoop cluster can be dynamically created over 2 to 48 nodes
for each project to use in its allocated time.
•  Each node has access to 4 TB of flash storage across four channels.
•  The Hadoop cluster is accessible via idev, batch job submission,
and VNC sessions.

Hadoop Cluster on Wrangler
•  Started dynamically upon a Hadoop reservation request.
Usually, you need two steps:

•  Step 1: Create a Hadoop reservation through the Wrangler
data portal.
What do you need?
Any web browser

•  Step 2: Access your Hadoop cluster and submit jobs.
What do you need?
Secure Shell client
Any VNC client
However, for this course,

Step 1: Create a Hadoop reservation through the Wrangler data
portal. Done!
Reservation name: hadoop+TRAINING-OPEN+2375

Step 2: Access your Hadoop cluster and submit jobs.
What do you need?
Secure Shell client
Any VNC client
Multi-Factor Authentication
•  Multi-Factor Authentication with Duo
https://portal.xsede.org/mfa
Check Hadoop Reservation
•  Log on to a Wrangler login node from your SSH client:

>ssh [email protected]

•  Users can check the reservation status with the `scontrol` command:



>scontrol show reservation
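
The command prints one block of Field=Value pairs per reservation; the
values below are purely illustrative (names, times, and node lists will
differ on your cluster):

ReservationName=hadoop+TRAINING-OPEN+2375 StartTime=2017-09-28T09:00:00
   EndTime=2017-09-29T18:00:00 Duration=1-09:00:00 Nodes=c252-[101-112]
   NodeCnt=12 Accounts=TRAINING-OPEN State=ACTIVE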

•  The reservation will include all users from the project.

•  The first node in the reservation will be used as the namenode.

Access Hadoop Reservation
Once the reservation status is “active”, a user can access the
cluster through a Slurm job:

•  VNC job: starts a VNC server session on one of the nodes in the
Hadoop cluster.
Check cluster information and Hadoop job status
Applications with a graphical/web user interface

•  idev job: assigns one node in the Hadoop cluster to the user.
Manage data into and out of the Hadoop cluster
Submit Hadoop jobs via the command line
Code testing

•  Batch job: submits jobs to the YARN resource manager in the
Hadoop cluster.
Submit large analysis jobs
Submit batches of processing jobs to run sequentially
Start other applications, e.g. Zeppelin

Access Hadoop Cluster with VNC
Please visit: vis.tacc.utexas.edu

Choose “TACC User Portal User”

Enter credentials

0. Set VNC password (only needed once)

1. Choose the Wrangler tab

2. Select project TRAINING-OPEN

3. Fill in the reservation name:
hadoop+TRAINING-OPEN+2375
and choose the “hadoop” queue

A VNC Session Enables
Access to Web UIs
•  Several web UIs run on different ports of the namenode:
•  Cluster information: port 50070
•  E.g. c252-101:50070
•  Job information: port 8088
•  E.g. c252-101:8088

•  Other applications may have their own UIs running:
•  Spark job UI
•  Hive UI

•  The web UIs may not be required, as all information can be
accessed through the command line as well (see the sketch below).
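
A minimal sketch of those command-line equivalents, using the standard
hdfs and yarn clients from a node inside the Hadoop cluster:

# Cluster information (roughly what the namenode UI on port 50070 shows):
hdfs dfsadmin -report

# Application status (roughly what the YARN UI on port 8088 shows):
yarn application -list -appStates ALL

# Per-node status in the YARN cluster:
yarn node -list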

Access Hadoop Reservation via idev
Session
Users can submit an idev session to the Hadoop cluster
reservation:
>idev -r hadoop+TRAINING-OPEN+2375

It defaults to using your default project;
the -A allocation_name option specifies the allocation to use.

The default duration for idev is 30 minutes;
the -m minutes option specifies the duration of the idev session.

Please limit your usage to Hadoop-related tasks; you can also
submit idev without the reservation for non-Hadoop tasks.
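
Inside an idev session, data can be moved in and out of the cluster with
the standard hdfs dfs client; a minimal sketch (the HDFS paths and file
names are illustrative):

>hdfs dfs -mkdir -p /user/$USER/input        # create a directory in HDFS
>hdfs dfs -put mydata.csv /user/$USER/input  # copy a local file into HDFS
>hdfs dfs -ls /user/$USER/input              # list HDFS contents
>hdfs dfs -get /user/$USER/output results/   # copy results back to local disk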

Slurm
Slurm is an open source, fault-tolerant, and highly scalable cluster
management and job scheduling system for large and small Linux
clusters.
•  sbatch is used to submit a job script for later execution.
•  sbatch myHadoopJob.slurm
•  scancel is used to cancel a pending or running job or job step.
•  scancel 1234
•  scontrol is the administrative tool used to view and/or modify
Slurm state.
•  scontrol show reservation
•  sinfo reports the state of partitions and nodes managed by Slurm.
•  squeue reports the state of jobs or job steps.
•  squeue -u $USER
Batch Job Script
https://portal.tacc.utexas.edu/user-guides/wrangler#hadoop-hdfs-jobs-on-wrangler
myHadoopJob.slurm
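
The script itself appears in the user guide linked above; the sketch below
is a minimal, illustrative version rather than the guide's exact script.
The queue, time limit, reservation, and payload are assumptions to adapt:

#!/bin/bash
#SBATCH -J myHadoopJob            # job name
#SBATCH -o myHadoopJob.%j.out     # stdout file (%j expands to the job ID)
#SBATCH -p hadoop                 # Hadoop queue
#SBATCH -N 1                      # number of nodes
#SBATCH -n 1                      # number of tasks
#SBATCH -t 04:00:00               # maximum run time
#SBATCH --reservation=hadoop+TRAINING-OPEN+2375   # Hadoop reservation

# Illustrative payload: the pi example from the stock MapReduce examples
# jar; adjust the jar path for your installation.
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi 2 100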

login1$ sbatch myHadoopJob.slurm


Recap
•  Access by secure shell client
ssh [email protected]
idev -r hadoop+TRAINING-OPEN+2375 -m 240 -p hadoop

•  Access by vis portal


–  Go to vis.tacc.utexas.edu using a web browser
–  Log in with your credentials
–  Go to the Wrangler tab to start VNC sessions using
reservation hadoop+TRAINING-OPEN+2375 and
the “hadoop” queue

FYI: How to Create Hadoop
Reservation
Wrangler data portal: portal.wrangler.tacc.utexas.edu

On the project page choose: Manage -> Create Hadoop Reservation

Form fields (callouts from the portal screenshot):
•  Number of nodes (1~10) to be used for the Hadoop cluster
•  Scheduled start time
•  Duration (1-30 days)
