Cloudera User Manual

This document provides guidance on accessing and using a Cloudera test drive environment on Azure. It describes how Cloudera Director and Cloudera Manager are used to deploy and manage a Hadoop cluster with 1 master node and 3 worker nodes. It outlines steps to run a Spark WordCount application on Azure Data Lake Store, create Hive and Impala tables on the output, and query the data using Hue, Hive, and Impala. The objective is to allow users to test Azure Data Lake Store integration and running Hadoop/Spark, Hive, and Impala jobs on sample data.


Contents
1. About Cloudera
1.1. Cloudera Director
1.2. Cloudera Manager
2. Objective
4. Getting Started
4.1. Accessing Cloudera Backend Cluster Details
4.2. Configure SOCKS Proxy
4.3. Configure Your Browser to Access Proxy
4.4. Accessing Cloudera Manager from Cloudera Director Web UI
4.5. Hue
4.6. Apache Spark (Run Spark App)
4.7. Viewing Jobs in UI
4.8. Hive
4.9. Impala
5. Power BI integration with Data Lake Store and Impala (Optional)
5.1. Integrating with Data Lake Store
5.2. Integrating with Impala
6. Reference
6.1. Configure SOCKS Proxy
6.2. Restart Cloudera Management Service
6.3. Error Messages While Running the Spark Job

1. About Cloudera
Cloudera provides an open-source Apache Hadoop distribution, CDH (Cloudera Distribution Including
Apache Hadoop), which targets enterprise-class deployments of that technology.

Cloudera provides a scalable, flexible, integrated platform that makes it easy to manage rapidly
increasing volumes and varieties of data in your enterprise. Cloudera products and solutions
enable you to deploy and manage Apache Hadoop and related projects, manipulate and
analyze your data, and keep that data secure and protected.

Cloudera develops a Hadoop platform that brings the most popular Apache Hadoop open-source
software together in one place. Hadoop is an ecosystem, and setting up a cluster manually is
painful: going through each node, pushing configuration across the cluster, deploying your
services, and restarting them on a large cluster is a major drawback of distributed systems and
requires a lot of administrative automation. Cloudera's big data Hadoop distribution handles
installation and updates on a cluster in a few clicks.

Cloudera also develops its own projects, such as Impala and Kudu, that improve Hadoop
integration and responsiveness in the industry.

1.1. Cloudera Director


Cloudera Director enables reliable self-service for using CDH and Cloudera Enterprise Data
Hub in the cloud.

Cloudera Director provides a single-pane-of-glass administration experience for central IT to
reduce costs and deliver agility, and for end users to easily provision and scale clusters.
Advanced users can interact with Cloudera Director programmatically through the REST API
or the CLI to maximize time-to-value for an enterprise data hub in cloud environments.

Cloudera Director is designed for both long running and transient clusters. With long running
clusters, you deploy one or more clusters that you can scale up or down to adjust to demand.
With transient clusters, you can launch a cluster, schedule any jobs, and shut the cluster down
after the jobs complete.

The Cloudera Director server is designed to run in a centralized setup, managing multiple
Cloudera Manager instances and CDH clusters, with multiple users and user accounts. The
server works well for launching and managing large numbers of clusters in a production
environment.

1.2. Cloudera Manager
Cloudera Manager is a sophisticated application used to deploy, manage, monitor, and
diagnose issues with your CDH deployments. Cloudera Manager provides the Admin Console,
a web-based user interface that makes administration of your enterprise data simple and
straightforward. It also includes the Cloudera Manager API, which you can use to obtain
cluster health information and metrics, as well as configure Cloudera Manager.

2. Objective
NOTE: As this test drive provides access to the full Cloudera Director platform, deployment can
sometimes take up to 45 minutes. While you wait, please feel free to review the helpful content
in this manual, on Cloudera’s Azure Marketplace product page, or on the Cloudera website.
Please also consider watching the demo video showcased on the test drive launch page on the
Azure Marketplace website.

The test drive provisions Cloudera Director, the environment, Cloudera Manager, and a cluster
consisting of 1 master node and 3 worker nodes. The test drive also integrates with Azure Data
Lake Store.

The use case scenario for this test drive is to provide users with a test Azure Data Lake Store and:

1. Run the WordCount app with Hadoop/Spark on ADLS.


2. Create a Hive table on the output, and query Hive from Hue.

3. Create an Impala table on the ADLS output and query Impala from Hue or Power BI.

The following diagram shows how the data in this test case flows from a .TXT file via Hue
to ADLS, processed by Spark.

4. Getting Started
Once you have signed up for or signed in to the test drive from the Azure Marketplace, the test
drive will deploy. When deployment finishes, you will receive all the necessary access
information and credentials, such as URLs, usernames, and passwords, via email.

4.1. Accessing Cloudera Backend Cluster Details

Since the IP addresses of the master and worker nodes change for each deployment, you
must access the Cloudera backend cluster details to get the node details.

1. Log in to the Cloudera Director VM using the Cloudera Director FQDN provided in the test
drive access information and an SSH tool such as PuTTY (or Terminal on a Mac); this
walkthrough uses PuTTY. (Download PuTTY here)
E.g. cldrhyic.eastus.cloudapp.azure.com

* Mac users may visit section 6.1 in the Reference section of this guide to learn how to set up
a tunnel using Terminal.

2. In the left navigation panel of PuTTY, navigate to Connection > SSH >Tunnels.
Enter the source port as ‘1080’, select Dynamic, click on Add, and then click Open
to connect to the VM:

3. Once connected, log in to the Cloudera Director VM using the Director Username and
then the Director Password from the provided test drive access credentials.
(Note: Passwords are hidden when typed or pasted in Linux terminals)

4. All the Cloudera backend cluster details are present in the NodeDetails file. Copy the
NodeDetails contents into a text file or Word document for reference; these details will be
used later.

To open the NodeDetails file, use the following command.

cat NodeDetails

The NodeDetails file contains the node and URI details used by the Cloudera test drive
environment. These are gathered by a script that pulls the required data through API calls.
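
For the curious, similar information can be pulled by hand from the Cloudera Manager REST API. The call below is only an illustrative sketch: the credential placeholders and the API version are assumptions about this test drive, not values taken from the NodeDetails script.

# Illustrative only: list cluster hosts via the Cloudera Manager REST API.
# The API version (v17) matches CM 5.12-era releases but is an assumption here.
curl -s -u "<CM-WEB UI Username>:<CM-WEB UI Password>" \
  "http://<Manager Node private IPAddress>:7180/api/v17/hosts"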

4.2. Configure SOCKS Proxy


For security purposes, Cloudera recommends that you connect to your cluster using a SOCKS
proxy. A SOCKS proxy lets your browser perform lookups directly from your Microsoft Azure
network and connect to services using private IP addresses and internal fully qualified domain
names (FQDNs).

This approach does the following:

• Sets up a single SSH tunnel to one of the hosts on the network (the Cloudera Director
host) and creates a SOCKS proxy on that host.

• Changes the browser configuration to do all lookups through that SOCKS proxy host.
Note: In the previous section (4.1), the SSH tunnel has already been set up on source
port 1080.

1. Run the following command in the terminal, which turns the Linux machine into a SOCKS
proxy (copy/paste may not work):

ssh -f -N -D 0.0.0.0:1080 localhost

When asked “Are you sure you want to continue connecting (yes/no)?”, type “yes” and
hit Enter. Then enter the Director Password mentioned in the provided access
credentials for the test drive.

Verify the SOCKS proxy is running using the command ps -ef | grep ssh on the console.
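
Optionally, you can also sanity-check the proxy from your local machine, where the PuTTY tunnel from section 4.1 is open. This is only a sketch and assumes curl is installed locally:

# Fetch the Cloudera Manager login page through the SOCKS proxy; a 200 (or 302)
# HTTP status code indicates the tunnel and proxy are working.
curl -s --socks5-hostname localhost:1080 -o /dev/null -w "%{http_code}\n" \
  "http://<Manager Node private IPAddress>:7180"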

4.3. Configure Your Browser to Access Proxy


Follow the steps below to configure either a Chrome (1) or Edge (2) web browser to use
a SOCKS proxy.

1. Configuring for Chrome Web Browser:

a. Open Chrome settings and navigate to Show advanced settings > Network >
Change proxy settings. Click on LAN settings.

b. Check the Proxy Server checkbox and click on Advanced.

c. Enter the SOCKS details:

Proxy address to use: localhost
Port: 1080

d. Click on OK to save the changes.

2. Configuring for Edge Web Browser:

a. Navigate to Settings > View advanced settings > Open proxy settings.

b. Set "Use a proxy server" to On. Enter the Address as http://localhost with the port as 1080,
and click on Save.

3. Configuring Chrome for Linux

/usr/bin/google-chrome \
--user-data-dir="$HOME/chrome-with-proxy" \
--proxy-server="socks5://localhost:1080"

4. Configuring Chrome for Mac OS X

"/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" \


--user-data-dir="$HOME/chrome-with-proxy" \
--proxy-server="socks5://localhost:1080"

Note: Please visit section 6.1 in the Reference section later in this guide for additional
details and help for any error messages you may encounter.

4.4. Accessing Cloudera Manager from Cloudera Director Web UI

After deploying a cluster, you can manage it using Cloudera Manager:

1. Access the Cloudera Director Web UI using the Cloudera Director Access URL provided in
the Access Information.

E.g. cldrhyic.eastus.cloudapp.azure.com:7189

2. Accept the End User License Terms and Conditions and click on Continue.

3. Log in to the Cloudera Director web console using the CD-WEB UI Username and
Password from the Access Information.

4. The Cloudera Director console should open. Click on the Cloudera Manager link from
the Cloudera Director Dashboard, as shown below.

5. Copy the Cloudera Manager Host address, along with the port number, and paste it into a
new browser tab.

6. Log in to the Cloudera Manager Console using the CM-WEB UI Username and CM-WEB UI
Password from the Access Information.

Note: The next step is to Restart Stale Services. We must do this so that the Azure
service principal is written to the core-site.xml configuration file, which is required to
integrate with Azure Data Lake Store.

7. In Cloudera Manager, click on the HDFS-1 service to Restart Stale Services.

8. Click on Restart Stale Services as shown in the below screenshot.

9. Click on Restart Stale Services so the cluster can read the new configuration information.

10. Click on the Restart Now button.

11. Wait until all requested services are restarted. Once all the services are restarted, click on
the Finish button.

12. Cloudera Director is now ready, with Cloudera Manager and the cluster (1 master and
3 workers).

Note: Please visit section 6.2 in the Reference section later in this guide for additional
details and help for any error messages you may encounter.
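
If you would like to confirm that the ADLS credentials were actually pushed to core-site.xml after the restart, a quick check such as the one below can be run on the master node. The property names are the standard CDH 5.12 ADLS connector settings and are an assumption about this particular deployment:

# Print the ADLS OAuth settings the cluster is currently using (values may be
# redacted or empty if the configuration has not been applied yet).
hdfs getconf -confKey dfs.adls.oauth2.access.token.provider.type
hdfs getconf -confKey dfs.adls.oauth2.client.id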

4.5. Hue
Hue is a set of web applications that enable you to interact with a CDH cluster. Hue
applications let you browse HDFS and manage a Hive metastore. They also let you run Hive
and Cloudera Impala queries, HBase and Sqoop commands, Pig scripts, MapReduce jobs, and
Oozie workflows.

1. Copy the Cloudera Hue Web URL from the saved NodeDetails file and paste it into a
browser, which opens the Hue console.

2. Create a Hue account using the Cloudera Hue Web UI Username/Password from the
NodeDetails file.

3. You will be logged in to the Hue dashboard. On the right side of the page, click on the HDFS
browser icon, as shown in the screenshot below.

Note: CDH 5.12 has a new Hue UI. We recommend switching to Hue 3 from the admin
tab (see screenshot below).

4. Copy the input data from the link below. Give the file any name (e.g. 'data' or
'input'), then save it in .txt format.

https://aztdrepo.blob.core.windows.net/clouderadirector/inputfile.txt

Once ready, click on Upload on the Hue file browser page (see below).

Note: Please ensure the inputfile is uploaded to the path /user/admin (see below):

5. Select the saved .txt file to upload.

6. The .txt file is now uploaded to Hue. The Spark application will use this data as input
and provide the output to ADLS.
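
If you prefer the command line over the Hue file browser, the file can also be uploaded with the HDFS CLI. This is a minimal sketch, assuming you are logged in to the master node (as in section 4.6) and saved the file as inputfile.txt; depending on directory ownership you may need to run it as the hdfs superuser (sudo -u hdfs):

# Create the target directory (if missing) and upload the input file to it.
hdfs dfs -mkdir -p /user/admin
hdfs dfs -put inputfile.txt /user/admin/inputfile.txt
hdfs dfs -ls /user/admin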

4.6. Apache Spark (Run Spark App)


Spark is the open standard for flexible in-memory data processing that enables batch, real-
time, and advanced analytics on the Apache Hadoop platform.

To use it properly, it is also a good idea to install dos2unix, a program that converts text files
from DOS to UNIX format, ensuring everything will run in a Linux environment.

1. Log in to the master VM by typing the command below (copy/paste may not work):

ssh -i sshKeyForAzureVM cloudera@<Master Node private IPAddress>

Note: Provide the <Master Node private IPAddress> from the NodeDetails file.

2. Download the script file using the command below. The script contains the Spark app
(WordCount). The application counts the number of occurrences of each letter in words
that have more characters than a given threshold.

wget https://raw.githubusercontent.com/sysgain/clouderatd/master/scripts/ClouderaSparkSetup.sh

3. To install dos2unix, run the following command:

sudo yum install -y dos2unix

4. To convert the ClouderaSparkSetup.sh file to UNIX format and make it executable, run the following commands:

dos2unix /home/cloudera/ClouderaSparkSetup.sh
chmod 755 /home/cloudera/ClouderaSparkSetup.sh

5. Run the following command to execute the ClouderaSparkSetup.sh script:

sh ClouderaSparkSetup.sh <Datalake Directory> <Master Node private IPAddress> <inputfile.txt> <Datalake Endpoint for the testdrive>

Note: Replace the placeholders above with the corresponding values from the NodeDetails
file, and use the name of the input file you just uploaded in Hue in place of <inputfile.txt>.

Example: sh ClouderaSparkSetup.sh demotdah6k 10.3.0.6 inputfile.txt adl://cddatalakeah6k.azuredatalakestore.net
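
For orientation, the command below is roughly what ClouderaSparkSetup.sh does behind the scenes. It is a sketch only: the class name is taken from the log output in section 6.3, while the package, argument order, and output path are assumptions rather than values read from the script.

# Submit the WordCount jar to YARN, reading the uploaded input file and
# writing the results to an ADLS output directory (paths are examples).
spark-submit --master yarn --class SparkWordCount \
  /home/cloudera/wordcount.jar \
  /user/admin/inputfile.txt \
  adl://cddatalakeah6k.azuredatalakestore.net/demotdah6k/output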

6. By executing the above script, the output data has been stored to ADLS by the Spark application.

Note: Please visit section 6.3 in the Reference section for additional details and help for
any error messages you may encounter.
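
To confirm that the output actually landed in ADLS, you can list the Data Lake Store directory from the master node. This is a sketch using the example endpoint and directory from step 5; substitute the values from your NodeDetails file:

# List the Data Lake Store directory written by the Spark job.
hadoop fs -ls adl://cddatalakeah6k.azuredatalakestore.net/demotdah6k/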

4.7. Viewing Jobs in UI


Next, navigate to the YARN/Spark UI to see the WordCount Spark job.

1. Go to http://<Manager Node private IPAddress>:7180/cmf/home

Example: http://10.3.0.5:7180/cmf/home

2. Click on YARN-1.

3. Click on Applications tab in the top navigation menu to view the available jobs.

Each job has Summary and Detail information. A job Summary includes the following
attributes: start & end timestamps, query name (if the job is part of a Hive query),
queue, job type, job ID, and user.

4. You can also see the available applications by navigating to the Spark UI:

1. Go to http://<Manager Node private IPAddress>:7180/cmf/home

Example: http://10.3.0.5:7180/cmf/home

2. Click on SPARK_ON_YARN-1. (May appear as ‘SPARK_ON…’)

3. Navigate to the History Server WEB UI by going to http://<Master Node private
IPAddress>:18088

Example: http://10.3.0.8:18088/

Note: Please visit section 6.3 in the Reference section for additional details and help for
any error messages you may encounter.
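
As a command-line alternative to the web UIs above, the YARN CLI on the master node can list the same applications; the WordCount job should appear once it has completed:

# List finished YARN applications, including the Spark WordCount job.
yarn application -list -appStates FINISHED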

4.8. Hive
Apache Hive is a data warehouse software project built on top of Apache Hadoop for
providing data summarization, query, and analysis. Hive gives a SQL-like interface to
query data stored in various databases and file systems that integrate with Hadoop.

Now we will create a Hive table from the output of the Spark application stored on ADLS and
run a Hive query from Hue.

1. Navigate to the Query Editors drop-down menu in the Hue WEB UI and click on Hive.

2. In the default database, execute the below query:

create external table <tablename> (character varchar(1), frequency varchar(10))
row format delimited fields terminated by ',' lines terminated by '\n'
stored as textfile
location "<Output Data files on Datalake for the testdrive>";

Note: Use any name for <tablename> and replace the <Output Data files on
Datalake for the testdrive> placeholder with the corresponding value from the
NodeDetails file.

3. View the table by running the query:

Select * from <tablename>;
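
The same query can also be run outside Hue with the beeline client. This is a sketch only: it assumes HiveServer2 is listening on the master node at its default port 10000 and uses a hypothetical table name, wordcount.

# Connect to HiveServer2 and run the query from the command line.
beeline -u "jdbc:hive2://<Master Node private IPAddress>:10000/default" \
  -n cloudera -e "SELECT * FROM wordcount LIMIT 20;"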

4.9. Impala
Impala is an open-source, massively parallel processing query engine for clustered
systems like Apache Hadoop. It is an interactive SQL-like query engine that runs on top of
the Hadoop Distributed File System (HDFS). It integrates with the Hive metastore to share
table information between the two components.

1. Note: Impala integrates with ADLS as of CDH 5.12.

2. Navigate to the Query Editor drop-down menu and click on Impala.

3. Execute the below query in the default database to sync the data from Hive to Impala:

INVALIDATE METADATA;

4. View the table by running the query:

Select * from <tablename>;

5. You have now successfully run the Impala query using Hue!
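
If you prefer a shell over Hue, the same metadata refresh and query can also be issued with impala-shell. This is a sketch, assuming the Impala daemon is running on a worker node at its default port 21000 and using a hypothetical table name, wordcount:

# Refresh Impala's view of the Hive metastore, then query the table.
impala-shell -i <Worker Node private IPAddress>:21000 -q "INVALIDATE METADATA;"
impala-shell -i <Worker Node private IPAddress>:21000 -q "SELECT * FROM wordcount LIMIT 20;"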

5. Power BI integration with Data Lake Store and Impala (Optional)
5.1 Integrating with Data Lake Store

1. Launch Power BI Desktop on your computer.

2. From the Home ribbon, click Get Data, and then click More. In the Get Data
dialog box, click Azure, click Azure Data Lake Store, and then click Connect.

3. In the Microsoft Azure Data Lake Store dialog box, provide the URL to your Data Lake
Store account, and then click OK.

Note: Get the URL - Datalake Endpoint from the NodeDetails file. (Refer to section 4.1)

4. In the next dialog box, click Sign in to sign in to your Data Lake Store account. You will be
redirected to your organization's sign-in page. Follow the prompts to sign in to the account.

After you have successfully signed in, click Connect.

5. The next dialog box shows the file that you uploaded to your Data Lake Store account.
Verify the info and then click Load.

6. After the data has been successfully loaded into Power BI, you will see the available fields
in the Fields tab.

7. However, to visualize and analyze the data, you might prefer to reshape it to your
requirements. To do so, follow the steps below:
8. Select Edit Query from the top menu bar:

Under the Content column, right click on Table and select Add as New Query; you will see
a new query added in the Queries column:

9. Once again, right click and select Add as New Query to convert the table content to binary
form.

10. Right click and create a new query to get the data from the table as shown below:

11. You will see a file icon that represents the file that you uploaded. Right-click the file,
and click CSV.

12. Your data is now available in a format that you can use to create visualizations.

13. From the Home ribbon, click Close and Apply, and then click Close and Apply.

14. Once the query is updated, the Fields tab will show the new fields available for
visualization.

15. You can create a pie chart to represent your data. To do so, make the following selections:

a) From the Visualizations tab, click the symbol for a pie chart (see below).

b) Drag the columns that you want to use and represent in your pie chart from the Fields tab
to the Visualizations tab, as shown below:

16. From the file menu, click Save to save the visualization as a Power BI Desktop file.

5.2 Integrating with Impala

1. Go back to the Impala query you ran in section 4.9, where you queried the table created
using the output from ADLS copied to local HDFS.

2. Click the Export Results button in the Hue Impala UI, as seen in the above screenshot,
to download the output as a CSV file.

3. From the Home ribbon in Power BI, click Get Data, and then click More. In the Get
Data dialog box, click File, click Text/CSV, and then click Connect.

4. Select the CSV file exported from Impala in Step 2 and click on Open.

5. Click on Load.

6. Select the Data button to visualize the content.

You have successfully visualized the content exported from Impala using Power BI.

6. Reference
6.1 Configure SOCKS Proxy

Please refer to the Cloudera documentation below for help setting up a SOCKS proxy.

https://www.cloudera.com/documentation/director/latest/topics/director_get_started_azure_socks.html#concept_a35_k4l_zw
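
For Mac and Linux users, a minimal sketch of the equivalent tunnel from a terminal is shown below; it assumes the Director username and FQDN from your test drive access information (the FQDN here is only an example):

# Open a SOCKS proxy on local port 1080 through the Cloudera Director host.
ssh -C -N -D 1080 <Director Username>@cldrhyic.eastus.cloudapp.azure.com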

6.2 Restart Cloudera Management Service

You may need to restart the Cloudera Management Service if you see the errors below:

Error:
• Request to the Service Monitor failed. This may cause slow page responses. View the
status of the Service Monitor.
• Request to the Host Monitor failed. This may cause slow page responses. View the
status of the Host Monitor.

Restarting Cloudera Management Service:

1. Go to http://<Manager Node private IPAddress>:7180/cmf/home.

Example: http://10.3.0.5:7180/cmf/home

2. Go to Cloudera Management Service and select MGMT.

3. Click on the drop down menu and select Restart.

4. Confirm by clicking the Restart button.

5. Click on Close to complete the process.

Note: If you performed this restart in response to errors, re-run section 4.3 after completing
the steps above.

6.3 Error Messages While Running the Spark Job

1. You may see a few errors pop up while executing the Spark job that can safely
be ignored, such as the ones below.

Note: The permissions get set properly by the .sh file.

sh ClouderaSparkSetup.sh demotdweti 10.3.0.6

mkdir: Permission denied: user=cloudera, access=WRITE, inode="/":hdfs:supergroup:drwxr-xr-x
--2017-07-11 16:55:54-- https://aztdrepo.blob.core.windows.net/clouderadirector/wordcount.jar
Resolving aztdrepo.blob.core.windows.net... 52.238.56.168
Connecting to aztdrepo.blob.core.windows.net|52.238.56.168|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6371588 (6.1M) [application/octet-stream]
Saving to: "/home/cloudera/wordcount.jar"
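
For reference, the kind of commands the script most likely runs to resolve this permission error are shown below. This is an assumption about the script's behaviour, not an extract from it; you normally do not need to run these yourself:

# Create the user's HDFS home directory as the hdfs superuser and hand it over.
sudo -u hdfs hdfs dfs -mkdir -p /user/cloudera
sudo -u hdfs hdfs dfs -chown cloudera:cloudera /user/cloudera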

2. Cloudera Navigator lineage listener error – this error can safely be ignored.

INFO scheduler.DAGScheduler: Job 1 finished: saveAsTextFile at SparkWordCount.scala:32, took 1.811055 s
INFO spark.SparkContext: Invoking stop() from shutdown hook
ERROR scheduler.LiveListenerBus: Listener ClouderaNavigatorListener threw an exception
java.io.FileNotFoundException: Lineage is enabled but lineage directory /var/log/spark/lineage doesn't exist
    at com.cloudera.spark.lineage.ClouderaNavigatorListener.checkLineageEnabled(ClouderaNavigatorListener.scala:122)
    at com.cloudera.spark.lineage.

Note: You may refer to the Spark section of the Cloudera release notes for further
details (link below).

https://www.cloudera.com/documentation/enterprise/releasenotes/topics/cn_rn_known_issues.html

