Cloudera User Manual
Contents
1. About Cloudera
1.1. Cloudera Director
1.2. Cloudera Manager
2. Objective
3. Architecture
4. Getting Started
4.1. Accessing Cloudera Backend Cluster Details
4.2. Configure SOCKS Proxy
4.3. Configure Your Browser to Access Proxy
4.4. Accessing Cloudera Manager from Cloudera Director Web UI
4.5. Hue
4.6. Apache Spark (Run Spark App)
4.7. Viewing Jobs in UI
4.8. Hive
4.9. Impala
5. Power BI Integration with Data Lake Store and Impala (Optional)
5.1. Integrating with Data Lake Store
5.2. Integrating with Impala
6. Reference
6.1. Configure SOCKS Proxy
6.2. Restart Cloudera Management Service
6.3. Error Messages While Running the Spark Job
1. About Cloudera
Cloudera is an open-source Apache Hadoop distribution; CDH (Cloudera Distribution Including
Apache Hadoop) targets enterprise-class deployments of that technology.
Cloudera provides a scalable, flexible, integrated platform that makes it easy to manage rapidly
increasing volumes and varieties of data in your enterprise. Cloudera products and solutions
enable you to deploy and manage Apache Hadoop and related projects, manipulate and
analyze your data, and keep that data secure and protected.
Cloudera develops a Hadoop platform that integrates the most popular Apache Hadoop open-source
software in one place. Hadoop is an ecosystem, and setting up a cluster manually is a
pain. Going through each node, deploying the configuration throughout the cluster, deploying
your services, and restarting them across a wide cluster is a major drawback of distributed
systems and requires a lot of automation for administration. Cloudera developed a big data
Hadoop distribution that handles installation and updates on a cluster in a few clicks.
Cloudera also develops its own projects, such as Impala and Kudu, which improve Hadoop
integration and responsiveness in the industry.
1.1. Cloudera Director
Cloudera Director is designed for both long running and transient clusters. With long running
clusters, you deploy one or more clusters that you can scale up or down to adjust to demand.
With transient clusters, you can launch a cluster, schedule any jobs, and shut the cluster down
after the jobs complete.
The Cloudera Director server is designed to run in a centralized setup, managing multiple
Cloudera Manager instances and CDH clusters, with multiple users and user accounts. The
server works well for launching and managing large numbers of clusters in a production
environment.
1.2. Cloudera Manager
Cloudera Manager is a sophisticated application used to deploy, manage, monitor, and
diagnose issues with your CDH deployments. Cloudera Manager provides the Admin Console,
a web-based user interface that makes administration of your enterprise data simple and
straightforward. It also includes the Cloudera Manager API, which you can use to obtain
cluster health information and metrics, as well as configure Cloudera Manager.
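For reference, the Cloudera Manager API is a REST API served on the same host and port as the Admin Console. A minimal sketch of how it can be called, assuming API version v17 (the version shipped with Cloudera Manager 5.12) and the placeholders from the Access Information:
# Check the supported API version, then list the hosts in the deployment.
# <CM-WEB UI Username>, <CM-WEB UI Password>, and <Cloudera Manager Host> are placeholders.
curl -u <CM-WEB UI Username>:<CM-WEB UI Password> http://<Cloudera Manager Host>:7180/api/version
curl -u <CM-WEB UI Username>:<CM-WEB UI Password> http://<Cloudera Manager Host>:7180/api/v17/hosts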
2. Objective
NOTE: As this test drive provides access to the full Cloudera Director platform, deployment can
sometimes take up to 45 minutes. While you wait, please feel free to review the helpful content
in this manual, on Cloudera's Azure Marketplace product page, or on the Cloudera website.
Please also consider watching the demo video showcased on the test drive launch page on the
Azure Marketplace web site.
The test drive provisions Cloudera Director, the environment, Cloudera Manager, and a cluster
consisting of 1 master node and 3 worker nodes. The test drive also integrates with Azure Data
Lake Store.
The use case scenario for this test drive is to provide users with a test Azure Data Lake Store and:
1. Upload an input .TXT file to the cluster using Hue.
2. Run a Spark application that processes the input and stores its output on ADLS.
3. Create an Impala table on the ADLS output and query Impala from Hue or Power BI.
The following diagram shows how the data in this test case flows from a .TXT file via Hue
to ADLS, processed by Spark.
4. Getting Started
Once you have signed up or signed in to the test drive from the Azure Marketplace, the test
drive will deploy. Once it is finished deploying, you will receive all the necessary access
information and credentials like URLs, usernames, and passwords, delivered via email.
4.1. Accessing Cloudera Backend Cluster Details
Since the IP addresses of the Master and Worker nodes change for each deployment, you
must access the Cloudera backend cluster details to get the Node Details.
1. Log in to the Cloudera Director VM using the Cloudera Director FQDN address
provided in the test drive access information, and an SSH tool such as PuTTY (or Terminal
on a Mac), which this walkthrough refers to. (Download PuTTY here)
E.g. cldrhyic.eastus.cloudapp.azure.com
* Mac users may visit section 6.1 in the Reference section of this guide to learn how to set up
a tunnel using Terminal.
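For Mac or Linux users, a minimal sketch of the equivalent Terminal command, assuming the same dynamic port 1080 configured in the PuTTY step below (the username and FQDN come from your access information):
# Open a dynamic (SOCKS) tunnel on local port 1080.
ssh -D 1080 <Director Username>@<Cloudera Director FQDN>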
2. In the left navigation panel of PuTTY, navigate to Connection > SSH >Tunnels.
Enter the source port as ‘1080’, select Dynamic, click on Add, and then click Open
to connect to the VM:
3. Once connected, log in to the Cloudera Director VM using the Director Username and
then the Director Password from the provided test drive access credentials.
(Note: Passwords are hidden when typed or pasted in Linux terminals)
4. All the Cloudera backend cluster details are present in the NodeDetails file. Copy the
NodeDetails into a text file or Word document for reference; these details will be
used later.
cat NodeDetails
The NodeDetails file contains Node and URI details used by the Cloudera test drive
environment. These are gathered using a script that pulls the required data using API calls.
4.2. Configure SOCKS Proxy
To access the cluster web UIs, this procedure:
• Sets up a single SSH tunnel to one of the hosts on the network (the Cloudera Director
host) and creates a SOCKS proxy on that host.
• Changes the browser configuration to do all lookups through that SOCKS proxy host.
Note: In the previous section (4.1), the SSH tunnel has already been set up on source
port 1080.
1. Run the following command in the terminal, which turns the Linux machine into a SOCKS
proxy (copy/paste may not work):
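The exact command is shown in the test drive environment; a sketch of the usual form of such a command, assuming the port and credentials used elsewhere in this guide:
# Create a SOCKS proxy on port 1080 via SSH dynamic port forwarding.
# <Director Username> and <Cloudera Director FQDN> are placeholders from the access information.
ssh -CND 1080 <Director Username>@<Cloudera Director FQDN>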
When asked “Are you sure you want to continue connecting (yes/no)?”, type “yes” and
hit Enter. Then enter the Director Password mentioned in the provided access
credentials for the test drive.
Verify the SOCKS proxy is running using the command ps -ef | grep ssh on the console.
4.3. Configure Your Browser to Access Proxy
1. Configuring Chrome for Windows
a. Open Chrome settings and navigate to Show advanced settings > Network >
Change proxy settings. Click on LAN settings.
b. Check the Proxy Server checkbox and click on Advanced.
c. In the Socks row, enter localhost as the address and 1080 as the port, then click OK.
d. Click on OK to save the changes.
2. Configuring Microsoft Edge
a. Navigate to Settings > View advanced settings > Open proxy settings.
b. Set “Use a proxy server” to On. Enter Address as http://localhost with port as 1080
and click on Save.
3. Configuring Chrome for Linux
/usr/bin/google-chrome \
--user-data-dir="$HOME/chrome-with-proxy" \
--proxy-server="socks5://localhost:1080"
Note: Please visit section 6.1 in the Reference section later in this guide for additional
details and help for any error messages you may encounter.
4.4. Accessing Cloudera Manager from Cloudera Director Web UI
1. Access the Cloudera Director Web UI using the Cloudera Director Access URL provided in
the Access Information.
Eg: cldrhyic.eastus.cloudapp.azure.com:7189
2. Accept the End User License Terms and Conditions and click on Continue.
3. Log in to the Cloudera Director web console using the CD-WEB UI Username and
Password from the Access Information.
4. The Cloudera Director console should open. Click on the Cloudera Manager link from
the Cloudera Director Dashboard, as shown below.
5. Copy the Cloudera Manager Host address, along with the port number, and paste it into a
new browser tab.
6. Log in to the Cloudera Manager Console using the CM-WEB UI Username and CM-WEB UI
Password from the Access Information.
Note: The next step is to Restart Stale Services. We must do this to get the Azure
Service Principal updated in the configuration file core-site.xml, which is required to
integrate with Azure Data Lake Store.
7. In Cloudera Manager, click on the HDFS-1 service to Restart Stale Services.
9. Click on Restart Stale Services so the cluster can read the new configuration information.
11. Wait until all requested services have restarted, then click the Finish button.
12. Cloudera Director is now ready, with Cloudera Manager and the cluster (1
master and 3 worker nodes).
Note: Please visit section 6.2 in the Reference section later in this guide for additional
details and help for any error messages you may encounter.
4.5. Hue
Hue is a set of web applications that enable you to interact with a CDH cluster. Hue
applications let you browse HDFS and manage a Hive metastore. They also let you run Hive
and Cloudera Impala queries, HBase and Sqoop commands, Pig scripts, MapReduce jobs, and
Oozie workflows.
1. Copy the Cloudera Hue Web URL from the saved NodeDetails file and paste it into a
browser, which opens the Hue console.
2. Create a Hue account using the Cloudera Hue Web UI Username/Password from the
NodeDetails file.
3. You will be logged in to the Hue dashboard. On the right side of the page, click on the HDFS
browser icon, as shown in the below screenshot.
Note: CDH 5.12 has a new Hue UI. We recommend switching to Hue 3 from the admin
tab (see screenshot below).
4. Copy the data of the input file from the below link. Give the file any name (e.g. 'data' or
'input'), then save it in .txt format.
https://aztdrepo.blob.core.windows.net/clouderadirector/inputfile.txt
Once ready, click on Upload on the Hue file browser page (see below).
Note: Please ensure the inputfile is uploaded to the path /user/admin (see below):
6. The .txt file is now uploaded to Hue. The Spark application will use this data as input
and write the output to ADLS.
4.6. Apache Spark (Run Spark App)
Before running the Spark setup script, it is a good idea to install dos2unix, a program that
converts text files from DOS to UNIX format, ensuring everything will run in a Linux environment.
1. Log in to the Master VM over SSH (copy/paste may not work; a sketch of the login
command follows step 2).
2. Download the Spark setup script by running the following command:
wget https://raw.githubusercontent.com/sysgain/clouderatd/master/scripts/ClouderaSparkSetup.sh
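For step 1, a minimal sketch of the login command, assuming the master node details come from the NodeDetails file (the placeholders are hypothetical):
# SSH from the Cloudera Director VM to the master node.
ssh <Master Username>@<Master Node private IP>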
3. Convert the script to UNIX format and make it executable. (If dos2unix is not already
installed, it can typically be installed with: sudo yum install -y dos2unix.) Then run:
dos2unix /home/cloudera/ClouderaSparkSetup.sh
chmod 755 /home/cloudera/ClouderaSparkSetup.sh
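The script is then run with values taken from the NodeDetails file; the exact invocation is shown in the test drive. A hypothetical sketch of its shape only:
# Hypothetical invocation; the real argument list comes from NodeDetails.
sh /home/cloudera/ClouderaSparkSetup.sh <Datalake Endpoint> <inputfile.txt>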
Note: Replace the above values with the corresponding entries from the NodeDetails file, and
give the name of the input file that you just uploaded in Hue in place of <inputfile.txt>.
6. By executing the above script, the data is stored to ADLS by the Spark application.
Note: Please visit section 6.3 in the Reference section for additional details and help for
any error messages you may encounter.
4.7. Viewing Jobs in UI
1. Log in to the Cloudera Manager console (see section 4.4).
Example: http://10.3.0.5:7180/cmf/home
2. Click on YARN-1.
3. Click on the Applications tab in the top navigation menu to view the available jobs.
Each job has Summary and Detail information. A job Summary includes the following
attributes: start & end timestamps, query name (if the job is part of a Hive query),
queue, job type, job ID, and user.
4. You can also see the available applications by navigating to the Spark UI:
Example: http://10.3.0.5:7180/cmf/home
5. Navigate to the History Server WEB UI by going to http://<Master Node private
IP Address>:18088
Example: http://10.3.0.8:18088/
Note: Please visit section 6.3 in the Reference section for additional details and help for
any error messages you may encounter.
4.8. Hive
Apache Hive is a data warehouse software project built on top of Apache Hadoop for
providing data summarization, query, and analysis. Hive gives a SQL-like interface to
query data stored in various databases and file systems that integrate with Hadoop.
Now we will create a Hive table from the output of the Spark application stored on ADLS and
run a Hive query from Hue.
1. Navigate to the Query Editors drop-down menu in the Hue WEB UI and click on Hive.
2. In the default database, execute the below query:
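The query itself is shown as a screenshot in the test drive; a minimal sketch of an external table definition over the ADLS output (the table name, column list, and delimiter are hypothetical):
-- Hypothetical example; match the column list to the Spark output format.
CREATE EXTERNAL TABLE <tablename> (word STRING, cnt INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '<Output Data files on Datalake for the testdrive>';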
Note: Add any name for <tablename> and replace the <Output Data files on
Datalake for the testdrive> placeholder with the corresponding data from the
NodeDetails file.
4.9. Impala
Impala is an open-source, massively parallel processing query engine that runs on top of
clustered systems like Apache Hadoop. It is an interactive, SQL-like query engine that runs on
top of the Hadoop Distributed File System (HDFS). It integrates with the Hive metastore to
share table information between the two components.
1. Navigate to the Query Editors drop-down menu in the Hue WEB UI and click on Impala.
Note: Impala now integrates with ADLS from version CDH 5.12.
3. Execute the below query in the default database to sync the table metadata from Hive to Impala:
INVALIDATE METADATA;
4. View the table by running the following query:
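A minimal sketch, assuming the table name you created in section 4.8:
-- Hypothetical; replace <tablename> with your Hive table.
SELECT * FROM <tablename> LIMIT 10;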
5. You have now successfully run the Impala query using Hue!
5. Power BI Integration with Data Lake Store and Impala
(Optional)
5.1 Integrating with Data Lake Store
1. Launch Power BI Desktop.
2. From the Home ribbon, click Get Data, and then click More. In the Get Data
dialog box, click Azure, click Azure Data Lake Store, and then click Connect.
3. In the Microsoft Azure Data Lake Store dialog box, provide the URL to your Data Lake
Store account, and then click OK.
Note: Get the URL - Datalake Endpoint from the NodeDetails file. (Refer to section 4.1)
4. In the next dialog box, click Sign in to sign in to your Data Lake Store account. You will be
redirected to your organization's sign-in page. Follow the prompts to sign in to the account.
5. The next dialog box shows the file that you uploaded to your Data Lake Store account.
Verify the info and then click Load.
6. After the data has been successfully loaded into Power BI, you will see the available fields
in the Fields tab.
7. However, to visualize and analyze the data, you might prefer the data to be available as per
your requirements. To do so, follow the steps below:
8. Under the Content column, right-click on Table and select Add as New Query; you will see
a new query added in the Queries column:
9. Once again, right-click and select Add as New Query to convert the table content to binary
form.
10. Right-click and create a new query to get the data from the table, as shown below:
11. You will see a file icon that represents the file that you uploaded. Right-click the file,
and click CSV.
12. Your data is now available in a format that you can use to create visualizations.
13. From the Home ribbon, click Close and Apply.
14. Once the query is updated, the Fields tab will show the new fields available for
visualization.
15. You can create a pie chart to represent your data. To do so, make the following selections:
a) From the Visualizations tab, click the symbol for a pie chart (see below).
b) Drag the columns that you want to use and represent in your pie chart from the Fields tab
to the Visualizations tab, as shown below:
16. From the file menu, click Save to save the visualization as a Power BI Desktop file.
5.2. Integrating with Impala
1. Go to section 4.9, where you ran a query against the table created using the
output from ADLS copied to local HDFS.
2. Click the Export Results button in the Hue Impala UI, as seen in the above screenshot,
to download the output as a CSV file.
3. From the Home ribbon in Power BI, click Get Data, and then click More. In the Get
Data dialog box, click File, click Text/CSV, and then click Connect.
4. Select the CSV file exported from Impala in Step 2 and click on Open.
5. Click on Load.
6. Select the Data button to visualize the content.
You have successfully visualized the content exported from Impala using Power BI.
6. Reference
6.1 Configure SOCKS Proxy
Please refer to the below documentation from Cloudera for help setting up a SOCKS proxy.
https://www.cloudera.com/documentation/director/latest/topics/director_get_started_azure
_socks.html#concept_a35_k4l_zw
6.2 Restart Cloudera Management Service
You may need to restart the Cloudera Management Service if you encounter the below errors:
Error:
• Request to the Service Monitor failed. This may cause slow page responses. View the
status of the Service Monitor.
• Request to the Host Monitor failed. This may cause slow page responses. View the
status of the Host Monitor.
1. Log in to the Cloudera Manager console (see section 4.4).
Example: http://10.3.0.5:7180/cmf/home
2. Go to Cloudera Management Service and select MGMT.
4. Confirm by clicking the Restart button.
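Alternatively, the Cloudera Management Service can be restarted through the Cloudera Manager API. A minimal sketch, assuming API version v17 (Cloudera Manager 5.12) and the placeholders from the Access Information:
# Trigger a restart of the Cloudera Management Service via the REST API.
curl -X POST -u <CM-WEB UI Username>:<CM-WEB UI Password> http://<Cloudera Manager Host>:7180/api/v17/cm/service/commands/restart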
Note: If you performed this restart in response to errors, re-run section 4.3
after performing the above steps.
6.3 Error Messages While Running the Spark Job
1. You may see a few errors pop up while executing the Spark job that can safely
be ignored, such as the ones shown below.
Note: You may refer to the Spark section of the Cloudera release notes for further
details (link below).
https://www.cloudera.com/documentation/enterprise/releasenotes/topics/cn_rn_known_issues.html