Cloudera Microsoft Azure Hadoop Deployment Guide
SCENARIO:
Your Management: is talking euphorically about Big Data.
You: are carefully skeptical, as it will most likely all land on your desk anyway.
Or, it has already landed on you, with the nice project description of: "Go figure
this Hadoop thing out."
PREPARATION:
Verify your environment. Go to Cloudera Manager in your demo environment and
make sure the following services are up and running (have a green status dot next
to them in the Cloudera Manager HOME Status view):
• Apache Impala - You will use this for interactive queries
• Apache Hive - You will use this for structured storage (i.e., tables in the Hive
metastore)
• HUE - You will use this for end-user query access
• HDFS - You will use this for distributed data storage
• YARN – This is the processing framework used by Hive (includes MR2)
If any of the services show yellow or red, restart the service or reach out to
the discussion forum for further assistance.
Setup
For the remainder of this tutorial, we will present examples in the context of a fictional corporation called DataCo. Our mission is to help this organization get
better insight by asking bigger questions.
Now that you have verified that your services are healthy and showing green,
you can continue.
Exercise 1: Ingest and query relational data
In this scenario, DataCo’s business question is: What products do our
customers like to buy? To answer this question, the first thought might be to
look at the transaction data, which should indicate what customers do buy
and like to buy, right?
This is probably something you can do in your regular RDBMS environment, but
a benefit of Apache Hadoop is that you can do it at greater scale at lower cost,
on the same system that you may also use for many other types of analysis.
What this exercise demonstrates is how to do the same thing you already
know how to do, but in CDH. Seamless integration is important when
evaluating any new infrastructure. Hence, it’s important to be able to do what
you normally do, and not break any regular BI reports or workloads over the
dataset you plan to migrate.
To analyze the transaction data in the new platform, we need to ingest it into
the Hadoop Distributed File System (HDFS). We need to find a tool that easily
transfers structured data from an RDBMS to HDFS, while preserving structure.
That enables us to query the data, but not interfere with or break any regular
workload on it.
Apache Sqoop, which is part of CDH, is that tool. The nice thing about Sqoop is that
we can automatically load our relational data from MySQL into HDFS, while preserving
the structure. With a few additional configuration parameters, we can take this one
step further and load this relational data directly into a form ready to be queried
by Apache Impala, the MPP analytic database included with CDH, and other workloads.

You should first log in to the Master Node of your cluster via a terminal. Then,
launch the Sqoop job:

> sqoop import-all-tables \
    -m {{cluster_data.worker_node_hostname.length}} \
    --connect jdbc:mysql://{{cluster_data.manager_node_hostname}}:3306/retail_db \
    --username=retail_dba \
    --password=cloudera \
    --compression-codec=snappy \
    --as-parquetfile \
    --warehouse-dir=/user/hive/warehouse \
    --hive-import

This command may take a while to complete, but it is doing a lot. It is launching
MapReduce jobs to pull the data from our MySQL database and write the data to HDFS
in parallel, distributed across the cluster in Apache Parquet format. It is also
creating tables to represent the HDFS files in Impala/Apache Hive with matching schema.

Parquet is a format designed for analytical applications on Hadoop. Instead of
grouping your data into rows like typical data formats, it groups your data into
columns. This is ideal for many analytical queries where instead of retrieving
data from specific records, you're analyzing relationships between specific
variables across many records. Parquet is designed to optimize data storage
and retrieval in these scenarios.
VERIFICATION
When this command is complete, confirm that your data files exist in HDFS.
The commands below will show the directories and the files inside them
that make up your tables.
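A minimal check along these lines should work, assuming the warehouse directory used
in the Sqoop command above (the table name categories is just one example of an
imported table; substitute any of your tables):

> hadoop fs -ls /user/hive/warehouse/
> hadoop fs -ls /user/hive/warehouse/categories/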
Note: The number of .parquet files shown will be equal to what was passed
to Sqoop with the -m parameter. This is the number of ‘mappers’ that Sqoop
will use in its MapReduce jobs. It could also be thought of as the number of
simultaneous connections to your database, or the number of disks / Data
Nodes you want to spread the data across. So, on a single-node cluster you will
see just one file per table, but larger clusters will have a greater number of files.
Hive and Impala also allow you to create tables by defining a schema over existing
files with 'CREATE EXTERNAL TABLE' statements, similar to traditional relational
databases. But Sqoop already created these tables for us, so we can go ahead and
query them.

We're going to use Hue's Impala app to query our tables. Hue provides a web-based
interface for many of the tools in CDH and can be found on port 8888 of your
Manager Node. In the QuickStart VM, the administrator username for Hue is 'cloudera'
and the password is 'cloudera'.

Once you are inside of Hue, click on Query Editors, and open the Impala Query Editor.

To save time during queries, Impala does not poll constantly for metadata changes.
So, the first thing we must do is tell Impala that its metadata is out of date.
Then we should see our tables show up, ready to be queried:

invalidate metadata;
show tables;

You can also click on the "Refresh Table List" icon on the left to see your new
tables in the side menu.
Now that your transaction data is readily available for structured queries in CDH,
it's time to address DataCo's business question. Copy and paste or type in the
following standard SQL example queries for calculating total revenue per product
and showing the top 10 revenue-generating products:
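For example, a query along these lines returns the top 10 revenue-generating
products. It is a sketch that assumes the standard retail_db sample schema imported
above (products, orders, and order_items tables); adjust names if your schema differs.

-- Total revenue per product, top 10 (illustrative; assumes the retail_db sample schema)
SELECT p.product_id, p.product_name, r.revenue
FROM products p
JOIN (
    SELECT oi.order_item_product_id,
           SUM(CAST(oi.order_item_subtotal AS FLOAT)) AS revenue
    FROM order_items oi
    JOIN orders o ON oi.order_item_order_id = o.order_id
    WHERE o.order_status NOT IN ('CANCELED', 'SUSPECTED_FRAUD')
    GROUP BY oi.order_item_product_id
) r ON p.product_id = r.order_item_product_id
ORDER BY r.revenue DESC
LIMIT 10;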
You may notice that we told Sqoop to import the data into Hive but used
Impala to query the data. This is because Hive and Impala can share both
data files and the table metadata. Hive works by compiling SQL queries into
MapReduce jobs, which makes it very flexible, whereas Impala executes
queries itself and is built from the ground up to be as fast as possible, which
makes it better for interactive analysis. We’ll use Hive later for an ETL (extract-
transform-load) workload.
CONCLUSION
You have now gone through the first basic steps: using Sqoop to ingest structured data into HDFS,
transforming it into the Parquet file format, and creating Hive tables to use when you query this data.
You have also learned how to query tables using Impala, and seen that you can use regular interfaces
and tools (such as SQL) within a Hadoop environment as well. The idea is that you can run the same
reports you usually do, while the architecture of Hadoop provides much larger scale and flexibility
than traditional systems.
Showing big data value
SCENARIO:
Your Management: is indifferent; you produced what you always produce, a
report on structured data, but you really didn't prove any additional value.
You: are either also indifferent and just go back to what you have always
done... or you have an ace up your sleeve.
PREPARATION:
Go to Cloudera Manager’s home page and verify the following services are up:
• Impala
• Hive
• HDFS
• Hue
Exercise 2: Correlate structured data with unstructured data
Since you are a pretty smart data person, you realize another interesting business
question would be: are the most viewed products also the most sold? Since Hadoop can
store unstructured and semi-structured data alongside structured data without
remodeling an entire database, you can just as well ingest, store, and process web
log events. Let's find out what site visitors have viewed the most.

For this, you need the web clickstream data. The most common way to ingest web
clickstream data is to use Apache Flume. Flume is a scalable real-time ingest
framework that allows you to route, filter, aggregate, and do "mini-operations"
on data on its way into the scalable processing platform.

BULK UPLOAD DATA
For your convenience, we have pre-loaded some sample access log data into
/opt/examples/log_data/access.log.2. Let's move this data from the local
filesystem into HDFS.

> sudo -u hdfs hadoop fs -mkdir /user/hive/warehouse/original_access_logs
> sudo -u hdfs hadoop fs -copyFromLocal /opt/examples/log_files/access.log.2 /user/hive/warehouse/original_access_logs
Again, we need to tell Impala that some tables have been created through a different
tool. Switch back to the Impala Query Editor app, and enter the following command:

invalidate metadata;

Now, if you enter the 'show tables;' query or refresh the table list in the left-hand
column, you should see the two new external tables in the default database. Paste a
query into the Query Editor that counts how often each product page was viewed (a
representative sketch appears at the end of this exercise), and compare the most
viewed products with the top-selling products from Exercise 1. The findings look odd:
the most viewed product is not among the top sellers.

Well, in our example with DataCo, once these odd findings are presented to your
manager, the issue is immediately escalated. Eventually, someone figures out that on
that view page, where most visitors stopped, the sales path of the product had a typo
in the price for the item. Once the typo was fixed, and a correct price was displayed,
the sales for that SKU started to rapidly increase.
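A representative sketch of that view-count query, assuming the web log data ends up
in a table like the tokenized_access_logs table referenced later in this guide, with
a url field (adjust names to your actual external tables):

-- Illustrative only: table and column names are assumptions
SELECT url, COUNT(*) AS views
FROM tokenized_access_logs
WHERE url LIKE '%/product/%'
GROUP BY url
ORDER BY views DESC
LIMIT 10;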
SCENARIO:
Your Management: can’t believe the magic you do with data and is about to
promote you and invest in a new team under your lead... when all hell breaks
loose. You get an emergency call - as you are now the go-to person - and your
manager is screaming about the loss of sales over the last three days.
You: go from slightly excited to under the gun in seconds... well, lucky for you,
there might be a quick way to find out what is happening.
PREPARATION:
Go to Cloudera Manager and verify these services are running:
• HDFS
• Hue
• Solr
• YARN
Advanced analytics on the same platform
SCENARIO:
Your Management: is of course thrilled with the recent discoveries you
helped them with—you basically saved them a lot of money! They start giving
you bigger questions, and more funding (we really hope the latter!)
You: are excited to dive into more advanced use cases, but you know that you'll
need even more funding from the organization. You decide to really show off!
PREPARATION:
Go to Cloudera Manager and verify these services are running:
• HDFS
• Spark
• YARN / MR2
Exercise 3: Explore log events interactively
Since sales are dropping and nobody knows why, you want to provide a way for people
to interactively and flexibly explore data from the website. We can do this by
indexing it for use in Apache Solr, where users can do text searches, drill down
through different categories, etc. Data can be indexed by Solr in batch using
MapReduce, or you can index tables in Apache HBase and get real-time updates. To
analyze data from the website, however, we're going to stream the log data in using
Flume.

Solr organizes data similarly to the way a SQL database does. Each record is called
a 'document' and consists of fields defined by the schema: just like a row in a
database table. Instead of a table, Solr calls it a 'collection' of documents. The
difference is that data in Solr tends to be more loosely structured. Fields may be
optional, and instead of always matching exact values, you can also enter text
queries that partially match a field, just like you're searching for web pages.
You'll also see Hue refer to 'shards' - and that's just the way Solr breaks
collections up to spread them around the cluster so you can search all your data
in parallel.

The web log data is a standard web server log, which may look something like this:
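Here is an illustrative line in the common Apache combined log format (all values
made up):

192.168.1.105 - - [14/Jun/2014:10:30:13 -0400] "GET /department/apparel/products HTTP/1.1" 200 931 "-" "Mozilla/5.0 (Windows NT 6.1; rv:29.0) Gecko/20100101 Firefox/29.0"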
Here is how you can start real-time indexing via Cloudera Search and Flume over the sample web server log data and use the Search UI in Hue to explore it. First, generate a skeleton Solr configuration:

> cd /opt/examples/flume
> solrctl --zk {{zookeeper_connection_string}}/solr instancedir --generate solr_configs

The result of this command is a skeleton configuration that you can customize to
your liking via conf/schema.xml. Next, to create your collection, upload this
configuration as live_logs:

> solrctl --zk {{zookeeper_connection_string}}/solr instancedir --create live_logs ./solr_configs
The key player in this tutorial is Flume. Flume is a system for collecting,
aggregating, and moving large amounts of log data from many different
sources to a centralized data store.
With a few simple configuration files, we can use Flume and a morphline (a
simple way to accomplish on-the-fly ETL) to load our data into our Solr index.
(Note: You can use Flume to load many other types of data stores; Solr is just
the example we are using for this tutorial.)
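As a rough sketch of what those configuration files look like, the following Flume
agent definition tails the sample log and hands each event to a MorphlineSolrSink.
The agent name, channel sizing, and the morphline.conf path are assumptions for
illustration, not the tutorial's actual files; the morphline file itself (which
parses the log lines and indexes them into Solr) is not shown.

# Illustrative Flume agent (names and paths are assumptions)
agent1.sources = weblog_source
agent1.channels = mem_channel
agent1.sinks = solr_sink

# Tail the sample access log as a stream of events
agent1.sources.weblog_source.type = exec
agent1.sources.weblog_source.command = tail -F /opt/examples/log_files/access.log.2
agent1.sources.weblog_source.channels = mem_channel

# Buffer events in memory between source and sink
agent1.channels.mem_channel.type = memory
agent1.channels.mem_channel.capacity = 10000
agent1.channels.mem_channel.transactionCapacity = 1000

# Run each event through a morphline and index it into Solr
agent1.sinks.solr_sink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
agent1.sinks.solr_sink.channel = mem_channel
agent1.sinks.solr_sink.morphlineFile = /opt/examples/flume/morphline.conf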
If one of these steps fails, please reach out to the Discussion Forum and get help. Otherwise, you can start exploring the log data and understand what is going on.
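For example, once the live_logs collection is selected in Hue's Search app, a query
like the following narrows results to one department over the last three days. The
field names (department, request_date) are taken from the dashboard exercise that
follows, and the department value is made up; the date range uses standard Solr
date math.

department:apparel AND request_date:[NOW-3DAYS TO NOW]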
For our story's sake, we pretend that you started indexing data at the same time as you started ingesting it (via Flume) into the platform, so that when your manager
escalated the issue, you could immediately drill down into data from the last three days and explore what happened. For example, perhaps you noted a lot of
DDoS events and could take the right measures to preempt the attack. Problem solved! Management is fantastically happy with your recent contributions, which
of course leads to a great bonus or something similar. :D
Exercise 4: Building a dashboard
To get started with building a dashboard with Hue, click on the pen icon. This will take you into edit mode, where you can choose different widgets
and layouts that you would like to see. You can choose a few options and
configurations here, but for now, just drag a bar chart into the top gray row.
This will bring up the list of fields that are present in our index so that you can
choose which field you would like to group by. Let's choose request_date. For the
sake of our display, choose +15MINUTES for the INTERVAL.
You aren’t limited to a single column; you can view this as a two-column display as
well. Select the two-column layout from the top left. While you’re here, let’s drag
a pie chart to the newly created row in the left column. This time, let’s choose
department as the field that we want to group by for our pie chart.
Things are really starting to take shape! Let’s add a Facet filter to the left-hand
side and select product as the facet. Now that we are satisfied with our changes,
let’s click on the pencil icon to exit edit mode.
And save our dashboard. At the Hue project’s blog, you can find a wide selection of
video tutorials for accomplishing other tasks in Hue, including a video of a Search
dashboard similar to this example being created.
DataCo has moved into bigger business thanks to the Big Data projects you’ve
contributed to. As more and more users start using the Enterprise Data Hub you
built, it starts getting more complicated to manage and trace data and access to
data. In addition, as your previous deliveries created such success, the company
has decided to build out a full EDH strategy, and as a result lots of sensitive
data is headed for your cluster too: credit card transactions, social security
data, and other financial records for DataCo. Some people are questioning exactly
how the decision was made to change the pricing on the website. You realize this
is the perfect chance to prove yourself again.

Your Management: is worried about security controls and the ability to audit
access for compliance.

You: need to resolve their concerns, and you want to make it easier to manage
who does what on your cluster for back-charging purposes too.

You need to demonstrate that you can:
• Easily show who has queried the data
• Show exactly what has been done with it since it was created
• Enforce policies about how it gets managed in the future

Cloudera Navigator provides a solution to all these problems. If using Cloudera
Live, you can find a link in the Govern Your Data section of your Guidance Page,
which your welcome email will direct you to. If using the QuickStart VM, the
username is ‘cloudera’ and the password is ‘cloudera’.
Exercise 5: Cloudera Navigator
DISCOVERY
The first thing you see when you log into Cloudera Navigator is a search tool. It’s
an excellent way to find data on your cluster, even when you don’t know exactly
what you’re looking for. Go ahead and click the link to ‘explore your data’.

You know that the old web server log data you analyzed was a Hive table, so
select ‘Hive’ under ‘Source Type’, and ‘Table’ under ‘Type’. You’re also pretty
sure it had ‘access log’ in the name, so enter this search query at the top and
hit enter:

*access*log*
LINEAGE
Now that you’ve found the data you were looking for, click on the table and
you’ll see a graph of the data’s lineage. You’ll see the tokenized_access_logs
table on the right and the underlying file with the same name in HDFS in blue.
You’ll also see the other Hive table you created from the original file and the
query you ran to transform the data between the two. (The different colors
represent different source types: yellow data comes from Hive, blue data
comes directly from HDFS.)

As you click on the nodes in this graph, more detail will appear. If you click on
the tokenized_access_logs table and the intermediate_access_logs table,
you’ll see arrows for each individual field running through that query. You can
see how quickly you could trace the origin of datasets even in a much busier
and more complicated environment!
AUDITING
Now you’ve shown where the data came from, but we still need to show what’s
been done with it. Go to the ‘Audits’ tab, using the link in the top-right corner.

As you can see, there are hundreds of events that have been recorded, each with
details of what was done, by whom, and when. Let’s narrow down what we’re
looking for again. Open the “Filters” menu from below the “Audit Events” heading.
Click the + icon twice to add two new filters. For the first filter, set the property
to ‘Username’ and fill in ‘admin’ as the value. For the second filter, set the
property to ‘Operation’ and fill in ‘QUERY’ as the value. Then click ‘Apply’.

As you click on the individual results, you can see the exact queries that were
executed and all related details.

You can also view and create reports based on the results of these searches
in the left-hand corner. There’s already a report called “Recent Denied
Accesses”. If you check that report now, you may see that in the course
of this tutorial, some tools have tried to access a directory called ‘/user/
anonymous’ that we haven’t set up, and that the services don’t have
permission to create.
POLICIES
It’s a relief to be able to audit access to your cluster and see there’s no
unexpected or unauthorized activity going on. But wouldn’t it be even better
if you could automatically apply policies to data? Let’s open the Policies tab in
the top-right corner and create a policy to make the data we just audited
easier to find in the future.

Click the + icon to add a new policy and name it “Tag Insecure Data”.
Check the box to enable the policy, and enter the following as the search query:

(permissions:”rwxrwxrwx”) AND (sourceType:hdfs) AND (type:file OR type:directory) AND (deleted:false)
This query will detect any files in HDFS that allow anyone to read, write, and
execute. It’s common for people to set these permissions to make sure everything
works, but your organization may want to refine this practice as you move into
production or implement more strict practices for some data sets.

To apply this tag on existing data, set the schedule to “Immediate”, and check the
box “Assign Metadata”. Under tags, enter “insecure”, and then click “Add Tag”.
Save the policy.
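If you do decide to tighten such permissions later, the change itself is a one-liner;
a sketch, using an example path from earlier in this tutorial:

> sudo -u hdfs hadoop fs -chmod -R 750 /user/hive/warehouse/original_access_logs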
CONCLUSION
You’ve now experienced how to use Cloudera Navigator for discovery of data and metadata. This
powerful tool makes it easy to audit access, trace data lineage, and enforce policies.
With more data and more data formats available in a multi-tenant environment, data lineage and
governance become more challenging. Cloudera Navigator provides enterprise-grade governance that’s
built into the foundation of Apache Hadoop.
You can learn more about the various management features provided by Cloudera Manager in the
Cloudera Administrator Training for Apache Hadoop.
The End Game
We hope you have enjoyed this basic tutorial, and that you:
• Have a better understanding of some of the popular tools in CDH
• Know how to set up some basic and familiar BI use cases, as well as web log analytics and real-time search
• Can explain to your manager why you deserve a raise!

NEXT STEPS
If you’re ready to install Cloudera’s platform on your own cluster (on premise or in the public cloud), there are a few options:
• Try the AWS Quick Start for easy deployment of Cloudera’s platform on AWS clusters via your own account (promo credit available)
• Try the Azure Test Drive for Cloudera Director (3-hour sandbox to provision EDH clusters on Azure)