Cloudera Microsoft Azure Hadoop Deployment Guide
SCENARIO:
Your Management: is talking euphorically about Big Data.
You: are carefully skeptical, as it will most likely all land on your desk anyway.
Or, it has already landed on you, with the nice project description of: "Go figure
this Hadoop thing out."
PREPARATION:
Verify your environment. Go to Cloudera Manager in your demo environment and
make sure the following services are up and running (have a green status dot next
to them in the Cloudera Manager HOME Status view):
• Apache Impala - You will use this for interactive queries
• Apache Hive - You will use this for structured storage (i.e., tables in the Hive
metastore)
• HUE - You will use this for end-user query access
• HDFS - You will use this for distributed data storage
• YARN – This is the processing framework used by Hive (includes MR2)
If any of the services show yellow or red, restart the service or reach out to
the discussion forum for further assistance.
Setup
For the remainder of this tutorial, we will present examples in the context of a fictional corporation called DataCo. Our mission is to help this organization get
better insight by asking bigger questions.
Now that you have verified that your services are healthy and showing green,
you can continue.
Exercise 1: Ingest and query relational data
In this scenario, DataCo’s business question is: What products do our
customers like to buy? To answer this question, the first thought might be to
look at the transaction data, which should indicate what customers do buy
and like to buy, right?
This is probably something you can do in your regular RDBMS environment, but
a benefit of Apache Hadoop is that you can do it at greater scale at lower cost,
on the same system that you may also use for many other types of analysis.
What this exercise demonstrates is how to do the same thing you already
know how to do, but in CDH. Seamless integration is important when
evaluating any new infrastructure. Hence, it’s important to be able to do what
you normally do, and not break any regular BI reports or workloads over the
dataset you plan to migrate.
To analyze the transaction data in the new platform, we need to ingest it into
the Hadoop Distributed File System (HDFS). We need to find a tool that easily
transfers structured data from an RDBMS to HDFS, while preserving structure.
That enables us to query the data, but not interfere with or break any regular
workload on it.
Apache Sqoop, which is part of CDH, is that tool. The nice thing about Sqoop is that
we can automatically load our relational data from MySQL into HDFS, while preserving
the structure. With a few additional configuration parameters, we can take this one
step further and load this relational data directly into a form ready to be queried
by Apache Impala, the MPP analytic database included with CDH, and other workloads.

You should first log in to the Master Node of your cluster via a terminal. Then,
launch the Sqoop job:

> sqoop import-all-tables \
    -m {{cluster_data.worker_node_hostname.length}} \
    --connect jdbc:mysql://{{cluster_data.manager_node_hostname}}:3306/retail_db \
    --username=retail_dba \
    --password=cloudera \
    --compression-codec=snappy \
    --as-parquetfile \
    --warehouse-dir=/user/hive/warehouse \
    --hive-import

This command may take a while to complete, but it is doing a lot. It is launching
MapReduce jobs to pull the data from our MySQL database and write the data to HDFS
in parallel, distributed across the cluster in Apache Parquet format. It is also
creating tables to represent the HDFS files in Impala/Apache Hive with matching schema.

Parquet is a format designed for analytical applications on Hadoop. Instead of
grouping your data into rows like typical data formats, it groups your data into
columns. This is ideal for many analytical queries where instead of retrieving
data from specific records, you're analyzing relationships between specific
variables across many records. Parquet is designed to optimize data storage
and retrieval in these scenarios.
VERIFICATION
When this command is complete, confirm that your data files exist in HDFS.
The commands below will show the directories and the files inside them
that make up your tables.
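A minimal check along these lines should work, assuming the warehouse directory used
in the Sqoop command above (the table name categories is just one example of an
imported table; substitute any of your tables):

> hadoop fs -ls /user/hive/warehouse/
> hadoop fs -ls /user/hive/warehouse/categories/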
Note: The number of .parquet files shown will be equal to what was passed
to Sqoop with the -m parameter. This is the number of ‘mappers’ that Sqoop
will use in its MapReduce jobs. It could also be thought of as the number of
simultaneous connections to your database, or the number of disks / Data
Nodes you want to spread the data across. So, on a single-node cluster you will
see just one file per table, but larger clusters will have a greater number of files.
Hive and Impala also allow you to create tables by defining a schema over existing
files with 'CREATE EXTERNAL TABLE' statements, similar to traditional relational
databases. But Sqoop already created these tables for us, so we can go ahead and
query them.

We're going to use Hue's Impala app to query our tables. Hue provides a web-based
interface for many of the tools in CDH and can be found on port 8888 of your
Manager Node. In the QuickStart VM, the administrator username for Hue is 'cloudera'
and the password is 'cloudera'.

Once you are inside of Hue, click on Query Editors, and open the Impala Query Editor.

To save time during queries, Impala does not poll constantly for metadata changes.
So, the first thing we must do is tell Impala that its metadata is out of date.
Then we should see our tables show up, ready to be queried:

invalidate metadata;
show tables;

You can also click on the "Refresh Table List" icon on the left to see your new
tables in the side menu.
Now that your transaction data is readily available for structured queries in CDH,
it's time to address DataCo's business question. Copy and paste or type in the
following standard SQL example queries for calculating total revenue per product
and showing the top 10 revenue-generating products:
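For example, a query along these lines returns the top 10 revenue-generating
products. It is a sketch that assumes the standard retail_db sample schema imported
above (products, orders, and order_items tables); adjust names if your schema differs.

-- Total revenue per product, top 10 (illustrative; assumes the retail_db sample schema)
SELECT p.product_id, p.product_name, r.revenue
FROM products p
JOIN (
    SELECT oi.order_item_product_id,
           SUM(CAST(oi.order_item_subtotal AS FLOAT)) AS revenue
    FROM order_items oi
    JOIN orders o ON oi.order_item_order_id = o.order_id
    WHERE o.order_status NOT IN ('CANCELED', 'SUSPECTED_FRAUD')
    GROUP BY oi.order_item_product_id
) r ON p.product_id = r.order_item_product_id
ORDER BY r.revenue DESC
LIMIT 10;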
You may notice that we told Sqoop to import the data into Hive but used
Impala to query the data. This is because Hive and Impala can share both
data files and the table metadata. Hive works by compiling SQL queries into
MapReduce jobs, which makes it very flexible, whereas Impala executes
queries itself and is built from the ground up to be as fast as possible, which
makes it better for interactive analysis. We’ll use Hive later for an ETL (extract-
transform-load) workload.
CONCLUSION
You have now gone through the first basic steps: using Sqoop to ingest structured data into HDFS,
transforming it into the Parquet file format, and creating Hive tables to use when you query this data.
You have also learned how to query tables using Impala, and seen that you can use regular interfaces
and tools (such as SQL) within a Hadoop environment as well. The idea is that you can run the same
reports you usually do, while the architecture of Hadoop provides much larger scale and flexibility
than traditional systems.
Showing big data value
SCENARIO:
Your Management: is indifferent; you produced what you always produce, a
report on structured data, but you really didn't prove any additional value.
You: are either also indifferent and just go back to what you have always
done... or you have an ace up your sleeve.
PREPARATION:
Go to Cloudera Manager’s home page and verify the following services are up:
• Impala
• Hive
• HDFS
• Hue
Exercise 2: Correlate structured data with unstructured data
Since you are a pretty smart data person, you realize another interesting business
question would be: are the most viewed products also the most sold? Since Hadoop can
store unstructured and semi-structured data alongside structured data without
remodeling an entire database, you can just as well ingest, store, and process web
log events. Let's find out what site visitors have viewed the most.

For this, you need the web clickstream data. The most common way to ingest web
clickstream data is to use Apache Flume. Flume is a scalable real-time ingest
framework that allows you to route, filter, aggregate, and do "mini-operations"
on data on its way into the scalable processing platform.

BULK UPLOAD DATA
For your convenience, we have pre-loaded some sample access log data into
/opt/examples/log_data/access.log.2. Let's move this data from the local
filesystem into HDFS.

> sudo -u hdfs hadoop fs -mkdir /user/hive/warehouse/original_access_logs
> sudo -u hdfs hadoop fs -copyFromLocal /opt/examples/log_files/access.log.2 /user/hive/warehouse/original_access_logs
Again, we need to tell Impala that some tables have been created through a different
tool. Switch back to the Impala Query Editor app, and enter the following command:

invalidate metadata;

Now, if you enter the 'show tables;' query or refresh the table list in the left-hand
column, you should see the two new external tables in the default database. Paste a
query into the Query Editor that counts how often each product page was viewed (a
representative sketch appears at the end of this exercise), and compare the most
viewed products with the top-selling products from Exercise 1. The findings look odd:
the most viewed product is not among the top sellers.

Well, in our example with DataCo, once these odd findings are presented to your
manager, the issue is immediately escalated. Eventually, someone figures out that on
that view page, where most visitors stopped, the sales path of the product had a typo
in the price for the item. Once the typo was fixed, and a correct price was displayed,
the sales for that SKU started to rapidly increase.
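A representative sketch of that view-count query, assuming the web log data ends up
in a table like the tokenized_access_logs table referenced later in this guide, with
a url field (adjust names to your actual external tables):

-- Illustrative only: table and column names are assumptions
SELECT url, COUNT(*) AS views
FROM tokenized_access_logs
WHERE url LIKE '%/product/%'
GROUP BY url
ORDER BY views DESC
LIMIT 10;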
SCENARIO:
Your Management: can’t believe the magic you do with data and is about to
promote you and invest in a new team under your lead... when all hell breaks
loose. You get an emergency call - as you are now the go-to person - and your
manager is screaming about the loss of sales over the last three days.
You: go from slightly excited to under the gun in seconds... well, lucky for you,
there might be a quick way to find out what is happening.
PREPARATION:
Go to Cloudera Manager and verify these services are running:
• HDFS
• Hue
• Solr
• YARN
Advanced analytics on the same platform
SCENARIO:
Your Management: is of course thrilled with the recent discoveries you
helped them with—you basically saved them a lot of money! They start giving
you bigger questions, and more funding (we really hope the latter!)
You: are excited to dive into more advanced use cases, but you know that you'll
need even more funding from the organization. You decide to really show off!
PREPARATION:
Go to Cloudera Manager and verify these services are running:
• HDFS
• Spark
• YARN / MR2
Exercise 3: Explore log events interactively
Since sales are dropping and nobody knows why, you want to provide a way for people
to interactively and flexibly explore data from the website. We can do this by
indexing it for use in Apache Solr, where users can do text searches, drill down
through different categories, etc. Data can be indexed by Solr in batch using
MapReduce, or you can index tables in Apache HBase and get real-time updates. To
analyze data from the website, however, we're going to stream the log data in using
Flume.

Solr organizes data similarly to the way a SQL database does. Each record is called
a 'document' and consists of fields defined by the schema: just like a row in a
database table. Instead of a table, Solr calls it a 'collection' of documents. The
difference is that data in Solr tends to be more loosely structured. Fields may be
optional, and instead of always matching exact values, you can also enter text
queries that partially match a field, just like you're searching for web pages.
You'll also see Hue refer to 'shards' - and that's just the way Solr breaks
collections up to spread them around the cluster so you can search all your data
in parallel.

The web log data is a standard web server log, which may look something like this:
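Here is an illustrative line in the common Apache combined log format (all values
made up):

192.168.1.105 - - [14/Jun/2014:10:30:13 -0400] "GET /department/apparel/products HTTP/1.1" 200 931 "-" "Mozilla/5.0 (Windows NT 6.1; rv:29.0) Gecko/20100101 Firefox/29.0"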
Here is how you can start real-time indexing via Cloudera Search and Flume over the sample web server log data and use the Search UI in Hue to explore it. First, generate a skeleton Solr configuration:

> cd /opt/examples/flume
> solrctl --zk {{zookeeper_connection_string}}/solr instancedir --generate solr_configs

The result of this command is a skeleton configuration that you can customize to
your liking via conf/schema.xml. Next, to create your collection, upload this
configuration as live_logs:

> solrctl --zk {{zookeeper_connection_string}}/solr instancedir --create live_logs ./solr_configs
The key player in this tutorial is Flume. Flume is a system for collecting,
aggregating, and moving large amounts of log data from many different
sources to a centralized data store.
With a few simple configuration files, we can use Flume and a morphline (a
simple way to accomplish on-the-fly ETL) to load our data into our Solr index.
(Note: You can use Flume to load many other types of data stores; Solr is just
the example we are using for this tutorial.)
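As a rough sketch of what those configuration files look like, the following Flume
agent definition tails the sample log and hands each event to a MorphlineSolrSink.
The agent name, channel sizing, and the morphline.conf path are assumptions for
illustration, not the tutorial's actual files; the morphline file itself (which
parses the log lines and indexes them into Solr) is not shown.

# Illustrative Flume agent (names and paths are assumptions)
agent1.sources = weblog_source
agent1.channels = mem_channel
agent1.sinks = solr_sink

# Tail the sample access log as a stream of events
agent1.sources.weblog_source.type = exec
agent1.sources.weblog_source.command = tail -F /opt/examples/log_files/access.log.2
agent1.sources.weblog_source.channels = mem_channel

# Buffer events in memory between source and sink
agent1.channels.mem_channel.type = memory
agent1.channels.mem_channel.capacity = 10000
agent1.channels.mem_channel.transactionCapacity = 1000

# Run each event through a morphline and index it into Solr
agent1.sinks.solr_sink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
agent1.sinks.solr_sink.channel = mem_channel
agent1.sinks.solr_sink.morphlineFile = /opt/examples/flume/morphline.conf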
If one of these steps fails, please reach out to the Discussion Forum and get help. Otherwise, you can start exploring the log data and understand what is going on.
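For example, once the live_logs collection is selected in Hue's Search app, a query
like the following narrows results to one department over the last three days. The
field names (department, request_date) are taken from the dashboard exercise that
follows, and the department value is made up; the date range uses standard Solr
date math.

department:apparel AND request_date:[NOW-3DAYS TO NOW]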
For our story's sake, we pretend that you started indexing data at the same time as you started ingesting it (via Flume) into the platform, so that when your manager
escalated the issue, you could immediately drill down into data from the last three days and explore what happened. For example, perhaps you noted a lot of
DDoS events and could take the right measures to preempt the attack. Problem solved! Management is fantastically happy with your recent contributions, which
of course leads to a great bonus or something similar. :D
Exercise 4: Building a dashboard
To get started with building a dashboard with Hue, click on the pen icon. This will take you into edit mode, where you can choose different widgets
and layouts that you would like to see. You can choose a few options and
configurations here, but for now, just drag a bar chart into the top gray row.
This will bring up the list of fields that are present in our index so that you can
choose which field you would like to group by. Let's choose request_date. For the
sake of our display, choose +15MINUTES for the INTERVAL.
You aren’t limited to a single column; you can view this as a two-column display as
well. Select the two-column layout from the top left. While you’re here, let’s drag
a pie chart to the newly created row in the left column. This time, let’s choose
department as the field that we want to group by for our pie chart.
Things are really starting to take shape! Let’s add a Facet filter to the left-hand
side and select product as the facet. Now that we are satisfied with our changes,
let’s click on the pencil icon to exit edit mode.
And save our dashboard. At the Hue project’s blog, you can find a wide selection of
video tutorials for accomplishing other tasks in Hue, including a video of a Search
dashboard similar to this example being created.
DataCo has moved into bigger business thanks to the Big Data projects you’ve
contributed to. As more and more users start using the Enterprise Data Hub you
built, it starts getting more complicated to manage and trace data and access to
data. In addition, as your previous deliveries created such success, the company
has decided to build out a full EDH strategy, and as a result lots of sensitive
data is headed for your cluster too: credit card transactions, social security
data, and other financial records for DataCo. Some people are questioning exactly
how the decision was made to change the pricing on the website. You realize this
is the perfect chance to prove yourself again.

Your Management: is worried about security controls and the ability to audit
access for compliance.

You: need to resolve their concerns, and you want to make it easier to manage
who does what on your cluster for back-charging purposes too.

You need to demonstrate that you can:
• Easily show who has queried the data
• Show exactly what has been done with it since it was created
• Enforce policies about how it gets managed in the future

Cloudera Navigator provides a solution to all these problems. If using Cloudera
Live, you can find a link in the Govern Your Data section of your Guidance Page,
which your welcome email will direct you to. If using the QuickStart VM, the
username is ‘cloudera’ and the password is ‘cloudera’.
Exercise 5: Cloudera Navigator
DISCOVERY
The first thing you see when you log into Cloudera Navigator is a search tool. It’s
an excellent way to find data on your cluster, even when you don’t know exactly
what you’re looking for. Go ahead and click the link to ‘explore your data’.

You know that the old web server log data you analyzed was a Hive table, so
select ‘Hive’ under ‘Source Type’, and ‘Table’ under ‘Type’. You’re also pretty
sure it had ‘access log’ in the name, so enter this search query at the top and
hit enter:

*access*log*
LINEAGE
Now that you’ve found the data you were looking for, click on the table and
you’ll see a graph of the data’s lineage. You’ll see the tokenized_access_logs
table on the right and the underlying file with the same name in HDFS in blue.
You’ll also see the other Hive table you created from the original file and the
query you ran to transform the data between the two. (The different colors
represent different source types: yellow data comes from Hive, blue data
comes directly from HDFS.)

As you click on the nodes in this graph, more detail will appear. If you click on
the tokenized_access_logs table and the intermediate_access_logs table,
you’ll see arrows for each individual field running through that query. You can
see how quickly you could trace the origin of datasets even in a much busier
and more complicated environment!
AUDITING
Now you’ve shown where the data came from, but we still need to show what’s
been done with it. Go to the ‘Audits’ tab, using the link in the top-right corner.

As you can see, there are hundreds of events that have been recorded, each with
details of what was done, by whom, and when. Let’s narrow down what we’re
looking for again. Open the “Filters” menu from below the “Audit Events” heading.
Click the + icon twice to add two new filters. For the first filter, set the property
to ‘Username’ and fill in ‘admin’ as the value. For the second filter, set the
property to ‘Operation’ and fill in ‘QUERY’ as the value. Then click ‘Apply’.

As you click on the individual results, you can see the exact queries that were
executed and all related details.

You can also view and create reports based on the results of these searches
in the left-hand corner. There’s already a report called “Recent Denied
Accesses”. If you check that report now, you may see that in the course
of this tutorial, some tools have tried to access a directory called ‘/user/
anonymous’ that we haven’t set up, and that the services don’t have
permission to create.
POLICIES
It’s a relief to be able to audit access to your cluster and see there’s no
unexpected or unauthorized activity going on. But wouldn’t it be even better
if you could automatically apply policies to data? Let’s open the Policies tab in
the top-right corner and create a policy to make the data we just audited
easier to find in the future.

Click the + icon to add a new policy and name it “Tag Insecure Data”.
Check the box to enable the policy, and enter the following as the search query:

(permissions:”rwxrwxrwx”) AND (sourceType:hdfs) AND (type:file OR type:directory) AND (deleted:false)
This query will detect any files in HDFS that allow anyone to read, write, and
execute. It’s common for people to set these permissions to make sure everything
works, but your organization may want to refine this practice as you move into
production or implement more strict practices for some data sets.

To apply this tag on existing data, set the schedule to “Immediate”, and check the
box “Assign Metadata”. Under tags, enter “insecure”, and then click “Add Tag”.
Save the policy.
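If you do decide to tighten such permissions later, the change itself is a one-liner;
a sketch, using an example path from earlier in this tutorial:

> sudo -u hdfs hadoop fs -chmod -R 750 /user/hive/warehouse/original_access_logs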
CONCLUSION
You’ve now experienced how to use Cloudera Navigator for discovery of data and metadata. This
powerful tool makes it easy to audit access, trace data lineage, and enforce policies.
With more data and more data formats available in a multi-tenant environment, data lineage and
governance become more challenging. Cloudera Navigator provides enterprise-grade governance that’s
built into the foundation of Apache Hadoop.
You can learn more about the various management features provided by Cloudera Manager in the
Cloudera Administrator Training for Apache Hadoop.
The End Game
We hope you have enjoyed this basic tutorial, and that you:
• Have a better understanding of some of the popular tools in CDH
• Know how to set up some basic and familiar BI use cases, as well as web log analytics and real-time search
• Can explain to your manager why you deserve a raise!

NEXT STEPS
If you’re ready to install Cloudera’s platform on your own cluster (on premise or in the public cloud), there are a few options:
• Try the AWS Quick Start for easy deployment of Cloudera’s platform on AWS clusters via your own account (promo credit available)
• Try the Azure Test Drive for Cloudera Director (3-hour sandbox to provision EDH clusters on Azure)