
Talend Big Data Sandbox

Big Data Insights Cookbook



Contents: Overview | Pre-requisites | Setup & Configuration | Hadoop Distribution | Demo (Scenario)

About this cookbook

What is the Talend Cookbook?

Using the Talend Real-Time Big Data Platform, this Cookbook provides step-by-step instructions to build and run an end-to-end integration scenario. The demos are built on real-world use cases and demonstrate how Talend, Spark, NoSQL and real-time messaging can be easily integrated into your daily business. Whether batch, streaming or real-time integration, you will begin to understand how Talend can be used to address your big data challenges and move your business into the Data-Driven Age.

What is the Big Data Sandbox?

Virtual Environment: The Talend Real-Time Big Data Sandbox is a virtual environment that combines the Talend Real-Time Big Data Platform with a Hadoop distribution (hosted by Docker containers) and sample scenarios, pre-built and ready-to-run.

Sandbox Examples: See how Talend can turn data into real-time decisions through sandbox examples that integrate Apache Kafka, Spark, Spark Streaming, Hadoop and NoSQL.

What Pre-requisites are required to run Sandbox?

Talend Platform for Big Data includes a graphical IDE (Talend Studio), teamwork management, data quality, and advanced big data features. To see a full list of features, please visit Talend's website: http://www.talend.com/products/real-time-big-data

An internet connection is required for the entire setup process.

You will need a Virtual Machine player such as VMware or Virtualbox, which can be downloaded here:
• VMware Player Site
• Virtualbox Site

Follow the VM Player install instructions from the provider.

The recommended host machine should have:
• Memory: 8-10GB
• Disk Space: 20GB (5GB is for the image download)

Download the Sandbox Virtual Machine file:
https://info.talend.com/prodevaltpbdrealtimesandbox.html

How do I set up & configure the Sandbox?

Follow the steps below to install and configure your Big Data Sandbox:
• Save the downloaded Virtual Machine file to a location on your local PC that is easy to access (e.g. C:/TalendSandbox)
• Follow the instructions below based on the Virtual Machine Player and matching Sandbox file that you are using

Virtualbox:
1. Open Virtualbox.
2. From the menu bar, select File > Import Appliance…
3. Navigate to the .ova file that you downloaded. Select it and click Next.
4. Accept the default Appliance Settings by clicking Import.

VMware Player:
1. Open VMware Player.
2. Click on "Open a Virtual Machine".
3. Navigate to the .ova file that you downloaded. Select it and click Open.
4. Select the Storage path for the new Virtual Machine (e.g. C:/TalendSandbox/vmware) and then click Import.
Note: The Talend Big Data Sandbox Virtual Machines come pre-configured to run with 8GB RAM and 2 CPUs. You may need to adjust these settings based on your PC's capabilities. While not pre-configured, it is also recommended to enable a Sound Card/Device before starting the VM to take advantage of the Tutorial Videos within the Virtual Environment.

Starting the VM for the first time…

• When you start the Talend Big Data Sandbox for the first time, the virtual machine will begin a 10-step process to build
the Virtual Environment.

• This process can take 10-20 mins depending on internet connection speeds and network traffic. Popup messages will be
present on screen to keep you informed of the progress.

• Once the Sandbox has completed its build process, it will automatically reboot.

Login Info
User: talend
Password: talend
Sudo Password: talend

Starting the VM for the first time (cont.)

• Once the Virtual Machine reboots, the Docker Components that were installed during the build process will need to
initialize.

• Additional Popup messages will appear to inform you of the progress.

• When complete, a message will show that the System is Ready!


Starting Talend Studio for the first time…

1. Start Talend Studio by double-clicking on the desktop icon or single-clicking on the Unity Bar icon.
2. Click I Accept the End User License Agreement.
3. Click on Manage Connections and enter your email address, then click OK.
4. Select the Base_Project – java project and click Finish.
5. Once Studio loads, close the Welcome Screen.
6. Install Additional Talend Packages. Select Required third-party libraries and Optional third-party libraries and click Finish.
7. A popup will display all 3rd-party licenses that need acceptance. Click the "I accept the terms of the selected license agreement" radio button and click Accept all.
8. Let the downloads complete before continuing. A second popup may require you to repeat Step 7.

Choosing a Distribution…

Note: It is not necessary to download a Distribution to evaluate the Sandbox. Click Here to begin now!

Follow the steps below to install a Hadoop Distribution in the Talend Big Data Sandbox:

1. Start Firefox.
2. Choose the Distribution you would like to evaluate with the Talend Platform.
3. Be patient as the Virtual Machine downloads and installs the selected Distribution. Each distribution is approx. 2.5GB. (Notifications will indicate progress.)
4. The Virtual Machine will reboot when the installation is complete.
5. Upon reboot, watch the Distribution and other Docker Containers come online through WeaveScope. A link is available in the bookmarks.

Note: Be sure to watch the available Tutorial Videos for more information on the Sandbox.
Retail Recommendation Demo

Note: This demo is available in Local Mode and Distribution Mode. In Local Mode it utilizes Talend’s Local Spark Engine and Local File
System. In Distribution Mode, it utilizes the selected Distro’s Yarn Resource Manager and HDFS File System.
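
In raw Spark terms, the Local/Distribution switch corresponds to the job's master setting. Below is a minimal sketch, assuming Spark 2.x naming (earlier releases used "yarn-client"/"yarn-cluster"); Talend sets this for you from the job's Run configuration.

```java
import org.apache.spark.SparkConf;

public class ModeSwitchSketch {
    public static void main(String[] args) {
        // Local Mode: Talend's local Spark engine -- one JVM, local file paths.
        SparkConf localMode = new SparkConf()
                .setAppName("RecommendationDemo")
                .setMaster("local[*]");

        // Distribution Mode: submit to the distro's YARN Resource Manager;
        // input/output paths then point at HDFS (hdfs://...) instead of local disk.
        SparkConf distributionMode = new SparkConf()
                .setAppName("RecommendationDemo")
                .setMaster("yarn");
    }
}
```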

Overview:
In this demo you will see a simple version of making your website an Intelligent Application.

You will experience:
• Building a Spark Recommendation Model.
• Setting up a new Kafka topic to help simulate live web traffic coming from live web users browsing a retail web store.
• Most important, you will see first-hand how, with Talend, you can take streaming data and turn it into real-time recommendations to help improve shopping cart sales.

[Architecture diagram: customer channels (email, website, store, shopping cart) and internal systems (POS, clickstream) stream into the Spark engine, which writes recommendation updates to a NoSQL store.]

The Retail Recommendation Demo is designed to illustrate the simplicity and flexibility Talend brings to using Spark in your Big Data Architecture. The following demo will help you see the value that using Talend can bring to your big data projects.

This Demo highlights:

• Kafka: Create a Kafka topic to produce and consume real-time streaming data.
• Machine Learning: Create a Spark recommendation model based on specific user actions.
• Spark Streaming: Stream live recommendations to a Cassandra NoSQL database for "Fast Data" access by a WebUI.

If you are familiar with the ALS model, you can update the ALS parameters to enhance the model or just leave the default values.
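
Those parameters map onto Spark MLlib's ALS API. Below is a minimal sketch of the equivalent hand-written code, to show what the Talend component spares you from writing; the ratings file path and parameter values are illustrative assumptions, not the demo's actual settings.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.recommendation.ALS;
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel;
import org.apache.spark.mllib.recommendation.Rating;

public class AlsModelSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("AlsModelSketch").setMaster("local[*]"));

        // Hypothetical ratings file; each line: userId,productId,rating
        JavaRDD<Rating> ratings = sc.textFile("ratings.csv").map(line -> {
            String[] f = line.split(",");
            return new Rating(Integer.parseInt(f[0]),
                              Integer.parseInt(f[1]),
                              Double.parseDouble(f[2]));
        });

        // The ALS knobs the Talend component exposes:
        int rank = 10;        // number of latent features
        int iterations = 10;  // ALS passes over the data
        double lambda = 0.01; // regularization strength
        MatrixFactorizationModel model =
                ALS.train(ratings.rdd(), rank, iterations, lambda);

        // Persist the model locally (or to HDFS) for the streaming job to load.
        model.save(sc.sc(), "target/recommendationModel");
        sc.stop();
    }
}
```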

Demo Setup:
Before you can execute the Retail Recommendation Demo, you will need to generate the source data and pre-populate the Cassandra Lookup Tables.

1. Navigate to the Job Designs folder.
2. Click on Standard Jobs > Realtime_Recommendation_Demo.
3. Double-click on Step_1a_Recommendation_DemoSetup 0.1. This opens the job in the designer window.
4. From the Run tab, click on Run to execute.
5. When the job is finished, repeat steps 1-4 for Step_1b_Recommendation_DemoSetup 0.1.

Execute the Retail Recommendation Demo:

Create a Kafka Topic (a sketch of the underlying admin call follows these steps):

1. Navigate to the Job Designs folder.
2. Click on Standard Jobs > Realtime_Recommendation_Demo.
3. Double-click on Step_2_Recommendation_Create_KafkaTopic 0.1. This opens the job in the designer window.
4. From the Run tab, click on Run to execute.
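
For context, creating a topic boils down to one admin call against the Kafka broker; this is what the Talend job does for you. A minimal sketch using the modern kafka-clients AdminClient API (which may be newer than the Kafka version bundled in the sandbox); the broker address, topic name, and sizing are illustrative assumptions.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 1 partition, replication factor 1 -- enough for a single-node sandbox.
            NewTopic topic = new NewTopic("clickstream_demo", 1, (short) 1);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```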

Now you can generate the recommendation model by loading the product ratings data into the Alternating Least Squares (ALS) algorithm. Rather than coding a complex algorithm in Scala, a single Spark component available in Talend Studio simplifies the model creation process. The resultant model can be stored in HDFS or, in this case, locally.

Execute the Retail Recommendation Demo:


Generate a Recommendation Model using Spark:

1. Navigate to the Job Designs folder.
2. Click on Big Data Batch > Realtime_Recommendations_Demo.
3. Double-click on Step_3_Recommendation_Build_Model_Spark 0.1. This opens the job in the designer window.
4. From the Run tab, click on Run to execute.

With the Recommendation model created, your lookup tables populated and your Kafka topic ready to consume data, you can now stream your Clickstream data into your Recommendation model and put the results into your Cassandra tables for reference from a WebUI.


Execute the Retail Recommendation Demo:

1. Navigate to the Job Designs folder.
2. Click on Standard Jobs > Realtime_Recommendations_Demo.
3. Double-click on Step_4a_Recommendation_Push_to_Kafka 0.1. This opens the job in the designer window.
4. From the Run tab, click on Run to execute.

This job is set up to simulate real-time streaming of web traffic and clickstream data into a Kafka topic, which will then be consumed by our recommendation engine to produce our recommendations.

After starting the Push to Kafka job, continue to the next steps of the demo.

Execute the Retail Recommendation Demo:

In this job:
• A Kafka Consumer reads in Clickstream Data.
• The data is fed into the Recommendation Engine, producing real-time "offers" based on the current user's activity.
• The tWindow component controls how often recommendations are generated.
• The recommendations are sent to 3 output streams (a sketch of the consuming side of such a pipeline follows this list):
 Execution window, for viewing purposes
 File System, for later processing in your Big Data Analytics environment
 Cassandra, for use in a "Fast Data" layer by a WebUI
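
What the Talend job wires together graphically corresponds roughly to a Spark Streaming pipeline like the minimal sketch below, using the spark-streaming-kafka-0-10 integration. The topic name, broker address, and 10-second batch interval are illustrative assumptions, and the ALS scoring and Cassandra/file-system writes are reduced here to a console print.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class RecoPipelineSketch {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("RecoPipeline").setMaster("local[*]");
        // The micro-batch interval plays the role of the tWindow component.
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "localhost:9092");
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "reco-engine");
        kafkaParams.put("auto.offset.reset", "latest");

        JavaInputDStream<ConsumerRecord<String, String>> clicks =
                KafkaUtils.createDirectStream(
                        jssc,
                        LocationStrategies.PreferConsistent(),
                        ConsumerStrategies.<String, String>Subscribe(
                                Arrays.asList("clickstream_demo"), kafkaParams));

        // Stand-in for: score each click against the ALS model, then write the
        // recommendations to the console, the file system, and Cassandra.
        clicks.map(ConsumerRecord::value).print();

        jssc.start();
        jssc.awaitTermination();
    }
}
```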
With data streaming to the Kafka topic, start the recommendation pipeline:

1. Navigate to the Job Designs folder.
2. Click on Big Data Streaming > Realtime_Recommendation_Demo.
3. Double-click on Step_4b_Recommendation_Realtime_Engine_Pipeline 0.1. This opens the job in the designer window.
4. Click on Run to start the Recommendation Engine.

Execute the Retail Recommendation Demo:

Watch the execution output window. You will now see your real-time data coming through, with recommended products based on your Recommendation Model.

Recommendations are also written to a Cassandra database so they can be referenced by a WebUI to offer, for instance, last-minute product suggestions when a customer is about to check out.

 Once you have seen the results, you can Kill the Recommendation Engine and the Push to Kafka jobs to stop the streaming recommendations.
Sport Stats Demo

Note: Execution of this demo requires a Hadoop distribution. If a distro hasn’t been selected, click here.

Overview:
In this example we will utilize real-time streaming of data through a Kafka queue to track on-field player movements at a sporting event.

You will experience:
• Creating and populating a Kafka queue with real-time streaming data from an IoT device (i.e. field camera sensors).
• Using Spark Streaming technology to calculate speed and distance traveled by individual players.
• Charting player speed and distance in a real-time web-based dashboard.

[Pipeline diagram: Ingest → Process → Store → Deliver → Visualize, with a database feeding the dashboard.]

This Demo highlights:

• IoT data to Kafka: Capture IoT data in XML files, then load that data to a Kafka queue for real-time processing.
• Spark Streaming: Use Spark Streaming technology to quickly calculate player distance and speed as their positions change on the playing field.
• REST Service to Live Dashboard: Use a RESTful web service to track player movements in a web-based dashboard.

Execute the Sport Stats Demo:

Create a Kafka topic from which live data will stream:

1. Navigate to the Job Designs folder.
2. Click on Standard > Realtime_SportStats_Demo.
3. Double-click on Step_1_SportStats_Create_KafkaTopic 0.1. This opens the job in the designer window.
4. From the Run tab, click on Run to execute.

Execute the Sport Stats Demo:

Read data from an XML file (generated by sensor readings, for example) and populate the Kafka topic:

1. Navigate to the Job Designs folder.
2. Click on Standard > Realtime_SportStats_Demo.
3. Double-click on Step_2_SportStats_Read_Dataset 0.1. This opens the job in the designer window.
4. From the Run tab, click on Run to execute.

This step simulates live player-tracking data being fed to a Kafka topic.

Execute the Sport Stats Demo:

In this job:
• A Kafka Consumer reads the sensor data.
• A tWindow component controls how often data is read from the Kafka topic – in this case, 10 seconds' worth of data is read every 10 seconds.
• The data is normalized for easier processing.
• Using the tCache components, the process calculates distance and speed based on current and previous player positions (a sketch of that calculation follows this list).
• The resultant data is sent to 2 output streams:
 Execution window, for viewing purposes
 MySQL database, where it will be read by a web service to generate dashboard graphics (MySQL is running on a Docker container)
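
The distance/speed arithmetic is simple once current and previous readings are paired up, which is the tCache components' job. A minimal plain-Java sketch, assuming planar field coordinates in meters and timestamps in milliseconds (the class and field names are hypothetical):

```java
public class PlayerKinematics {
    // One sensor reading: field position in meters, timestamp in milliseconds.
    record Position(String playerId, double x, double y, long timestampMs) {}

    // Straight-line distance between consecutive readings, in meters.
    static double distanceM(Position prev, Position curr) {
        double dx = curr.x() - prev.x();
        double dy = curr.y() - prev.y();
        return Math.sqrt(dx * dx + dy * dy);
    }

    // Average speed over the interval, in meters per second.
    static double speedMps(Position prev, Position curr) {
        double dtSeconds = (curr.timestampMs() - prev.timestampMs()) / 1000.0;
        return dtSeconds > 0 ? distanceM(prev, curr) / dtSeconds : 0.0;
    }

    public static void main(String[] args) {
        Position prev = new Position("player7", 10.0, 20.0, 1_000);
        Position curr = new Position("player7", 13.0, 24.0, 2_000);
        // 3-4-5 triangle: 5 m covered in 1 s -> 5 m/s.
        System.out.printf("distance=%.1f m, speed=%.1f m/s%n",
                distanceM(prev, curr), speedMps(prev, curr));
    }
}
```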

1. Navigate to the Job Designs folder.
2. Click on Big Data Streaming > Realtime_SportStats_Demo.
3. Double-click on Step_3_SportStats_LiveStream 0.1. This opens the job in the designer window.
4. From the Run tab, click on Run to execute.

After starting the SportStats Live Stream, continue to the next steps of the demo.

Execute the Sport Stats Demo:

Start the Web Service to populate the Sport Stats web-based dashboard:

1. Navigate to the Job Designs folder.
2. Click on Standard > Realtime_SportStats_Demo.
3. Double-click on Step_4_SportStats_WebService 0.1. This opens the job in the designer window.
4. From the Run tab, click on Run to execute.

With the Web Service running, continue to the next step in this demo.

Execute the Sport Stats Demo:

Watch the Live Dashboard reflect player movements with real-time updates:

1. Open the Firefox web browser.
2. On the Bookmarks toolbar, click on Demos > SportStats Demo.

 Once you have seen the results, back in Talend Studio, you can Kill both the Web Service job and the Live Streaming job.
Clickstream Demo

Note: Execution of this demo requires a Hadoop distribution. If a distro hasn’t been selected, click here.

Overview:
In this example we demonstrate using native MapReduce to enrich a dataset and aggregate the results for different web-based dashboards.

You will experience:
• Data loading to HDFS.
• Using MapReduce to enrich and aggregate data within the Hadoop environment.
• Use of 3rd-party graphing tools to generate a web-based dashboard of the calculated results.

[Pipeline diagram: Clickstream → Ingest → Process → Store → Deliver → Visualize.]

This Demo will highlight:

• HDFS: Read and write data to HDFS with simple components from Talend (see the sketch after this list).
• Native MapReduce: Use Talend's MapReduce components to enrich and analyze data, natively, in Hadoop.
• Insights: Feed your analysis data to a graphing tool such as Microsoft Excel or Tableau for stunning displays of the results.
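
Reading and writing HDFS from Java comes down to the Hadoop FileSystem API, which is what such components drive internally. A minimal sketch; the NameNode URI and paths are illustrative assumptions for the sandbox.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed sandbox NameNode address; adjust to your distribution.
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:8020"), conf);

        // Write a small lookup file to HDFS.
        Path path = new Path("/user/talend/clickstream_demo/lookup.txt");
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("CA,California\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back, line by line.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
            in.lines().forEach(System.out::println);
        }
        fs.close();
    }
}
```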

Demo Setup:
Load data to HDFS:

1. Navigate to the Job Designs folder.
2. Click on Standard > Clickstream_Scenario > Pre_Requirements.
3. Double-click on LoadWeblogs 0.1. This opens the job in the designer window.
4. From the Run tab, click on Run to execute.

When this job completes, look-up files will be uploaded to HDFS for use by the MapReduce jobs.

Execute the Clickstream Demo:

The result of this process is aggregated data indicating the product interests of different areas across the United States, for visualization within a Google Chart.

1. Navigate to the Job Designs folder.
2. Click on Standard > Clickstream_Scenario.
3. Double-click on Step_1_Clickstream_MasterJob 0.1. This opens the job in the designer window.
4. From the Run tab, click on Run to execute.

Note: If asked to Download and Install additional Jar files, click on Download and Install.

Execute the Clickstream Demo:

View the data in HDFS:

1. Open Firefox.
2. Click on the bookmarked link titled HDFS Browser.
3. In the Utilities dropdown, select Browse the File System and navigate to /user/talend/clickstream_demo/output/results.
4. To view the data file, you must download it from HDFS. This can be done right within the web browser by clicking on part-00000 and choosing download.

Execute the Clickstream Demo:

View the analytics dashboard:

1. Open the Firefox web browser.
2. On the Bookmarks toolbar, click on Demos > Clickstream Demo.
3. Mouse over the states to see the counts.

Execute the Clickstream Demo:

Additional analysis can be done to calculate the age and gender of users accessing specific links.

1. Navigate to the Job Designs folder.
2. Click on Big Data Batch > Clickstream_Scenario.
3. Double-click on Step_1_Clickstream_SimpleFile_Omniture_MR 0.1. This opens the job in the designer window.
4. From the Run tab, click on Run to execute.

The results of this job can be found in the HDFS File Browser: /user/talend/clickstream_demo/output/results

With our analysis data in HDFS, we can load it into a Hive table for further querying or import it to a visualization tool. Continue to the next steps of the demo to see how this can be done.
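
Once the results sit behind a Hive table, any JDBC client can run that further querying. A minimal sketch with the Hive JDBC driver; the host, port, credentials, table, and column names are illustrative assumptions for the sandbox.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // HiveServer2 default port; user "talend" matches the sandbox login.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "talend", "");
             Statement stmt = conn.createStatement();
             // Hypothetical table holding the clickstream analysis output.
             ResultSet rs = stmt.executeQuery(
                     "SELECT state, hits FROM weblog_analytics LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString("state") + "\t" + rs.getLong("hits"));
            }
        }
    }
}
```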

Execute the Clickstream Demo:

With our analysis complete, we can pull the raw file from HDFS or even put it into a Hive table for further querying.

1. Navigate to the Job Designs folder.
2. Navigate to Standard > Clickstream_Scenario > Clickstream_Step_by_Step.
3. Double-click on Step3_Clickstream_Get_WeblogAnalytics 0.1. This opens the job in the designer window.
4. From the Run tab, click on Run to execute.

The results of this job can be found on the local VM file system: /home/talend/Documents/Clickstream/webloganalytics.csv

 This file could be imported to MS Excel or other BI tools like Tableau (not included in the Big Data Sandbox) to generate additional dashboards.
ETL Off-Load Demo

Note: Execution of this demo requires a Hadoop distribution. If a distro hasn’t been selected, click here.

Overview:
In this example we demonstrate how using Talend with Hadoop can speed up and simplify processing large volumes of 3rd-party data. The sample data simulates a Life Sciences prescribing-habits data file from a 3rd-party vendor.

You will experience:
• Optimizing your data warehouse by off-loading the ETL overhead to Hadoop and HDFS.
• Fast, pre-load analytics on large-volume datasets.
• Multiple reports from the same datasets, to make informed and intelligent business decisions that could decrease spend or increase revenue.

[Pipeline diagram: 3rd-party files → Ingest → Process → Store → Deliver → Decide, loading the warehouse and sending reports back to the 3rd party.]

This Demo will highlight:

• Large-volume processing: With Talend and Hadoop, you can process gigabytes and terabytes of data in a fraction of the time.
• Pre-load Analytics: By analyzing large volumes of data BEFORE loading it to your Data Warehouse, you eliminate the overhead of costly data anomalies in the Data Warehouse.
• ETL Off-loading: Utilizing Talend with a Hadoop cluster, you can optimize your Data Warehouse by removing the costly overhead of data processing.

Note: To quickly and easily see the value of the ETL Off-Load Demo, proceed with the below steps. If you would like a more in-depth
experience and more control over the source data, Click here…

Demo Setup:

To execute this demo, you must first generate the source files for processing within Hadoop:

1. Navigate to the Job Designs folder.
2. Click on Standard > ETL_OffLoad_Scenario.
3. Double-click on Step_1_ProductAnalysis_DemoSetup 0.1. This opens the job in the designer window.
4. From the Run tab, click on Run to execute.

When this job completes, source files will reside on the Virtual Machine to be processed by the demo. Additionally, an initial report will have been generated within HDFS, against which the demo will compare its analysis.

Execute the ETL Off-Load “One-Click” Demo:

In this “One-Click” version of the demo:


• Source files are placed in HDFS.
• MapReduce is used to collectively analyze all the compressed files.
• The resultant analysis is then compared to the previous months results and reports are generated.
• The generated reports are then sent to the Google Charts API for a graphical representation of the data.
• The resultant reports can be viewed in a web browser:
 Product by Physician shows the number of prescriptions a physician has written for a particular drug
 Net Change shows the total number of prescriptions for a particular drug across all physicians
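
For readers who have not written raw MapReduce, the aggregation behind these reports looks roughly like the sketch below: map each prescription record to a (physician, drug) key and sum the counts in the reducer. The record layout and all names are illustrative assumptions; Talend generates and submits the real job for you.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RxCountSketch {

    public static class RxMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            // Assumed record layout: physicianId,drugName,...
            String[] f = line.toString().split(",");
            ctx.write(new Text(f[0] + "\t" + f[1]), ONE);
        }
    }

    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            long total = 0;
            for (LongWritable c : counts) total += c.get();
            ctx.write(key, new LongWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "rx-count-sketch");
        job.setJarByClass(RxCountSketch.class);
        job.setMapperClass(RxMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input dir
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```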

1. Navigate to the Job Designs folder.
2. Click on Standard > ETL_OffLoad_Scenario.
3. Double-click on Step_2_ProductAnalysis_MapReduce 0.1. This opens the job in the designer window.
4. From the Run tab, click on Run to execute.

Click Here to finish this demo…



Demo Setup:

In this step-by-step version of the demo, you will see just how simple it is to work with Talend and Hadoop. You
will also have more control over the source data used within the demo for a more personalized experience.

1. Navigate to the Job Designs folder.
2. Click on Standard > ETL_OffLoad_Scenario > Pre_Requirements_Step_by_Step.
3. Double-click on PreStep_1_Generate_Mock_Rx_Data 0.1. This opens the job in the designer window.
4. From the Run tab, click on Run to execute.
5. When the job starts, edit the values as you wish (staying within the suggested parameters and keeping in mind you are working in a virtual environment with limited space) or leave the default values. Click OK when done.

Demo Setup (cont.):


Once you have generated your Mock Rx data, you will need to initialize the Hadoop environment with comparison data – in this case, the "Previous Month" analysis.

1. Navigate to the Job Designs folder.
2. Click on Standard > ETL_OffLoad_Scenario > Pre_Requirements_Step_by_Step.
3. Double-click on PreStep_2_PrepEnvironment 0.1. This opens the job in the designer window.
4. From the Run tab, click on Run to execute.

Note: Once this job completes, you are ready to execute the step-by-step ETL Off-Load Demo.

Execute the ETL Off-Load “Step-by-Step” Demo:


With the demo environment setup complete, we can begin examining the ETL Off-Load process.

1. Navigate to the Job Designs folder.
2. Click on Standard > ETL_OffLoad_Scenario > ProductAnalysis_Step_by_Step.
3. Double-click on Step_1_PutFiles_on_HDFS 0.1. This opens the job in the designer window.
4. From the Run tab, click on Run to execute.

When this job is complete, you will have your custom-generated source files on HDFS. To view the files:

5. Open Firefox.
6. Click on the HDFS Browser link on the Bookmarks Toolbar.
7. Select Browse the file system from the Utilities dropdown.
8. Navigate to /user/talend/Product_demo/Input

Execute the ETL Off-Load "Step-by-Step" Demo:

Now that your source data is in HDFS, we can use the power of Hadoop and MapReduce to analyze the large dataset.

1. Navigate to the Job Designs folder.
2. Click on Big Data Batch > ETL_OffLoad_Scenario.
3. Double-click on Step_2_Generate_MonthlyReport_mr 0.1. This opens the job in the designer window.
4. From the Run tab, click on Run to execute.

When this job is complete, you can again navigate to the Hadoop file system to view the generated file:

5. Open Firefox.
6. Click on the HDFS Browser link on the Bookmarks Toolbar.
7. Select Browse the file system from the Utilities dropdown.
8. Navigate to /user/talend/Product_demo/Output

Execute the ETL Off-Load "Step-by-Step" Demo:

With the Previous Month analysis as our baseline, we can now compare our Current Month analysis and track any anomalies.

1. Navigate to the Job Designs folder.
2. Click on Standard > ETL_OffLoad_Scenario > ProductAnalysis_Step_by_Step.
3. Double-click on Step_3_Month_Over_Month_Comparison 0.1. This opens the job in the designer window.
4. From the Run tab, click on Run to execute.

When this job is complete, you can again navigate to the Hadoop file system to view the generated files:

5. Open Firefox.
6. Click on the HDFS Browser link on the Bookmarks Toolbar.
7. Select Browse the file system from the Utilities dropdown.
8. Navigate to /user/talend/Product_demo/Output

Execute the ETL Off-Load "Step-by-Step" Demo:

The final step is to generate the charts using the Google Charts API.

1. Navigate to the Job Designs folder.
2. Click on Standard > ETL_OffLoad_Scenario > ProductAnalysis_Step_by_Step.
3. Double-click on Step_4_GoogleChart_Product_by_Unit 0.1. This opens the job in the designer window.
4. From the Run tab, click on Run to execute.

When this job is complete, you can view the generated reports from the web browser:

5. Open Firefox.
6. From a new tab, click on Demos > Product Demo > Net Change to view the report.
7. Repeat Step 6, this time choosing Product by Physician, to open that report.

Execute the ETL Off-Load "Step-by-Step" Demo:

Reset the demo and run it again! You can run this demo over and over and get different results by changing the source files.

1. Navigate to the Job Designs folder.
2. Click on Standard > ETL_OffLoad_Scenario.
3. Double-click on Step_3_ProductAnalysis_DemoReset 0.1. This opens the job in the designer window.
4. From the Run tab, click on Run to execute.

Run through this demo again!
Apache Weblog Demo

Note: Execution of this demo requires a Hadoop distribution. If a distro hasn’t been selected, click here.

Overview:
In this example we demonstrate using different Big Data methods to aggregate and analyze large volumes of weblog data.

You will experience:
• Using Hive to store and access data in a Hadoop Distributed File System.
• Using standard MapReduce to analyze and count IP addresses in an Apache log file.
• Performing the same analysis (count of IP addresses in an Apache log file) using Pig.

[Pipeline diagram: Web logs → Ingest → Process → Store → Deliver.]

This Demo will highlight:

• Hive Components: Connect, create, read and write with Hive components to access data in HDFS.
• Native MapReduce: Use Talend's MapReduce components to access and analyze data, natively, from HDFS.
• Pig Components: Understand the flexibility of Talend's capabilities to perform the same operations with multiple technologies.

Execute the Apache Weblog Demo:

Create the Hive tables in HDFS and clear out old datasets:

1. Navigate to the Job Designs folder.
2. Click on Standard > ApacheWebLog.
3. Double-click on Step_1_ApacheWebLog_HIVE_Create 0.1. This opens the job in the designer window.
4. From the Run tab, click on Run to execute.

When this job completes, old datasets from previous executions will have been cleaned up and a fresh Hive table will be generated in HDFS.

Execute the Apache Weblog Demo:

Filter the Apache Weblog files and load them to HDFS:

1. Navigate to the Job Designs folder.
2. Click on Standard > ApacheWebLog.
3. Double-click on Step_2_ApacheWeblog_Load 0.1. This opens the job in the designer window.
4. From the Run tab, click on Run to execute.

This job filters "301" codes from the weblog and loads the data to HDFS, where it can be viewed by both the HDFS file browser and a Hive query.

Execute the Apache Weblog Demo:

View the data in HDFS:

1. Open Firefox.
2. Click on the bookmarked link titled HDFS Browser.
3. In the Utilities dropdown, select Browse the File System and navigate to /user/talend/weblog.

Note: While the data is loaded into HDFS, it is also saved in a location where the created Hive table is expecting data. So now you can view the data both through a Hive query and through HDFS file browsing.

Execute the Apache Weblog Demo:

Use MapReduce to analyze and calculate the distinct IP count:

1. Navigate to the Job Designs folder.
2. Click on Big Data Batch > ApacheWebLog.
3. Double-click on Step_3_ApacheWeblog_Count_IP_MR 0.1. This opens the job in the designer window.
4. From the Run tab, click on Run to execute.
5. When the job completes, you can view the results that are output to the Job Execution window.

 The data from this job is also saved to HDFS. In the HDFS File Browser, navigate to /user/talend/weblogMR/mr_apache_ip_out to see the new files.
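
The counting logic itself is tiny once the log lines are split: in Apache's common log format the client IP is the first whitespace-delimited field. A minimal plain-Java sketch of the same aggregation the MapReduce job performs at scale (the sample lines are made up):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class IpCountSketch {
    public static void main(String[] args) {
        List<String> logLines = List.of(
                "192.168.0.7 - - [10/Oct/2016:13:55:36 -0700] \"GET /index.html HTTP/1.1\" 200 2326",
                "192.168.0.7 - - [10/Oct/2016:13:55:40 -0700] \"GET /cart HTTP/1.1\" 200 512",
                "10.0.0.3 - - [10/Oct/2016:13:56:01 -0700] \"GET /index.html HTTP/1.1\" 301 0");

        // Key = client IP (first field of the common log format), value = hit count.
        Map<String, Long> hitsPerIp = new HashMap<>();
        for (String line : logLines) {
            String ip = line.split(" ", 2)[0];
            hitsPerIp.merge(ip, 1L, Long::sum);
        }

        hitsPerIp.forEach((ip, hits) -> System.out.println(ip + "\t" + hits));
        System.out.println("distinct IPs: " + hitsPerIp.size());
    }
}
```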

Execute the Apache Weblog Demo:

Perform the same distinct IP count using Pig:

1. Navigate to the Job Designs folder.
2. Click on Standard > ApacheWebLog.
3. Double-click on Step_4_ApacheWeblog_Count_IP_Pig 0.1. This opens the job in the designer window.
4. From the Run tab, click on Run to execute.

 The data from this job is also saved to HDFS. In the HDFS File Browser, navigate to /user/talend/weblogPIG/apache_ip_cnt to see the new files.

Conclusion

Simplify Big Data Integration
Talend vastly simplifies big data integration, allowing you to leverage in-house resources to use Talend's rich graphical tools that generate big data code (Spark, MapReduce, Pig, Java) for you. Talend is based on standards such as Eclipse, Java, and SQL, and is backed by a large collaborative community. So you can up-skill existing resources instead of finding new resources.

Built for Batch and Real-time Big Data
Talend is built for batch and real-time big data. Unlike other solutions that "map" to big data or support a few components, Talend is the first data integration platform built on Spark, with over 100 Spark components. Whether integrating batch (MapReduce, Spark), streaming (Spark), NoSQL, or in real-time, Talend provides a single tool for all your integration needs.

Lower Operating Costs
Talend lowers operations costs. Talend's zero-footprint solution takes the complexity out of integration, deployment, management, and maintenance. A usage-based subscription model provides a fast return on investment without large upfront costs. Talend's native Hadoop data quality solution delivers clean and consistent data at infinite scale.
