Talend Big Data Sandbox: Big Data Insights Cookbook
Contents: Overview | Pre-requisites | Download | Hadoop Distribution | Setup & Configuration | Demo (Scenario)
Using the Talend Real-Time Big Data Platform, this Cookbook provides step-by-step instructions to build and run an end-to-end integration scenario. The demos are built on real-world use cases and demonstrate how Talend, Spark, NoSQL, and real-time messaging can be easily integrated into your daily business. Whether batch, streaming, or real-time integration, you will begin to understand how Talend can be used to address your big data challenges and move your business into the Data-Driven Age.
Virtual Environment
The Talend Real-Time Big Data Sandbox is a virtual environment that combines the Talend Real-Time Big Data Platform and a Hadoop distribution (hosted in Docker containers) with some sample scenarios pre-built and ready to run.

Sandbox Examples
See how Talend can turn data into real-time decisions through ready-to-run sandbox examples that integrate Apache Kafka, Spark, Spark Streaming, Hadoop, and NoSQL.
Talend Platform for Big Data includes a graphical IDE (Talend Studio),
teamwork management, data quality, and advanced big data features.
Follow the steps below to install and configure your Big Data Sandbox:
• Save the downloaded Virtual Machine file to a location on your local PC that is easy to access (e.g. C:/TalendSandbox)
• Follow the instructions below based on the Virtual Machine Player and matching Sandbox file that you are using
• When you start the Talend Big Data Sandbox for the first time, the virtual machine will begin a 10-step process to build the Virtual Environment.
• This process can take 10-20 minutes depending on internet connection speed and network traffic. Pop-up messages will appear on screen to keep you informed of the progress.
Login Info
User: talend
Password: talend
Sudo Password: talend
• Once the Virtual Machine reboots, the Docker components that were installed during the build process will need to initialize.
Choosing a Distribution…
1. Start Firefox.
Note: This demo is available in Local Mode and Distribution Mode. In Local Mode it uses Talend's local Spark engine and the local file system. In Distribution Mode it uses the selected distribution's YARN Resource Manager and HDFS.
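In Spark terms, the difference between the two modes is essentially which master the job is submitted to and which file system URIs it reads. A minimal illustrative sketch (the app names and paths below are invented, not the sandbox's generated code):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class ModeExample {
    public static void main(String[] args) {
        // Local Mode: everything runs in-process on the local Spark engine,
        // so jobs typically read from the local file system.
        SparkConf local = new SparkConf()
                .setAppName("SandboxDemoLocal")
                .setMaster("local[*]");   // use all local cores

        // Distribution Mode: the job is submitted to the cluster's YARN
        // Resource Manager (resolved via HADOOP_CONF_DIR) and reads HDFS.
        SparkConf cluster = new SparkConf()
                .setAppName("SandboxDemoCluster")
                .setMaster("yarn");

        try (JavaSparkContext sc = new JavaSparkContext(local)) {
            // file:///... in Local Mode vs. hdfs://... in Distribution Mode
            long lines = sc.textFile("file:///tmp/sample.txt").count();
            System.out.println("Line count: " + lines);
        }
    }
}
```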
Overview:
In this demo you will see a simple version of making your website an intelligent application, with customers reaching you across channels such as email and streaming.
The following Demo will help you see the value that using Talend can bring to your big data projects:
The Retail Recommendation Demo is designed to illustrate the simplicity and flexibility Talend brings to using Spark in your Big Data Architecture.
Talend Big Data Sandbox
Big Data Insights Cookbook
• Create a Kafka topic to produce and consume real-time streaming data.
• Create a Spark recommendation model based on specific user actions.
• Stream live recommendations to a Cassandra NoSQL database for “Fast Data” access by a WebUI.
If you are familiar with the ALS model, you can update the ALS parameters to enhance the model or just leave the default values.
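For context, those parameters are the standard ALS hyperparameters: the rank, the number of iterations, and the regularization factor. Here is a minimal Spark MLlib sketch of training and querying such a model; the input path, user id, and parameter values are illustrative assumptions, not the demo's actual settings:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.recommendation.ALS;
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel;
import org.apache.spark.mllib.recommendation.Rating;

public class AlsSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("ALSDemo").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Hypothetical input: "userId,productId,rating" lines derived from user actions
            JavaRDD<Rating> ratings = sc.textFile("/tmp/user_actions.csv")
                    .map(line -> {
                        String[] f = line.split(",");
                        return new Rating(Integer.parseInt(f[0]),
                                          Integer.parseInt(f[1]),
                                          Double.parseDouble(f[2]));
                    });

            // The tunable ALS parameters mentioned above:
            int rank = 10;        // number of latent factors
            int iterations = 10;  // more sweeps = better fit, longer runtime
            double lambda = 0.01; // regularization; raise it if the model overfits

            MatrixFactorizationModel model =
                    ALS.train(ratings.rdd(), rank, iterations, lambda);

            // Top 3 product recommendations for user 42 (hypothetical id)
            for (Rating r : model.recommendProducts(42, 3)) {
                System.out.println(r.product() + " -> " + r.rating());
            }
        }
    }
}
```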
Demo Setup:
Before you can execute the Retail Recommendation Demo, you will need to generate the source data and pre-populate the Cassandra lookup tables.
In this job:
• A Kafka Consumer reads in clickstream data.
• The data is fed into the recommendation engine, producing real-time “offers” based on the current user's activity.
• The tWindow component controls how often recommendations are generated.
• The recommendations are sent to three output streams:
  – the execution window, for viewing purposes;
  – the file system, for later processing in your big data analytics environment;
  – Cassandra, for use in a “Fast Data” layer by a WebUI.
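Talend generates Spark Streaming code for this pipeline; a rough hand-written equivalent of the consume-and-window portion might look like the following. The broker address, topic name, and consumer group are assumptions, and the actual model scoring and Cassandra/file-system outputs are elided:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class ClickstreamWindow {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf()
                .setAppName("RecommendationStream").setMaster("local[2]");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "localhost:9092"); // assumed broker
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "reco-demo");

        JavaInputDStream<ConsumerRecord<String, String>> stream =
                KafkaUtils.createDirectStream(
                        jssc,
                        LocationStrategies.PreferConsistent(),
                        ConsumerStrategies.<String, String>Subscribe(
                                Collections.singletonList("clickstream"), kafkaParams));

        // Like tWindow: batch the clicks and emit recommendations every 10 seconds
        stream.map(ConsumerRecord::value)
              .window(Durations.seconds(10), Durations.seconds(10))
              .foreachRDD(rdd -> {
                  // Feed rdd into the recommendation model here, then fan out
                  // to the console, the file system, and Cassandra.
                  System.out.println("Clicks in window: " + rdd.count());
              });

        jssc.start();
        jssc.awaitTermination();
    }
}
```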
With data streaming to the Kafka topic, start the recommendation pipeline:
1. Navigate to the Job Designs folder.
2. Click on Big Data Streaming > Realtime_Recommendation_Demo.
3. Double-click Step_4b_Recommendation_Realtime_Engine_Pipeline 0.1. This opens the job in the designer window.
4. Click Run to start the Recommendation Engine.
Note: Execution of this demo requires a Hadoop distribution. If a distro hasn’t been selected, click here.
Overview:
In this example we will utilize real-time streaming of data through a Kafka queue to track on-field player movements at a sporting event. The pipeline flows from IoT data, to Kafka, to Spark Streaming, to a database, and finally to a REST service feeding a live dashboard.
You will experience:
• Capture IoT data in XML files, then load that data to a Kafka queue for real-time processing.
• Use Spark Streaming technology to quickly calculate player distance and speed as their positions change on the playing field.
• Use a RESTful web service to track player movements in a web-based dashboard.
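The "load to a Kafka queue" step amounts to a plain Kafka producer. A minimal hypothetical sketch, with an invented broker address, topic name, and XML payload shape:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SensorProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Hypothetical XML reading: one position sample per message,
            // keyed by player id so a player's samples stay in order
            String xml = "<reading player=\"7\" x=\"23.4\" y=\"51.0\" ts=\"1700000000123\"/>";
            producer.send(new ProducerRecord<>("player-positions", "7", xml));
        }
    }
}
```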
Execute the Sport Stats Demo:
In this job:
• A Kafka Consumer reads the sensor data.
• A tWindow component controls how often data is read from the Kafka topic; in this case, 10 seconds' worth of data is read every 10 seconds.
• The data is normalized for easier processing.
• Using the tCache components, the process calculates distance and speed based on current and previous player positions.
• The resultant data is sent to two output streams:
  – the execution window, for viewing purposes;
  – a MySQL database, where it is read by a web service to generate dashboard graphics (MySQL runs in a Docker container).
With the web service running, continue to the next step in this demo.
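Conceptually, the tCache-based calculation keeps each player's last position and derives distance and speed from consecutive samples. A plain-Java sketch of that logic (the field layout and units are assumptions):

```java
import java.util.HashMap;
import java.util.Map;

public class PlayerStats {
    // Last known (x, y, timestampMillis) per player, playing the role of tCache
    private final Map<String, double[]> previous = new HashMap<>();

    /** Returns {distanceMeters, speedMetersPerSec}, or null for a player's first sample. */
    public double[] update(String playerId, double x, double y, long tsMillis) {
        // Store the new sample and fetch the one it replaces in a single call
        double[] prev = previous.put(playerId, new double[]{x, y, tsMillis});
        if (prev == null) return null; // no previous position yet

        double dx = x - prev[0];
        double dy = y - prev[1];
        double seconds = (tsMillis - prev[2]) / 1000.0;

        double distance = Math.sqrt(dx * dx + dy * dy);
        double speed = seconds > 0 ? distance / seconds : 0.0;
        return new double[]{distance, speed};
    }
}
```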
Note: Execution of this demo requires a Hadoop distribution. If a distro hasn’t been selected, click here.
Overview:
In this example we demonstrate using native MapReduce to enrich a clickstream dataset and aggregate the results for different web-based dashboards.
You will experience:
• Read and write data to HDFS with simple components from Talend.
• Use Talend's MapReduce components to enrich and analyze data, natively, in Hadoop.
• Feed your analysis data to a graphing tool such as Microsoft Excel or Tableau for stunning displays of the results.
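For reference, the read/write-to-HDFS step that Talend's components wrap is the Hadoop FileSystem API. A minimal sketch; the NameNode address and paths are placeholders:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopy {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; in the sandbox this comes from the chosen distribution
        try (FileSystem fs = FileSystem.get(new URI("hdfs://localhost:8020"), conf)) {
            // Push a local weblog file into HDFS for the MapReduce job to consume
            fs.copyFromLocalFile(new Path("/tmp/weblog.txt"),
                                 new Path("/user/talend/clickstream/weblog.txt"));
            System.out.println("Exists: "
                    + fs.exists(new Path("/user/talend/clickstream/weblog.txt")));
        }
    }
}
```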
Demo Setup:
Load data to HDFS.
With our analysis data in HDFS, we can load it into a Hive table for further querying or import it into a visualization tool. Continue to the next steps of the demo to see how this can be done.
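Outside of Talend, that Hive step could be sketched over JDBC roughly as follows; the HiveServer2 URL, table schema, and HDFS location are illustrative assumptions:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        // Assumed HiveServer2 endpoint and sandbox credentials
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "talend", "");
             Statement stmt = conn.createStatement()) {

            // Point an external table at the analysis output already sitting in HDFS
            stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS weblog_analytics ("
                    + "url STRING, visits INT) "
                    + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
                    + "LOCATION '/user/talend/clickstream/analytics'");

            // Query it like any other table
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT url, visits FROM weblog_analytics "
                    + "ORDER BY visits DESC LIMIT 10")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getInt(2));
                }
            }
        }
    }
}
```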
With our analysis complete, we can pull the raw file from HDFS or even put it into a Hive table for further querying.
1. Navigate to the Job Designs folder.
2. Navigate to Standard > Clickstream_Scenario > Clickstream_Step_by_Step.
3. Double-click Step3_Clickstream_Get_WeblogAnalytics 0.1. This opens the job in the designer window.
4. From the Run tab, click Run to execute.
This file could be imported to MS Excel or other BI tools like Tableau (not included in the Big Data Sandbox) to generate additional
dashboards.
Note: Execution of this demo requires a Hadoop distribution. If a distro hasn’t been selected, click here.
Overview:
In this example we demonstrate how using Talend with Hadoop can speed up and simplify processing large volumes of 3rd-party data. The sample data simulates a Life Sciences prescribing-habits data file from a 3rd-party vendor, moving through an Ingest > Process > Store > Deliver > Decide pipeline between the 3rd-party files and the data warehouse.
You will experience:
• Optimizing your data warehouse by off-loading the ETL overhead to Hadoop and HDFS.
• Fast, pre-load analytics on large-volume datasets.

Large-volume processing: with Talend and Hadoop, you can process gigabytes and terabytes of data in a fraction of the time.
Pre-load analytics: by analyzing large volumes of data BEFORE loading it to your data warehouse, you eliminate the overhead of costly data anomalies in the data warehouse.
ETL off-loading: utilizing Talend with a Hadoop cluster, you can optimize your data warehouse by removing the costly overhead of data processing.
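To make "pre-load analytics" concrete, here is a purely illustrative Spark sketch that deduplicates and aggregates the 3rd-party files before anything reaches the warehouse; the column names and paths are invented:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.sum;

public class PreLoadAnalytics {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("ETLOffload").master("local[*]").getOrCreate();

        // Hypothetical 3rd-party files: prescriber_id, drug, units
        Dataset<Row> raw = spark.read()
                .option("header", "true").option("inferSchema", "true")
                .csv("/tmp/third_party/*.csv");

        // Pre-load analytics: deduplicate and aggregate in Hadoop/Spark
        // BEFORE anything touches the warehouse
        Dataset<Row> summary = raw.dropDuplicates()
                .groupBy(col("prescriber_id"), col("drug"))
                .agg(sum("units").alias("total_units"));

        // Only the small, clean summary is handed to the warehouse load
        summary.write().mode("overwrite").parquet("/tmp/warehouse_staging/summary");
        spark.stop();
    }
}
```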
Note: To quickly and easily see the value of the ETL Off-Load Demo, proceed with the below steps. If you would like a more in-depth
experience and more control over the source data, Click here…
Demo Setup:
To execute this demo, you must first generate the source files for processing within Hadoop.
1. Navigate to the Job Designs folder.
2. Click on Standard > ETL_OffLoad_Scenario.
3. Double-click Step_1_ProductAnalysis_DemoSetup 0.1. This opens the job in the designer window.
4. From the Run tab, click Run to execute.
Demo Setup:
In this step-by-step version of the demo, you will see just how simple it is to work with Talend and Hadoop. You will also have more control over the source data used within the demo for a more personalized experience.

Execute the ETL Off-Load “Step-by-Step” Demo:
When this job is complete, you can view the generated reports from the web browser:
5. Open Firefox.
6. From a new tab, click on Demos > Product Demo > Net Change to view the report.
Note: Execution of this demo requires a Hadoop distribution. If a distro hasn’t been selected, click here.
Overview:
In this example we demonstrate using different big data methods to aggregate and analyze large volumes of weblog data.
You will experience:
• Connect, create, read, and write with Hive components to access data in HDFS.
• Use Talend's MapReduce components to access and analyze data, natively, from HDFS.
• Understand the flexibility of Talend's capabilities to perform the same operations with multiple technologies.
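To make the MapReduce portion concrete, here is a hypothetical hand-written job of the kind Talend's MapReduce components generate for you: counting hits per URL in Apache weblogs. The paths and the log-field position are assumptions:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WeblogHits {
    public static class HitMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(LongWritable key, Text line, Context ctx)
                throws IOException, InterruptedException {
            // Common Log Format: the requested URL is the 7th whitespace field
            String[] fields = line.toString().split(" ");
            if (fields.length > 6) {
                ctx.write(new Text(fields[6]), ONE);
            }
        }
    }

    public static class HitReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text url, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable c : counts) total += c.get();
            ctx.write(url, new IntWritable(total)); // total hits for this URL
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "weblog-hits");
        job.setJarByClass(WeblogHits.class);
        job.setMapperClass(HitMapper.class);
        job.setReducerClass(HitReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/user/talend/weblogs"));
        FileOutputFormat.setOutputPath(job, new Path("/user/talend/weblog_hits"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```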
Execute the Apache Weblog Demo:
Filter the Apache Weblog files and load them to HDFS.
Note: While the data is loaded into HDFS, it is also saved in a location where the created Hive table expects its data, so you can view the data both through a Hive query and by browsing the files in HDFS.
Conclusion

Simplify Big Data Integration
Talend vastly simplifies big data integration, allowing you to leverage in-house resources to use Talend's rich graphical tools that generate big data code (Spark, MapReduce, Pig, Java) for you. Talend is based on standards such as Eclipse, Java, and SQL, and is backed by a large collaborative community. So you can upskill existing resources instead of finding new ones.

Built for Batch and Real-Time Big Data
Talend is built for batch and real-time big data. Unlike other solutions that “map” to big data or support only a few components, Talend is the first data integration platform built on Spark, with over 100 Spark components. Whether integrating batch (MapReduce, Spark), streaming (Spark), NoSQL, or real-time data, Talend provides a single tool for all your integration needs. Talend's native Hadoop data quality solution delivers clean and consistent data at infinite scale.

Lower Operating Costs
Talend lowers operating costs. Talend's zero-footprint solution takes the complexity out of integration, deployment, management, and maintenance. A usage-based subscription model provides a fast return on investment without large upfront costs.