
Cloudera Engineering Blog (http://blog.cloudera.com/)
Best practices, how-tos, use cases, and internals from Cloudera Engineering and the community
How-to: Build a Real-Time Search System using StreamSets, Apache Kafka, and Cloudera Search
(https://blog.cloudera.com/blog/2016/02/how-to-build-a-real-time-search-system-using-streamsets-apache-kafka-and-cloudera-search/)

February 16, 2016 | By Justin Kestelyn (https://blog.cloudera.com/blog/author/jkestelyn/) (@kestelyn)

Categories: Cloudera Manager (https://blog.cloudera.com/blog/category/cloudera-manager/), Guest (https://blog.cloudera.com/blog/category/guest/), How-to (https://blog.cloudera.com/blog/category/how-to/), Hue (https://blog.cloudera.com/blog/category/hue/), Kafka (https://blog.cloudera.com/blog/category/kafka/), Search (https://blog.cloudera.com/blog/category/search/)
Thanks to Jonathan Natkins, a field engineer from StreamSets, for the guest post below about using StreamSets Data Collector—open source, GUI-driven ingest technology for developing and operating data pipelines with a minimum of code—and Cloudera Search and HUE to build a real-time search environment.

As pressure mounts on data engineers to deliver more data from more sources in less time, StreamSets Data Collector (https://streamsets.com/product/) can serve as a linchpin in the data management process, helping them simplify ingest pipeline development and operations across the rapidly evolving ecosystem of big data tools and technology. In this post, we'll create a pipeline for ingesting loan data to show you how to use StreamSets Data Collector and Cloudera Search (http://www.cloudera.com/products/apache-hadoop/apache-solr.html) to build a real-time search environment.
(https://blog.cloudera.com/
Use Case

StreamSets is an open source (http://github.com/streamsets/datacollector), Apache-licensed system for building continuous ingestion pipelines. The StreamSets Data Collector provides ETL-upon-ingest capabilities while enabling custom-code-free integration between a wide variety of source data systems (like relational databases, Amazon S3, or flat files) and destination systems within the Hadoop ecosystem. StreamSets is easy to install via downloadable (http://streamsets.com/opensource) tarballs or using the Cloudera Manager Custom Service Descriptor (CSD) (http://www.cloudera.com/documentation/enterprise/latest/topics/cm_mc_addon_services.html).

(http://blog.cloudera.com/wp-content/uploads/2016/02/streamsets-f1.png)
blog/tag/developers/)
Lending Club is a company that provides peer-to-peer loans. A user can request a loan, and then the loans are crowd-funded by investors. Peer-to-peer lending has become an extremely hot space, especially as similar platforms like Kickstarter have gained traction. Investors take on huge amounts of risk, however, if they invest in loans that they don't understand well. Fortunately, Lending Club provides publicly available data (https://www.lendingclub.com/info/download-data.action) about the loans it issues, as well as their current performance and returns. Using StreamSets and Cloudera Search, one can leverage this data to better understand how to find loans worth investing in.

Unfortunately, Lending Club doesn't provide a truly live feed, but we can simulate one easily using StreamSets and Apache Kafka (http://www.cloudera.com/products/apache-hadoop/apache-kafka.html). We'll leverage StreamSets to load data from flat files into Kafka, and then use StreamSets again to consume the data from Kafka and send it to Cloudera Search and HDFS.

For the sake of brevity, the data files have been downloaded to a server running a StreamSets Data Collector, and some minor processing has been done to remove a one-line preamble from the top of each of the CSV files.
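That preamble removal amounts to dropping one line per file. A minimal sketch in Python; the filename and preamble text below are illustrative, not the exact Lending Club contents:

```python
from pathlib import Path

# Illustrative: recreate a CSV with a one-line preamble above the header
# row (as in the Lending Club downloads), then strip just that first line.
path = Path("LoanStats3a.csv")
path.write_text("Notes offered by Prospectus\nid,loan_amnt,state\n1,5000,CA\n")

lines = path.read_text().splitlines(keepends=True)
path.write_text("".join(lines[1:]))  # drop the preamble, keep the header row

print(path.read_text().splitlines()[0])  # id,loan_amnt,state
```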
(https://blog.cloudera.com/
Loading the Loan Data into Kafka

Kafka is a high-throughput message-queueing system that has become widely used for building publish-subscribe systems within the Apache Hadoop ecosystem. A major benefit of using Kafka as an intermediate datastore is that it makes it very easy to replay ingestion and analysis, as well as making it significantly easier to consume datasets across multiple applications. However, a common challenge with using Kafka is that the primary methods of producing and consuming data require writing custom code against its APIs.

We can use StreamSets to graphically build a pipeline that will load data into a Kafka topic. We can also use this pipeline to do a little work to canonicalize our data format. Generally, it is a best practice to have a common data format within a Hadoop deployment for ease of building follow-on applications, and for this deployment, we've chosen JSON. One benefit JSON gives us over CSV is that CSV files depend heavily on column ordering; converting to JSON will help us avoid any potential column-ordering issues later on.
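The ordering point is easy to see in a few lines of Python (this is not SDC code, and the field names are made up): once CSV rows are parsed against their header line, serializing them as JSON keys every value by name, so downstream consumers no longer care which column came first.

```python
import csv
import io
import json

# Two CSV snippets with the same fields in different column orders.
csv_a = "loan_amnt,state\n5000,CA\n"
csv_b = "state,loan_amnt\nNY,9000\n"

# DictReader pairs each value with its header name, erasing column order.
records = []
for raw in (csv_a, csv_b):
    records.extend(csv.DictReader(io.StringIO(raw)))

lines = [json.dumps(r, sort_keys=True) for r in records]
print(lines[0])  # {"loan_amnt": "5000", "state": "CA"}
print(lines[1])  # {"loan_amnt": "9000", "state": "NY"}
```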
blog/tag/python/)
Configuring a StreamSets Pipeline

StreamSets pipelines avoid custom code by providing general-purpose connectors that are configuration-driven. A StreamSets Data Collector may run many pipelines; each pipeline has a single data origin, but may have one or more destinations. To load the loan data into Kafka, we will build a very simple pipeline that has a Directory origin and a Kafka destination.

A key concept in StreamSets is the StreamSets Data Collector (SDC) Record. When data is read into a pipeline, it is parsed into an SDC Record. Having a common record format within the pipeline enables transformations to be built in a generic fashion, so that they can operate on any record that comes through, regardless of schema. When the data is sent to a destination, it is serialized to a target data format (when applicable).
blog/tag/support-2/)
(http://blog.cloudera.com/wp-content/uploads/2016/02/streamsets-f2.png)

For this initial pipeline, the Directory origin will be configured to read in delimited data with a header line, and the Kafka destination will be configured to output JSON data. We will also specify the location of the files to ingest, and the Kafka topic to which we send the data.

(http://blog.cloudera.com/wp-content/uploads/2016/02/streamsets-f3.png)
We must also configure how to handle errors in the pipeline. A
record may be classified as an error for many different
reasons: perhaps data came in the wrong format, or a field
required for a transformation was missing. When a record is
marked as an error, it is sent off to an error destination, which
could be another pipeline or a Kafka topic. If you’re not sure
what to do with error records, they can always be discarded.
Handling Varying Record Types and Preparing Data for Search
On the other side of Kafka, we’ll use another StreamSets
pipeline to consume data from the Kafka topic and build up
an index in Cloudera Search. Oftentimes data is received in a
less-than-pristine format, and very frequently, it’s necessary to
do some amount of pre-processing or transformations to get
the data into a consumption-ready format.
StreamSets can be used to perform row-oriented
transformations as the data is ingested. A good way to think
about the types of transformations that StreamSets can
handle is to think of a pipeline as a continuous map-only job.
For this example, the pipeline has been designed to perform a
handful of transformation operations.
One interesting challenge is that the accepted and rejected
loan files that came from Lending Club have different
schemas. All the data is in CSV format, but accepted loans
have upwards of 50 fields, while rejected loans only have nine.
Since StreamSets parses each record individually, we don’t
have to make any changes to the pipeline to handle the
different record types. However, one transformation we’ll put
in place is to canonicalize some of the field names between
the two record types, using a Field Renamer processor. This
will allow us to perform transformations on semantically
identical fields, regardless of the schema.
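A rough Python equivalent of that renaming step; the mappings below are hypothetical stand-ins, not the actual Lending Club column names or the Field Renamer's configuration syntax:

```python
# Hypothetical mapping from schema-specific field names onto one canonical
# set, so later transformations work on either record type. This mirrors
# what the Field Renamer processor does, in plain Python.
RENAMES = {
    "Amount Requested": "loan_amnt",  # rejected-loan style name (assumed)
    "Zip Code": "zip_code",           # rejected-loan style name (assumed)
    "addr_state": "state",            # accepted-loan style name (assumed)
}

def canonicalize(record: dict) -> dict:
    # Rename known fields; pass unknown fields through untouched.
    return {RENAMES.get(k, k): v for k, v in record.items()}

rejected = {"Amount Requested": "1000", "Zip Code": "481xx"}
print(canonicalize(rejected))  # {'loan_amnt': '1000', 'zip_code': '481xx'}
```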
(http://blog.cloudera.com/wp-content/uploads/2016/02/streamsets-f4.png)
Another type of transformation and data enrichment that this
pipeline handles is mapping from a zip code to a
latitude/longitude pair. Occasionally, when it is necessary to
build proprietary logic or some other complex transformation
into a pipeline, it makes sense to use some of the extensibility
capabilities of StreamSets to fulfill those needs. In the case of
this pipeline, we’ve downloaded a mapping dictionary
available online (http://federalgovernmentzipcodes.us/) and
written a Python script to do the lookup and create some
additional fields to store the latitude and longitude data.
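A sketch of that lookup in Python. The coordinate table here is two made-up rows rather than the full dictionary from the site above, and the prefix-matching rule is an assumption to handle masked zip codes (e.g. "941xx"):

```python
import csv
import io

# Illustrative stand-in for the downloaded zip-code dictionary.
ZIP_CSV = "zip,lat,lon\n48104,42.27,-83.73\n94105,37.79,-122.39\n"
ZIP_TO_LATLON = {row["zip"]: (float(row["lat"]), float(row["lon"]))
                 for row in csv.DictReader(io.StringIO(ZIP_CSV))}

def enrich(record: dict) -> dict:
    # Assumed rule: match masked zips ("941xx") on their three-digit prefix
    # and attach the first dictionary entry that shares it.
    prefix = record.get("zip_code", "")[:3]
    for z, (lat, lon) in ZIP_TO_LATLON.items():
        if z.startswith(prefix):
            record["latitude"], record["longitude"] = lat, lon
            break
    return record

print(enrich({"zip_code": "941xx"})["latitude"])  # 37.79
```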
(http://blog.cloudera.com/wp-content/uploads/2016/02/streamsets-f5.png)
Finally, we’ve separated out the accepted and rejected loans,
with accepted loans going to Cloudera Search and rejected
loans being archived on HDFS. Notably, the HDFS location
can be parameterized with field values or timestamps, which
can make the HDFS destination useful for loading data into
partitioned Apache Hive tables.
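The layout such a parameterized destination produces can be sketched in a few lines; note that SDC's HDFS destination uses its own expression language for directory templates, so this is only an illustration of the resulting partition structure, with an assumed base path:

```python
from datetime import datetime, timezone

# Illustrative: a timestamp-driven directory layout in the year=/month=/day=
# form that partitioned Apache Hive tables expect.
record_time = datetime(2016, 2, 16, tzinfo=timezone.utc)
path = record_time.strftime("/archive/rejected_loans/year=%Y/month=%m/day=%d/")
print(path)  # /archive/rejected_loans/year=2016/month=02/day=16/
```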
Starting Up the Pipelines and Getting Some Results
Once the two StreamSets pipelines are started, data will start
to flow into the configured Cloudera Search index.
(http://blog.cloudera.com/wp-content/uploads/2016/02/streamsets-f6.png)
As data arrives, we can use HUE
(http://www.cloudera.com/products/apache-hadoop/hue.html)
to build dashboards on the index, and get some more
information about these loans. In this dashboard, we’ve
plotted the number of loans being issued from each state, as
well as a comparison between income brackets and the status
of the loan (paid off, delinquent, etc.). We can use this
information, along with the rest of the data that we’re
continuously ingesting, to make better loan investment
decisions.
(http://blog.cloudera.com/wp-content/uploads/2016/02/streamsets-f7.png)

Conclusion
The Hadoop ecosystem has a wide array of tools and
technologies for building solutions. In this post, you’ve learned
how to piece together complementary ingestion technologies
like StreamSets and Kafka to bring data in real-time to
analytics and search infrastructure like Solr, and finally
visualize that data with HUE. The combination of these
technologies provides an end-to-end solution for enabling
data scientists and analysts to better serve themselves, and
to get faster access to data that is critical to them.
Tags: CSD (https://blog.cloudera.com/blog/tag/csd/), StreamSets (https://blog.cloudera.com/blog/tag/streamsets/)