How-To - Build A Real-Time Search System Using StreamSets, Apache Kafka, and Cloudera Search
(http://blog.cloudera.com/wp-
content/uploads/2016/02/streamsets-f2.png)
For this initial pipeline, we configure the Directory origin to
read delimited data with a header line, and the Kafka
destination to write JSON data. We also specify the location
of the files to ingest and the Kafka topic that receives the
data.
(http://blog.cloudera.com/wp-
content/uploads/2016/02/streamsets-f3.png)
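To make the format conversion concrete, here is a minimal Python
sketch of what "delimited with a header line in, JSON out" means
for each record. It is not StreamSets code and does not talk to
Kafka; the function name and the sample columns are hypothetical,
chosen only for illustration.

```python
import csv
import io
import json


def delimited_to_json_lines(text, delimiter=","):
    """Parse delimited text whose first line is a header.

    Returns one JSON string per data row, keyed by the header names.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [json.dumps(row) for row in reader]


# Hypothetical sample; the real loan files have many more columns.
sample = "loan_id,state,amount\n1001,CA,5000\n1002,NY,12000\n"

for message in delimited_to_json_lines(sample):
    # Each string corresponds to the value of one Kafka message.
    print(message)
```

Note that every value comes out as a string; any type conversion
would be a separate step in the pipeline.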
We must also configure how to handle errors in the pipeline. A
record may be classified as an error for many different
reasons: perhaps data came in the wrong format, or a field
required for a transformation was missing. When a record is
marked as an error, it is sent to an error destination, which
could be another pipeline or a Kafka topic. If you’re not sure
what to do with error records, you can simply discard them.
(http://blog.cloudera.com/wp-
content/uploads/2016/02/streamsets-f4.png)
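The idea of an error destination can be sketched in a few lines
of Python. This is an illustration of the routing concept only,
not how StreamSets implements it; the function name, the
"required fields" check, and the shape of the error entries are
assumptions made for the example.

```python
def route_records(records, required_fields):
    """Split records into (good, errors).

    A record is treated as an error when any required field is
    missing or empty; the error entry keeps the original record
    together with the reason it was rejected.
    """
    good, errors = [], []
    for record in records:
        missing = [f for f in required_fields if not record.get(f)]
        if missing:
            errors.append({
                "record": record,
                "error": "missing fields: " + ", ".join(missing),
            })
        else:
            good.append(record)
    return good, errors


# Hypothetical records: the second one lacks a zip code.
records = [
    {"loan_id": "1001", "zip_code": "94105"},
    {"loan_id": "1002", "zip_code": ""},
]
good, errors = route_records(records, ["loan_id", "zip_code"])
print(len(good), "good,", len(errors), "error(s)")
```

The errors list stands in for the error destination: it could be
written to another topic for later inspection, or simply dropped.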
Another type of transformation and data enrichment that this
pipeline handles is mapping from a zip code to a
latitude/longitude pair. When a pipeline needs proprietary
logic or some other complex transformation, it makes sense to
use the extensibility capabilities of StreamSets. For this
pipeline, we’ve downloaded a mapping dictionary available
online (http://federalgovernmentzipcodes.us/) and
written a Python script that does the lookup and adds
fields to store the latitude and longitude.
(http://blog.cloudera.com/wp-
content/uploads/2016/02/streamsets-f5.png)
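The script used in the pipeline is not reproduced in this post.
As a rough standalone sketch of the same lookup, assuming the
mapping dictionary is a CSV file with columns named Zipcode, Lat,
and Long, and that each record carries a zip_code field (all of
these names, and the sample coordinates, are assumptions for
illustration):

```python
import csv
import io


def load_zip_table(csv_text):
    """Build a dict mapping zip code -> (latitude, longitude)."""
    table = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        table[row["Zipcode"]] = (float(row["Lat"]), float(row["Long"]))
    return table


def add_coordinates(record, zip_table):
    """Return a copy of the record with latitude/longitude added.

    Records whose zip code is not in the table are returned
    unchanged, so a later stage can decide how to handle them.
    """
    enriched = dict(record)
    coords = zip_table.get(record.get("zip_code"))
    if coords is not None:
        enriched["latitude"], enriched["longitude"] = coords
    return enriched


# Illustrative values only, not taken from the real dictionary.
zip_table = load_zip_table("Zipcode,Lat,Long\n94105,37.79,-122.39\n")
print(add_coordinates({"loan_id": "1001", "zip_code": "94105"}, zip_table))
```

Loading the table once and reusing it for every record keeps the
per-record cost to a single dictionary lookup.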
Finally, we’ve separated out the accepted and rejected loans,
with accepted loans going to Cloudera Search and rejected
loans being archived on HDFS. Notably, the HDFS location
can be parameterized with field values or timestamps, which
can make the HDFS destination useful for loading data into
partitioned Apache Hive tables.
(http://blog.cloudera.com/wp-
content/uploads/2016/02/streamsets-f6.png)
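To illustrate the routing and the parameterized HDFS location,
here is a small Python sketch. StreamSets has its own
configuration for both; this only shows the idea. The status and
state field names, the directory layout, and the function names
are hypothetical.

```python
from datetime import datetime


def route_loan(record):
    """Pick a destination: accepted loans go to search, the rest to HDFS."""
    return "search" if record.get("status") == "accepted" else "hdfs"


def hdfs_partition_path(base_dir, record, event_time):
    """Build a Hive-style partition directory from a field value and a timestamp."""
    return "{}/state={}/dt={}".format(
        base_dir.rstrip("/"),
        record.get("state", "unknown"),
        event_time.strftime("%Y-%m-%d"),
    )


# Hypothetical rejected loan and archive location.
loan = {"loan_id": "2001", "status": "rejected", "state": "CA"}
if route_loan(loan) == "hdfs":
    print(hdfs_partition_path("/data/loans/rejected", loan, datetime(2016, 2, 1)))
```

Directories named key=value are what a partitioned Hive table
expects, which is why a parameterized path makes loading simpler.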
As data arrives, we can use HUE
(http://www.cloudera.com/products/apache-hadoop/hue.html)
to build dashboards on the index, and get some more
information about these loans. In this dashboard, we’ve
plotted the number of loans being issued from each state, as
well as a comparison between income brackets and the status
of the loan (paid off, delinquent, etc.). We can use this
information, along with the rest of the data that we’re
continuously ingesting, to make better loan investment
decisions.
(http://blog.cloudera.com/wp-
content/uploads/2016/02/streamsets-f7.png)
Conclusion
The Hadoop ecosystem has a wide array of tools and
technologies for building solutions. In this post, you’ve learned
how to piece together complementary ingestion technologies
like StreamSets and Kafka to bring data in real time to
analytics and search infrastructure like Solr, and finally
visualize that data with HUE. The combination of these
technologies provides an end-to-end solution for enabling
data scientists and analysts to better serve themselves, and
to get faster access to data that is critical to them.