Voltage SecureData Developer Templates for Hadoop 5.0 (July 2022 Update)
Integration Guide
July 2022
Legal notices
© Copyright 2011, 2014, 2016-2020, 2022 Micro Focus or one of its affiliates.
The only warranties for products and services of Micro Focus and its affiliates and licensors (“Micro Focus”) are as may be set forth in the
express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional
warranty. Micro Focus shall not be liable for technical or editorial errors or omissions contained herein. The information contained herein is
subject to change without notice.
Except as specifically indicated otherwise, this document contains confidential information and a valid license is required for possession, use
or copying. If this work is provided to the U.S. Government, consistent with FAR 12.211 and 12.212, Commercial Computer Software,
Computer Software Documentation, and Technical Data for Commercial Items are licensed to the U.S. Government under vendor's standard
commercial license.
Contents
Building the DataStream Developer Templates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-19
Build Prerequisites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-20
Editing the Properties in the Parent Maven POM File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-20
Build Steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-24
Chapter 3: Common Infrastructure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-1
Authentication and Authorization Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-1
Kerberos Delegation Token HDFS Location . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-3
Configuration Step Summary for Kerberos Authentication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-3
Additional Kerberos Steps on the KDC Computer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-4
Configuration Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-5
Domain Name . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-6
Hostname . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-7
Simple API Install Path . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-7
Simple API Policy URL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-8
Simple API Cache Type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-9
Simple API File Cache Path . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-10
Simple API Short FPE Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-10
Web Service Batch Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-11
REST Hostname . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-11
Authentication/Authorization Failure on Access Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . 3-12
Product Name . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-14
Product Version . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-15
Component-Specific Designator for Client ID . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-17
Identity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-17
API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-18
CryptId Name . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-19
Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-20
CryptId AuthId Name . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-20
Translator Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-21
Translator Initialization Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-22
CryptId Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-22
Component-Specific Designator for Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-23
Field Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-23
Field Name . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-24
CryptId Name for Field . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-25
Delegation Token HDFS Path . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-26
Authentication/Authorization Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-26
Shared Secret . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-28
Username . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-29
Password . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-30
AuthId Name . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-31
XML Configuration Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-32
vsconfig.xml . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-33
High-Level Elements in vsconfig.xml . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-33
Attribute Values in vsconfig.xml . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-37
vsauth.xml . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-38
High-Level Elements in vsauth.xml . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-39
Element and Attribute Values in vsauth.xml . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-40
vs<component>.xml . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-41
High-Level Elements in vs<component>.xml . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-42
Attribute Values in vs<component>.xml . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-43
Java Properties Configuration Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-45
vsnifi.properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-45
Specifying the Location of the XML Configuration Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-47
-D Generic Option to Specify a Property Value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-47
Config-Locator Properties File Packaged as a JAR File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-48
Precedence When Checking for XML Configuration File Locations . . . . . . . . . . . . . . . . . . . . 3-50
Other Approaches to Providing Configuration Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-52
Shared Integration Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-54
Utility and Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-55
Package Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-55
Multiple Developer Template TrustStores - Background and Usage . . . . . . . . . . . . . . . . 3-56
Authentication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-56
Package Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-57
Common Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-57
Package Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-57
Package Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-58
Hadoop Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-60
Package Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-61
Package Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-62
Shared Code for the DataStream Developer Templates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-63
Package Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-64
Package Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-64
Data Conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-65
Package Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-65
Package Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-66
Data Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-67
Package Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-67
Package Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-67
Cryptographic Abstraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-68
Package Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-69
Package Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-69
Using Old Versions of Other Voltage SecureData Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-71
Shared Sample Data for the Hadoop Developer Templates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-73
plaintext.csv . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-73
creditscore.csv . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-74
Common Procedures for the Hadoop Developer Templates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-75
Common HDFS Procedures for the Hadoop Developer Templates . . . . . . . . . . . . . . . . . . . . 3-75
Creating a Home Directory in HDFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-75
Loading Hadoop Developer Template Files into HDFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-76
Loading Updated Configuration Files into HDFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-77
Common Procedures for Working with Kerberos Authentication . . . . . . . . . . . . . . . . . . . . . . 3-78
Prerequisites for Using Kerberos Authentication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-78
The Hadoop Developer Templates Delegation Token Scripts . . . . . . . . . . . . . . . . . . . . . . 3-79
Getting Your Kerberos Ticket Granting Ticket . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-83
Getting And Storing Your Kerberos Delegation Token . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-83
Run Your Hadoop Developer Template Job or Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-84
Optional Destruction of the Delegation Token . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-84
Kerberos Authentication When Beeline/HiveServer2 Impersonation is Disabled . . . . 3-84
Logging and Error Handling in the Hadoop Developer Templates . . . . . . . . . . . . . . . . . . . . . . . 3-87
Handling Empty and Net-Empty Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-88
Known Limitations of the Developer Templates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-91
Hadoop Developer Template Code Needs a Full CSV Parser . . . . . . . . . . . . . . . . . . . . . . . . . . 3-91
Additional Verification of Converter and Translator Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-91
Additional Robustness for Java Properties Configuration File Parsing . . . . . . . . . . . . . . . . . 3-91
Chapter 4: MapReduce Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-1
Integration Architecture of the MapReduce Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-1
Batch Processing for MapReduce . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-3
Configuration Settings for the MapReduce Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-4
Updating Field Settings for the MapReduce Developer Template . . . . . . . . . . . . . . . . . . . . . . 4-5
Running the MapReduce Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-6
Chapter 5: Hive Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-1
History of Hive Support in the Hive Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-2
Hive Developer Template 3.1 and Earlier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-2
Hive Developer Template 3.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-2
Hive Developer Template 4.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-3
Hive Developer Template 4.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-3
Hive Developer Template 4.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-4
Hive Developer Template 5.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-4
Different Types of Hive UDFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-4
Hive UDFs for Formatted Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-5
Hive UDFs for Binary Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-6
Special Considerations When Using the Binary Hive UDFs . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-8
Integration Architecture of the Hive Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-12
Java Classes for the Hive UDFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-13
Creating and Calling Hive UDFs from the Hive Prompt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-14
Limitations of the Hive UDFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-15
Hive UDFs Work on One Data Value at a Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-15
Hive UDF Failures with Literal Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-16
Hive UDF Failures with HortonWorks HDP 2.2 and later . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-18
Major Changes in Hive 3.0 and the Script Changes They Required . . . . . . . . . . . . . . . . . 5-18
No Batch Processing for Hive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-19
Integration with Apache Impala . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-19
Limitations and Special Requirements When Using Impala . . . . . . . . . . . . . . . . . . . . . . . . . 5-20
Configuration Settings for the Hive Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-21
Example of Updating Settings in the HIVE Section . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-22
Running the Hive Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-23
Setting Up to Run the Hive Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-24
Edit the Hive Scripts to Replace <username> Placeholder . . . . . . . . . . . . . . . . . . . . . . . . . . 5-24
Copy the Required JAR Files to Your Data Nodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-25
Create the Hive Tables for the Hive Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-25
Enable Impersonation for Beeline and Remote Hive Queries . . . . . . . . . . . . . . . . . . . . . . . . 5-26
Running Queries Locally From a Node Within Your Hadoop Cluster . . . . . . . . . . . . . . . . . . . 5-27
Running a JOIN Query Using a Script . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-27
Running a Binary Data Query Using a Script . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-28
Running a Simple HiveQL Query Interactively . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-28
Running Hive Queries Using the Beeline Shell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-29
Running a Hive Query from a Remote Computer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-30
Creating Permanent Hive UDFs Using the Hive Command Line . . . . . . . . . . . . . . . . . . . . 5-31
Running a Remote Query Using JDBC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-31
Running a Remote Query Using ODBC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-38
Using the Generic Hive UDFs When Impersonation is Disabled . . . . . . . . . . . . . . . . . . . . . . . . 5-40
Running Queries in the Context of Hive LLAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-43
Running Queries Using Apache Impala . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-44
Copy JAR Files to the Nodes Running the Impala Daemon . . . . . . . . . . . . . . . . . . . . . . . . . 5-44
Preparing Protected Data to Load into Impala Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-44
Create the Impala Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-46
Create the Impala UDFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-47
Run Impala Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-48
Drop Impala UDFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-48
Chapter 6: Sqoop Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-1
Integration Architecture of the Sqoop Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-2
Batch Processing for Sqoop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-3
Configuration Settings for the Sqoop Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-4
Example of Updating Settings in the SQOOP Section . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-5
Running the Sqoop Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-6
Load Sample Data into MySQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-7
Generate an ORM JAR File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-7
Import and Protect Your Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-9
Display the Protected Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-10
Advanced Sqoop Import Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-10
Non-Batched Version of the Sqoop Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-12
Determine Whether the Sqoop Integration Supports a Sqoop Import Option . . . . . . . 6-13
Chapter 7: Spark Integration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-1
Integration Architecture of the Spark Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-3
RDD and Dataset Variants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-4
RDD and Dataset Driver Functionality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-4
RDD and Dataset Processor Functionality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-5
UDF-Based Variants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-8
DataFrame, Spark SQL, and HiveUDF Driver Functionality . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-8
DataFrame, Spark SQL, and HiveUDF Processor Functionality . . . . . . . . . . . . . . . . . . . . . . 7-11
Logging and Error Handling in the Spark Developer Template . . . . . . . . . . . . . . . . . . . . . . . . 7-12
Configuration Settings for the Spark Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-12
Alternative Approaches to Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-14
Sample Data for the Spark Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-14
Running the Spark Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-14
Run-Time Prerequisites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-15
Changing the Input and Output Locations and Filenames on HDFS . . . . . . . . . . . . . . . . . . . 7-15
Steps to Run the Spark Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-16
Hadoop Distribution Dependencies for Running the Sample Jobs . . . . . . . . . . . . . . . . . . . . . 7-18
Spark Script Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-19
run-spark-prepare-job . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-19
run-spark-protect-rdd-job . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-20
run-spark-protect-dataset-job . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-20
run-spark-protect-dataframe-job . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-20
run-spark-protect-sql-job . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-20
run-spark-protect-hive-job . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-20
run-spark-access-rdd-job . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-21
run-spark-access-dataset-job . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-21
run-spark-access-dataframe-job . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-21
run-spark-access-sql-job . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-21
run-spark-access-hive-job . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-21
run-pyspark-protect-dataframe-job . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-22
run-pyspark-protect-sql-job . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-22
run-pyspark-protect-hive-job . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-22
run-pyspark-access-dataframe-job . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-23
run-pyspark-access-sql-job . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-23
run-pyspark-access-hive-job . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-23
update-spark-config-files-in-hdfs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-24
TrustStores Used by the Spark Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-24
Using the Spark Web Server . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-24
Limitations of the Spark Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-26
Chapter 8: NiFi Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-1
Quick Start Using the Provided NiFi Workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-2
Exercising the SecureDataExample Workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-6
Using the Workflow With a Different Voltage SecureData Server . . . . . . . . . . . . . . . . . . . . . . . 8-7
Integration Architecture of the NiFi Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-8
Processor Classes for the NiFi Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-8
Configuration Classes for the NiFi Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-9
Logging and Error Handling in the NiFi Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . 8-10
Configuration Settings for the NiFi Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-10
Configuring the Properties of the NiFi SecureDataProcessor . . . . . . . . . . . . . . . . . . . . . . . . . . 8-12
Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-12
Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-13
Identity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-14
Auth Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-15
SharedSecret . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-15
Username . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-15
Password . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-15
API Type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-16
SecureDataProcessor Property Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-16
SecureDataProcessor Relationship Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-17
Sample Data for the NiFi Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-17
Adding the SecureDataProcessor to a Blank Workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-18
Limitations and Simplifications of the NiFi Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . 8-20
Chapter 9: StreamSets Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-1
Quick Start Using the Provided StreamSets Pipelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-2
Exercising the Sample Pipelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-4
Using the Pipelines with a Different Voltage SecureData Server . . . . . . . . . . . . . . . . . . . . . . . . 9-6
Integration Architecture of the StreamSets Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . 9-7
Processor Classes for the StreamSets Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-8
Configuration Classes Used by the StreamSets Developer Template . . . . . . . . . . . . . . . . . . . 9-9
Logging and Error Handling in the StreamSets Developer Template . . . . . . . . . . . . . . . . . . . 9-9
Configuration Settings for the StreamSets Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . 9-10
Configuring the Settings of the Voltage SDProcessor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-11
Operation Type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-11
Config Location . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-11
Field to Process and CryptId Pairs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-11
Sample Data for the StreamSets Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-11
Adding the Voltage SDProcessor to a Blank Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-12
Creating a New Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-12
Adding and Configuring an Origin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-13
Adding and Configuring the Voltage SDProcessor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-15
Adding and Configuring a Destination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-16
Previewing the Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-18
Running the Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-20
Limitations of the StreamSets Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-21
Chapter 10: Kafka Connect Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-1
Integration Architecture of the Kafka Connect Developer Template . . . . . . . . . . . . . . . . . . . . . . 10-2
Transformation Classes for the Kafka Connect Developer Template . . . . . . . . . . . . . . . . . . . 10-3
Configuration Classes Used by the Kafka Connect Developer Template . . . . . . . . . . . . . . . 10-4
Kafka Connect Developer Template Transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-4
Logging in the Kafka Connect Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-5
Configuration Settings for the Kafka Connect Developer Template . . . . . . . . . . . . . . . . . . . . . . . 10-5
Kafka Connect Java Properties Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-5
connect-standalone-worker.properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-5
connect-file-source-protect.properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-6
connect-file-sink-access.properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-7
connect-file-sink.properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-8
Kafka Connect Developer Template XML Configuration Files . . . . . . . . . . . . . . . . . . . . . . . . . 10-9
Configuration File Location . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-10
Sample Data for the Kafka Connect Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-11
Running the Kafka Connect Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-12
Run-Time Prerequisites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-12
Editing the Distribution-Specific Run-Time Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-12
kafkaBrokerList . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-12
kafkaServerPropsFile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-12
kafkaBinDir . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-13
Steps to Run the Kafka Connect Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-13
Variations on Running the Kafka Connect Developer Template . . . . . . . . . . . . . . . . . . . 10-15
Script Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-16
create-kafka-topic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-16
delete-kafka-topic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-16
run-kafka-connect-protect-transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-16
vsdistrib.properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-17
Limitations of the Kafka Connect Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-17
Chapter 11: Kafka-Storm Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-1
Integration Architecture of the Kafka-Storm Developer Template . . . . . . . . . . . . . . . . . . . . . . . . 11-2
Kafka Producer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-3
Protect and Access Bolts, and Storm Topology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-3
Configuration Classes Used by the Kafka-Storm Developer Template . . . . . . . . . . . . . . . . . 11-4
Overview of the Storm Bolt Lifecycle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-5
Logging and Error Handling in the Kafka-Storm Developer Template . . . . . . . . . . . . . . . . . 11-6
Configuration Settings for the Kafka-Storm Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . 11-7
Alternative Approaches to Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-8
Sample Data for the Kafka-Storm Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-10
Running the Kafka-Storm Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-10
Run-Time Prerequisites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-11
Editing the Distribution-Specific Run-Time Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-11
kafkaBrokerList . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-11
kafkaServerPropsFile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-11
kafkaBinDir . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-12
hdfsOutDir . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-12
Steps to Run the Kafka-Storm Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-12
Running the Kafka Console-Producer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-15
Script Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-16
create-kafka-topic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-16
delete-hdfs-files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-16
delete-kafka-topic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-17
delete-storm-topology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-17
run-kafka-console-producer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-17
run-kafka-sample-producer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-17
run-storm-topology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-18
vsdistrib.properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-18
Simplifications of the Kafka-Storm Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-18
Chapter 12: Troubleshooting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-1
Hadoop Build Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-2
GLib Error During Build . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-2
NiFi Build Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-2
Calling the Simple API From More Than One NAR File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-2
Issues Running Hadoop Jobs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-3
Simple Queries using Hive UDFs Fail on Some Hadoop Distributions . . . . . . . . . . . . . . . . . . 12-3
Queries Using Hive UDFs Fail with Literal Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-4
Hive Script Changes Required When Using Hive 3.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-4
Hive Queries Fail in Kerberized Clusters When kinit Is Not Performed . . . . . . . . . . . . . . . . . 12-4
Binary Hive UDFs Fail Due to Data Being Too Large for the REST API . . . . . . . . . . . . . . . . 12-4
Failure to Copy JAR Files to the hive/lib Directory on All Data Nodes . . . . . . . . . . . . . . . . . . 12-4
MapReduce Jobs Fail with NoClassDefFoundError . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-5
Sqoop Steps (Including codegen) Fail with DB Driver Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-5
Sqoop codegen Command Fails with Streaming Result Set Error . . . . . . . . . . . . . . . . . . . . . . 12-6
Sqoop Jobs Fail with ORM Class Method Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-7
Data Protection Operations Fail with TLS Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-8
Simple API Operations for Dates Fail with VE_ERROR_GENERAL Error . . . . . . . . . . . . . . . 12-9
Simple API Operations Fail with Authentication Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-9
Simple API Operations Fail with Library Load Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-9
Simple API Operations Fail with Network Connection Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-9
Developer Templates Cannot Find the Configuration Files in HDFS . . . . . . . . . . . . . . . . . 12-10
Unable to Load libvibesimplejava.so . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-10
Remote Queries Using Hive UDFs Fail on BigInsight 4.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-10
Hadoop Job Error: “Container exited with a non-zero exit code 134” . . . . . . . . . . . . . . . . 12-11
Hadoop Job Not Failing when Invalid Auth Credentials Used for Access . . . . . . . . . . . . . 12-11
Hadoop Job Fails with REST on Older Voltage SecureData Servers . . . . . . . . . . . . . . . . . 12-11
Hadoop Job Fails with Specific REST API Error Code and Message . . . . . . . . . . . . . . . . . . 12-11
Sqoop Import Job: REST API Error UNSUPPORTED_CODEPOINT . . . . . . . . . . . . . . . . . . 12-12
Hadoop Job Tasks Fail When Voltage SecureData Server is Overloaded . . . . . . . . . . . . 12-12
Error When Using Spark (Spark1) Instead of Spark2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-12
Appendix A: Voltage SecureData Server Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . A-1
Formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-1
cc-sst-6-4 Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-2
DATE-ISO-8601 Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-3
AlphaExtendedTest1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-3
Identity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-4
Authentication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-5
1 Introduction to the Developer Templates
This introduction provides a broad description of the Voltage SecureData Developer
Templates, a brief description of each of the reference implementations they contain, and their
intended use. It also describes the organization of the remainder of this document.
What are the Developer Templates?
Originally, the Developer Templates were strictly the Hadoop Developer Templates, providing
reference implementations of integrating Voltage SecureData APIs with several different
Hadoop components (MapReduce, Hive, and Sqoop). In subsequent releases, these reference
implementations have been enhanced and more reference implementations have been added.
The additional reference implementations include Hadoop components (such as Spark and
Impala) as well as similar open source solutions that don’t necessarily involve a Hadoop cluster
(such as NiFi, Kafka, Storm, and StreamSets).
The Developer Templates include the following:
• Java 8 source code that shows how calls to the Voltage SecureData Simple API and
REST API can be integrated into:
• Hadoop jobs, including MapReduce, Hive, Impala, Sqoop, and Spark, known
collectively as the Hadoop Developer Templates.
NOTE: The Spark integration is unique in that it includes both Scala and
Python source code.
• Stream-oriented workflows that use NiFi, StreamSets, Kafka Connect, and Kafka
and Storm working together, known collectively as the DataStream Developer
Templates.
• Configuration files that provide the information required for the demonstrated protect
and access operations.
• Sample data to demonstrate the protect and access operations in the reference
integrations.
• Where appropriate, scripts to set up and run the various Hadoop and
DataStream reference integrations.
• Remote Hive query infrastructure directories and files that let you simulate a Business
Intelligence (BI) tool conducting a Hive query from a computer outside your Hadoop
cluster.
• Support files for the various Hadoop and DataStream Developer Template reference
integrations.
• This document, the Voltage SecureData Developer Templates for Hadoop Integration
Guide, which provides detailed descriptions of, and instructions for using, the Developer
Templates.
The Developer Templates provide the following reference integrations:
• MapReduce: Protects plaintext and accesses ciphertext in CSV files between a source
HDFS location and a target HDFS location.
• Hive: UDFs that access columns of ciphertext in a Hive table using a HiveQL query.
• Impala: UDFs that access columns of ciphertext in an Impala table using a SQL query.
• Sqoop: Protects plaintext as it is imported from a relational database table into HDFS;
• Spark: Uses several different Spark data representation and query technologies to
protect plaintext and access ciphertext in CSV files between a source HDFS location and
a target HDFS location. The Spark Developer Template includes reference
implementations in both Scala and Python (using the PySpark libraries).
• NiFi: Uses a NiFi processor to protect plaintext and access ciphertext flowing through it.
• Kafka-Storm: Protects plaintext read from a Kafka topic using a Storm bolt and writes
the result to a file in HDFS.
These reference integrations use the Simple API and REST API to protect and access
Personally Identifiable Information (PII) using Format-Preserving Encryption (FPE). They also
use the REST API to protect and access Payment Card Industry (PCI) data using Secure
Stateless Tokenization™ (SST).
NOTE: Although the Developer Templates use FPE and SST as described above, these
protection technologies are not strictly tied to PII and PCI in this manner.
These reference integrations have been tested on several Hadoop distributions. For more
information, see the Voltage SecureData Developer Templates for Hadoop Version 5.0 Release
Notes.
Intended Use of the Developer Templates
In some cases, and for some aspects of the provided source code, the Developer Templates
demonstrate best practices when using the Voltage SecureData APIs, providing source code
that you can re-purpose more or less as is in your production solutions. In other cases and
aspects, the Developer Templates employ simplified techniques appropriate for demonstration
code that will need to be reworked for use in production solutions.
That said, the source code provided is meant to serve only as a starting point in the
development of production solutions. You must determine how best to adapt this code for your
own uses. You will need to customize this code to perform different types of protect and access
operations, on different data types and formats, and even perhaps use different underlying
libraries or adapt it to work with different technologies, such as Hadoop’s Pig or Flume.
For example, the Developer Templates code uses the Apache HttpClient 4.x library as its REST
client library, but your production implementation can use a different client-side library or
tool. The Voltage SecureData Server provides a REST-based web service, so you can use any
client library that sends and receives the appropriate type of messages.
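To make that point concrete, the following minimal Java sketch shows the general shape of such
a call using Apache HttpClient 4.x (assuming version 4.3 or later). The host name, endpoint path,
and JSON body shown here are placeholders only and are not the actual Voltage SecureData
REST API; consult the REST API documentation for your Voltage SecureData Server for the real
URLs, request format, and authentication parameters.

    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpPost;
    import org.apache.http.entity.ContentType;
    import org.apache.http.entity.StringEntity;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;

    public class RestClientSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder host, path, and body; substitute the values required by your
            // Voltage SecureData Server and the operation you are performing.
            String url = "https://your-securedata-server.example.com/placeholder/rest/endpoint";
            String requestJson = "{ \"placeholder\": \"request body\" }";

            try (CloseableHttpClient client = HttpClients.createDefault()) {
                HttpPost post = new HttpPost(url);
                post.setEntity(new StringEntity(requestJson, ContentType.APPLICATION_JSON));

                try (CloseableHttpResponse response = client.execute(post)) {
                    // Print the HTTP status line and the raw response body.
                    System.out.println(response.getStatusLine());
                    System.out.println(EntityUtils.toString(response.getEntity()));
                }
            }
        }
    }

Any HTTP client that can exchange the same messages over TLS, in any language, could be
substituted for HttpClient without affecting the server side.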
How this Document is Organized
• Chapter 2, “Install and Build the Developer Templates” - This chapter provides
instructions for installing and building all of the Developer Templates, as well as
providing information about the version requirements for supporting software.
This chapter also provides information about other common aspects of the Developer
Templates, such as the requirements for configuration files they use, an overview of the
sample data shared by the Hadoop Developer Templates, common HDFS procedures
for the Hadoop Developer Templates, logging and error handling in the Hadoop
Developer Templates, and known limitations across all of the Developer Templates.
In general, detailed information about the Developer Templates that is not specific to an
individual template can be found in this chapter.
• Chapter 5, “Hive Integration” - This chapter provides information that is specific to the
Hive reference integration, including its architecture, its configuration, and how to run it
using the provided sample data. It also provides information about the Impala reference
integration.
• Chapter 6, “Sqoop Integration” - This chapter provides information that is specific to the
Sqoop reference integration, including its architecture, its configuration, and how to run
it using the provided sample data.
• Chapter 7, “Spark Integration” - This chapter provides information that is specific to the
Spark reference integration, including its architecture, its configuration, and how to run it
using the provided sample data.
• Chapter 8, “NiFi Integration” - This chapter provides information that is specific to the
NiFi reference integration, including its architecture, its configuration, and how to build,
deploy, and run the NiFi Developer Template using the provided ready-to-run NiFi
workflow. It also explains how to integrate the NiFi processor that demonstrates Voltage
SecureData functionality into a NiFi workflow from scratch.
• Chapter 10, “Kafka Connect Integration” - This chapter provides information that is
specific to the Kafka Connect reference integration, including its architecture, its
configuration, and how to build and run the Kafka Connect Developer Template using
the provided Kafka source and sink connector transformations.
2 Install and Build the Developer Templates
This chapter describes the required related software as well as how to install and build the
Voltage SecureData Developer Templates for Hadoop on Linux.
This section describes the other software required to use the Developer Templates.
NOTE: See the Voltage SecureData Simple API Installation Guide for instructions for
installing the Simple API.
For the Hadoop Developer Templates, install the Simple API in the same directory on all data
nodes in your Hadoop cluster.
For the Datastream Developer Templates, install the Simple API in a directory on the
computer(s) where your data stream workflow will execute.
The XML configuration file vsconfig.xml and the NiFi configuration file
vsnifi.properties use the configuration setting "Simple API Install Path" (page 3-7) to
specify the Simple API installation directory. If you use these files’ default installation
directory /opt/voltage/simpleapi as your Simple API installation directory, you can
leave this configuration setting unchanged. Otherwise, you will need to change it in one or
both of these files, depending on which Developer Template(s) you are using.
You may want to run one or more of the sample programs provided with the Simple API in
order to independently verify that your Simple API installation is working correctly. If you do,
do so as the user under which you will be running these Developer Templates. For more
information about running the Simple API sample programs, see the Voltage SecureData
Simple API Developer Guide.
In order to use the Hadoop Developer Templates for production, you must have the Voltage
SecureData Server, version 5.8.2 or later, potentially including a license to use the SST
technology.
Using version 6.0 or later of the Voltage SecureData Server and version 5.0 or later of the
Simple API is recommended. This will enable you to protect and access extended character sets
using either the REST API, available in version 6.0 or later of the Voltage SecureData Server, or
the Simple API, available in version 5.0 or later.
Java Requirements
You must have the following versions of Java and Apache Maven to build and run the Hadoop
Developer Templates:
NOTE: The NiFi Developer Template requires the use of Maven 3.1.0 or later.
• Java 8
Hadoop Requirements
To use the Hadoop Developer Templates, you must already have a Hadoop cluster configured.
This version of the Hadoop Developer Templates has been tested on several versions of the
following Hadoop distributions:
• MapR
Refer to the latest Release Notes for the list of Hadoop distribution versions with which this
version of the Hadoop Developer Templates has been tested. Also note that this version of the
Hadoop Developer Templates is likely to continue to work with earlier versions of these
Hadoop distributions, with which earlier versions of the Hadoop Developer Templates were
tested.
For more information about Amazon Web Services, see https://aws.amazon.com.
The following sections provide an overview of the steps required, but do not attempt to provide
complete instructions:
• Setting Up to Run the Hadoop Developer Templates on Amazon EMR (page 2-5)
Setting up the Amazon EMR environment involves acquiring the Amazon Web Services
(AWS) prerequisites for Amazon EMR, configuring your Amazon EMR network, and
creating your Amazon EMR cluster. The high-level steps for accomplishing this are outlined
below. For detailed instructions, refer to Amazon’s on-line documentation.
Perform the following steps to acquire the AWS prerequisites required for using
Amazon EMR:
NOTE: An Amazon S3 bucket is used to store cluster log files and output data.
Because of Hadoop requirements, S3 bucket names used with Amazon EMR
have the following constraints:
• Must contain only lowercase letters (a-z), digits (0-9), periods (.),
and hyphens (-)
If you already have an S3 bucket that meets the criteria specified above, it can
be used. Otherwise create a new S3 bucket with a name that meets these
criteria.
3. Create and download an Amazon Elastic Compute Cloud (Amazon EC2) key pair
(as a .pem file).
The first step in setting up the EMR cluster is to configure the Network Configuration
settings, which include the following steps:
Within AWS's EMR Advanced Options for cluster creation, there are four steps for
configuring your EMR cluster:
1. (Step 1: Software and Steps) Configure your EMR cluster software by choosing
Hadoop, Hive, Spark, Sqoop, Tez, and ZooKeeper, noting that version numbers
for each component may vary.
NOTE: The Spark Developer Template requires that your EMR cluster include
Spark2 (and only Spark2).
You can also add any other components that your scenario requires.
2. (Step 2: Hardware) Configure the virtual hardware for your EMR cluster by
setting the number of nodes of each type (Master, Core, and Task), and their
associated characteristics.
3. (Step 3: General Cluster Settings) Set your general EMR cluster options, such as
its name, logging and debugging characteristics, and so on.
4. (Step 4: Security) Set the security options for your EMR cluster, such as the EC2
key pair and security group created previously.
5. Click Create cluster to create your EMR cluster with the characteristics you have
configured.
AWS will provision and create your EMR cluster, moving through the Bootstrapping and
Starting states until it reaches the Waiting state, at which point the cluster's Summary
tab reflects the characteristics you have configured.
The Hardware tab will show the Master and Core nodes (and optional Task nodes) that
you have configured.
In order to run the Hadoop Developer Templates on Amazon EMR, you first need to be able
to remotely log into your Amazon EMR nodes. To do so, you will need your Amazon EC2
key pair (the third AWS prerequisite mentioned above, which was used when setting up the
security in Step 4 of creating your Amazon EMR cluster) and the public ID/DNS of the
relevant node(s):
ssh -i <location_of_keypair_pem_file> <public_id@public_DNS_name>
You created and downloaded your Amazon EC2 key pair in an earlier step. To find the
public ID/DNS of the relevant node, click on its ID in the cluster Hardware tab, as shown
above.
Some of the installation steps for the Hadoop Developer Templates require that you be
logged in as the root user. To do so on an Amazon EMR node, several additional steps are
required. After remotely logging into the relevant node, perform these steps:
Allow root login and password authentication by uncommenting the following lines
and setting values to yes:
PermitRootLogin yes
PasswordAuthentication yes
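For example, the following sketch makes these changes and restarts the SSH service; it assumes the standard sshd configuration file location (/etc/ssh/sshd_config) and a systemd-based image:

sudo sed -i 's/^#\?PermitRootLogin .*/PermitRootLogin yes/' /etc/ssh/sshd_config
sudo sed -i 's/^#\?PasswordAuthentication .*/PasswordAuthentication yes/' /etc/ssh/sshd_config
sudo systemctl restart sshd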
After completing these steps, log in as the root user on the specified Amazon EMR nodes in
order to install the prerequisites and the Hadoop Developer Templates themselves. Follow
these high-level steps:
1. Install Apache Maven on the Master node or the node where the Hadoop Developer
Templates will be installed by A) following the instructions on the Apache Maven
Web site (https://maven.apache.org/install.html), or B) using the
Amazon Machine Images (AMI) repositories.
2. On all of your Amazon EMR nodes, add a new user and set the corresponding
password:
useradd awsuser
NOTE: You can use another valid username. If you do, substitute that username
as appropriate in the steps that follow.
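For example, the password for the new user can be set as follows (a sketch; substitute your chosen username if it is different):

passwd awsuser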
3. On all of your Amazon EMR nodes, install version 8 of the Java Development Kit
(JDK8) and set the environment variable JAVA_HOME in the bash profile
(~/.bash_profile) for both the root user and for the awsuser user created above.
NOTE: JDK8 may already be installed and the environment variable JAVA_HOME
may already be set. Use the following commands to check:
java -version
echo $JAVA_HOME
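If JAVA_HOME is not already set, entries such as the following can be added to ~/.bash_profile for both users (a sketch; the JDK installation path shown is illustrative and varies by image):

export JAVA_HOME=/usr/lib/jvm/java-1.8.0
export PATH=$JAVA_HOME/bin:$PATH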
4. Install MySQL on one of the Core nodes so that the Sqoop queries will work.
6. Install the Simple API on all Amazon EMR nodes. For detailed step-by-step
instructions, see the Voltage SecureData Simple API Developer Guide.
7. Install the Hadoop Developer Templates as the user awsuser on the Master node
or one of the Core nodes (from where you intend to run the Hadoop Developer
Templates). For detailed step-by-step instructions, see "Installing the Developer
Templates" (page 2-7).
8. Build the Hadoop Developer Templates as the user awsuser on the same node
where you installed them. For detailed step-by-step instructions, see "Building the
Developer Templates" (page 2-10).
9. Create an HDFS directory for user awsuser and change its ownership:
su - hdfs
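For example, after switching to the hdfs user, the directory can be created and its ownership changed with commands such as the following (a sketch; /user/awsuser is the conventional HDFS home directory for the awsuser user):

hdfs dfs -mkdir -p /user/awsuser
hdfs dfs -chown awsuser:awsuser /user/awsuser
exit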
After completing these steps, you are ready to run the Hadoop Developer Templates
samples on Amazon EMR. For information about running a particular Developer Template,
see its corresponding Running section in the template-specific chapters of this guide. For
example:
This section provides instructions for installing the Developer Templates and includes a
description of the corresponding installation payload.
The Developer Templates are packaged in an archive file with a name of the following form:
voltage-hadoop-src-<version>-<build_id>.sh.tar
4. Read through the license agreement and enter y to accept the license.
A message displays the default location in which the Developer Templates will be
installed. This is a subdirectory that includes the version number and package name.
5. Enter Y to choose the default location, or enter n to install the Developer Templates into
the directory in which you copied the .tar package file.
<install_dir> Folder
This folder contains the Maven parent POM file (pom.xml) that controls the build process for
the two multi-module Maven projects below it:
• The Hadoop Developer Templates (MapReduce, Hive and Impala, and Sqoop).
NOTE: The Spark Developer Template, because it is written in Scala and Python, is built
using a different tool: Simple Build Tool (sbt). For more information, see "Building the Spark
Developer Template" (page 2-17).
<install_dir>/bin Folder
This directory contains numerous scripts, including HQL and SQL scripts and scripts specific to
Kerberos authentication, used to run the Hadoop Developer Template samples for MapReduce,
Hive, Impala, and Sqoop.
<install_dir>/clientsamples Folder
This directory hierarchy contains a variety of files for building and executing a Hive query using
a JDBC driver and an ODBC driver on Windows.
For more information about the contents of this directory hierarchy, see Chapter 5, “Hive
Integration”.
<install_dir>/common-crypto Folder
This directory hierarchy contains the Java source code and Javadoc package files for the
infrastructure common to the Developer Templates, including the REST client and build
support infrastructure for the inclusion of the Simple API.
This directory contains a sub-directory named simpleapi, initially empty except for a README
file. This sub-directory is where the Maven build directives and several scripts expect to find the
required Simple API JAR file, vibesimplejava.jar. You will need to copy this file here.
<install_dir>/config Folder
This directory contains the XML configuration files for the Hadoop Developer Templates:
vsconfig.xml and vsauth.xml. It also contains their corresponding schema files in the sub-
directory schema.
<install_dir>/configlocator Folder
This directory contains the Java Properties configuration file config-locator.properties,
which is used in one of the alternative schemes for specifying the location of the standard XML
configuration files vsconfig.xml and vsauth.xml.
For more information about this alternative scheme, see "Config-Locator Properties File
Packaged as a JAR File" (page 3-48).
<install_dir>/dev-templates-src Folder
This directory contains the source code for the Hadoop Developer Templates other than the
Spark Developer Template (MapReduce, Hive, Impala, and Sqoop), including the Hadoop-
specific authentication and configuration code they share.
<install_dir>/eula Folder
This directory contains the End-User License Agreement to which you are bound, in the file
Micro_Focus_EULA.txt.
<install_dir>/sampledata Folder
This directory contains the sample data text files for the Hadoop Developer Templates,
including the sample data file encoded_binary.csv, which contains Base64-encoded binary
data for demonstrating the Hive and Impala binary UDFs.
<install_dir>/spark Folder
This directory hierarchy contains the sub-directories and files specific to the Spark Developer
Template, including the file build.sbt, which specifies the Scala build settings for Spark
Developer Template, and the sub-directories bin, config, lib, project, sampledata, and src that
contain the scripts, a Spark-specific XML configuration file, information and data for building
and running the Spark Developer Template samples, and its Scala and Python source code.
<install_dir>/stream Folder
This directory contains the Maven build files (pom.xml and build_project.sh) for all of the
Datastream Developer Templates (StreamSets, NiFi, Kafka Connect, and Kafka-Storm).
NOTE: You must edit a variety of properties in the pom.xml file prior to building the
Datastream Developer Templates, such as to specify the Kafka and Storm versions being
used by your Hortonworks or Cloudera distribution.
<install_dir>/stream/kafka_connect Folder
This directory hierarchy contains the files and sub-directories specific to the Kafka Connect
Developer Template, including build files, scripts, XML configuration files, sample data, and Java
source code.
<install_dir>/stream/kafka_storm Folder
This directory hierarchy contains the files and sub-directories specific to the Kafka-Storm
Developer Template, including build files, scripts, XML configuration files, sample data, and Java
source code.
<install_dir>/stream/nifi Folder
This directory hierarchy contains the files and sub-directories specific to the NiFi Developer
Template, including build files, scripts, XML configuration files, sample data, a workflow
template, and Java source code.
<install_dir>/stream/stream_common Folder
This directory hierarchy contains the files and sub-directories shared by many of the
Datastream Developer Templates (StreamSets, Kafka Connect, and Kafka-Storm), including
build files and Java source code.
<install_dir>/stream/streamsets_processor Folder
This directory hierarchy contains the files and sub-directories specific to the StreamSets
Developer Template, including build files, scripts, XML configuration files, sample data and
pipelines, resources, and Java source code.
• Building the MapReduce, Hive, and Sqoop Developer Templates (page 2-11)
Use the Hadoop Developer Templates source code and build infrastructure to generate the
JAR files to be copied to your Hadoop environment for the MapReduce, Hive, and Sqoop
Developer Templates. The build instructions in this section are divided into the following
categories, each addressed in its own sub-section:
• Editing the Properties in the Parent Maven POM File (page 2-11)
Build Prerequisites
The following software packages must be installed and available at compile-time in order to
build the Java-based Hadoop Developer Templates:
NOTE: The NiFi Developer Template, which is built as part of the Datastream
Developer Templates, requires the use of Maven 3.1.0 or later. For more information,
see "Building the Datastream Developer Templates" (page 2-19).
Element: repo.id
This element provides a descriptive name of the remote repository, such as hortonworks
or cloudera, depending on the Hadoop distribution source you are using.
Element: repo.url
This element provides the full URL of the remote repository from which the relevant
dependency JAR files will be pulled. Standard repository URLs for Hortonworks, Cloudera,
MapR, and EMR, respectively, are as follows (where appropriate, shown on two lines to
improve readability):
• http://repo.hortonworks.com/content/groups/public
• https://repository.cloudera.com/artifactory/cloudera-repos
• https://repository.mapr.com/nexus/content/groups/mapr-public/
• https://<s3-endpoint>/<region-ID>-emr-artifacts/
<emr-release-label>/repos/maven/
Where:
<s3-endpoint> is the Amazon S3 endpoint of the region for the repository. For
example: s3.us-west-1.amazonaws.com.
An example of the full URL is the following (shown on two lines to improve
readability):
https://s3.us-west-1.amazonaws.com/
us-west-1-emr-artifacts/emr-5.30.0/repos/maven
NOTE: This EMR repository does not contain the artifacts required to
successfully build the Spark and Sqoop templates. See the Release Notes for
work-arounds.
Element: hadoop.annotations.version
This element provides the version of the Apache Hadoop Annotations library that you want
to use.
Element: hadoop.common.version
This element provides the version of the Apache Hadoop Common library that you want to
use.
Element: hadoop.mapreduce.version
This element provides the version of the Apache Hadoop MapReduce library that you want
to use.
Element: hive.exec.version
This element provides the version of the Apache Hive Exec library that you want to use.
Element: sqoop.version
This element provides the version of the Apache Sqoop library that you want to use.
Element: maven.local.dir
This element provides the name of the directory that is set as the local Maven repository
(by default, ${user.home}/.m2/repository).
NOTE: Some newer Hadoop distributions have changed such that they no longer use
the deprecated Cloudera package for their Sqoop-generated ORM classes. This Maven
build process will dynamically adapt to the use of either the older, deprecated Cloudera
Sqoop package or the newer Apache Sqoop package. For more information, see "Support
for Newer Apache Sqoop 1.x Versions" (page 2-16).
<hadoop.annotations.version>2.7.0-mapr-1808</hadoop.annotations.version>
<hadoop.common.version>2.7.0-mapr-1808</hadoop.common.version>
<hadoop.mapreduce.version>2.7.0-mapr-1808</hadoop.mapreduce.version>
<hive.exec.version>2.3.6-mapr-1912</hive.exec.version>
<sqoop.version>1.4.7-mapr-1904</sqoop.version>
<maven.local.dir>/home/testuser/.m2/repository</maven.local.dir>
Build Steps
After you have edited the properties in the root POM file, follow the steps in this section to build
the MapReduce, Hive, and Sqoop Developer Templates.
NOTE: In the following steps, <install-dir> is the directory in which you installed the
Developer Templates.
1. Ensure that Maven and javac are in your path, that JAVA_HOME is set to the
installation location of JDK 8, and that wsimport is available as part of that JDK
installation.
2. Copy the file vibesimplejava.jar from a Simple API installation to the Hadoop
Developer Templates installation:
From: <simpleapi-install-dir>/lib
To: <install-dir>/common-crypto/simpleapi
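For example, assuming the default Simple API installation directory (a sketch):

cp /opt/voltage/simpleapi/lib/vibesimplejava.jar <install-dir>/common-crypto/simpleapi/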
3. Change directory (cd) to the directory in which you installed the Developer Templates
(<install_dir>).
NOTE: Because install is set as the default goal in the pom.xml file in this
directory, no parameters are required. You could also run either of the following two
commands with the same result:
mvn install
mvn clean install (removes all relevant JAR files before building)
When the build completes successfully, the following JAR files will have been added to the bin
directory of your build location:
• voltage-hadoop-<version>.jar
All of the core classes and their main dependencies, such as the
vsrestclient.jar and its JSON and HTTP Client library dependencies, are
packaged into a single JAR file, simplifying the commands used to run the
various Hadoop jobs and queries. Such a combined JAR file is often referred to
as an uber JAR file or fat JAR file.
This main Hadoop Developer Templates uber JAR file is built with the
appropriate version number in its filename; for example, in the first version that
adopted this convention, the filename was voltage-hadoop-4.0.0.jar. The
Hadoop Developer Templates also follow the common convention of creating a
symbolic link (symlink) to this versioned filename using an unversioned symlink
name (see below). This allows the various scripts to remain the same from
version to version, picking up the latest version of the JAR file by referring to the
unversioned symlink name.
For simplicity, in the remainder of this document, the unversioned symlink name,
voltage-hadoop.jar, is used to refer to the corresponding versioned uber
JAR file.
• voltage-hadoop.jar
• voltage-hadoop-core.jar
In addition to the default goal, the Maven build also generates the Javadoc
documentation in the javadocs subdirectory of the main build location.
5. (Optional) If you want to run the Hadoop Developer Templates on another Hadoop
node, copy the following subdirectories to the location where you will run the Hadoop
jobs:
• bin
• config
• sampledata
The newest releases of Sqoop 1.x, such as Sqoop 1.4.7 included in Hortonworks Data Platform
(HDP) 3.0, no longer reference the deprecated Cloudera package or its classes. Instead, both
its core Sqoop JAR file and any ORM classes it generates use the Apache package and its
classes. Without accommodating modifications, the Sqoop Developer Template will fail to build
on Hadoop distributions with this change.
To solve this build issue without sacrificing backward compatibility, the Hadoop Developer
Templates’ Maven build infrastructure detects which Sqoop package (Cloudera or Apache) is
used by the core Sqoop JAR file in the compile-time classpath and dynamically adjusts the
import directives in the Sqoop Developer Template’s Java source files. If the old Cloudera
package is being used, import directives for its classes are used; otherwise, import directives
for the same classes in the new Apache packages are used. This approach allows the build
process to complete successfully regardless of which Sqoop package is used in a particular
Hadoop distribution.
Note that it is possible to build the Sqoop Developer Template successfully but still have
runtime issues related to incompatible Sqoop packages. If you specify the wrong Sqoop version
using the sqoop.version property element in the root-level POM.xml file, the Sqoop
Developer Template integration classes in that JAR file may not import and use the appropriate
Sqoop packages for the runtime version of Sqoop on your Hadoop cluster. If this happens,
despite the build having succeeded, the Sqoop import job can fail at runtime with one of the
following errors, depending on exactly how the compile-time and runtime Sqoop jars are
mismatched:
• If you build by including the newer Apache Sqoop JAR file at compile-time but try to use
a Cloudera (old) Sqoop JAR file at runtime (for example, sqoop-1.4.7.x.jar used to
build, but using sqoop-1.4.6.x.jar on an HDP 2.6.x cluster), you will get the
following error:
java.lang.ClassCastException: com.voltage.securedata.hadoop.sqoop.SqoopImportProtector
cannot be cast to com.cloudera.sqoop.lib.SqoopRecord
• If you build by including the older Cloudera Sqoop JAR file at compile-time but try to
use an Apache (new) Sqoop JAR file at runtime (for example, sqoop-1.4.6.x.jar
used to build, but using sqoop-1.4.7.x.jar on an HDP 3.0 cluster), you will get the
following error:
java.lang.ClassNotFoundException: com.cloudera.sqoop.lib.SqoopRecord
If you encounter either of the above exceptions when running the Sqoop import job using the
JAR file voltage-hadoop.jar, ensure that you have specified the correct Sqoop version
(using the sqoop.version property element in the root-level POM.xml file) and then rebuild
the Hadoop Developer Templates. As described above, the Maven build infrastructure will
examine the Sqoop JAR file in the local Maven repository and then modify the Sqoop
Developer Template Java source code so that the correct matching classes are imported for
their use.
Use the Spark Developer Template Scala and Python source code and build infrastructure to
generate the JAR files to be copied to your Hadoop environment for this template. The build
instructions in this section are divided into the following categories, each addressed in its own
sub-section:
Build Prerequisites
The following software must be installed in the build environment:
• Scala
• sbt
NOTE: The Spark Developer Template was developed using version 2.11.8 of Scala and
version 1.1.0 of sbt. sbt pulls in any project dependencies automatically from the
org.apache.spark Maven repository. The file <install-dir>/spark/build.sbt
specifies the Scala version to use along with the library versions that are compatible with
that version of Scala. You can change the version of Scala to a different version as long as
you also adjust the versions of the dependent libraries accordingly. The versions listed in the
build.sbt file, as delivered, are:
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "2.0.0"
libraryDependencies += "org.apache.spark" % "spark-hive_2.11" % "2.0.0"
libraryDependencies += "org.scala-lang" % "scala-reflect" % "2.11.8"
libraryDependencies += "org.scala-lang" % "scala-library" % "2.11.8"
Build Steps
After installing the Hadoop Developer Templates package by following the steps in "Installing
the Developer Templates" (page 2-7), confirming the build prerequisites, and making any
version changes in the file build.sbt, follow these steps to build the Spark Developer
Template:
1. Install the Simple API on the Hadoop nodes on which you will be running Spark. You
must install the same version of the Simple API, in the same location, on all nodes in your
Hadoop cluster.
2. Edit the configuration files. If you are initially running the Spark sample using the
dataprotection Key Server hosted by Micro Focus Data Security, you should only
need to edit the installPath attribute of the simpleAPI element in the following file:
<install-dir>/config/vsconfig.xml
Set this property to the path where you installed the Simple API on the Spark worker
nodes in the previous step.
When you begin to run the Spark sample and your own Spark jobs using your own
Voltage SecureData Server and Key Server, you will need to make additional changes to
the XML configuration files vsconfig.xml and vsauth.xml as well as to the Spark-
specific configuration file vsspark-rdd.xml.
3. Build the other Hadoop Developer Templates by following the steps in "Building the
MapReduce, Hive, and Sqoop Developer Templates" (page 2-11).
This script copies all of the necessary CryptoFactory and related JAR files to the
folder <install-dir>/spark/lib and, if necessary, copies the configuration files and
plaintext CSV file used by the Spark sample to HDFS.
The following directory will be created regardless of whether the sbt build command is
successful:
<install-dir>/spark/target
And if the sbt build command is successful, a JAR file with the following name will be placed in
a sub-directory of the target directory:
<install-dir>/spark/target/scala-<version>/
spark-<version>-SNAPSHOT.jar
This JAR file will contain the compiled Scala code to be used by the scripts that run the Spark
sample job.
NOTE: sbt downloads dependencies from remote repositories on the Internet, so your Spark
build computer must have Internet access. If you do not have Internet access from your
Hadoop cluster, you can build the Spark project on a different computer with Internet access
and then transfer the resulting project and target directories to the computer on which
you will run the Spark sample or your custom Spark job.
• <install_dir>/stream/kafka_storm
• <install_dir>/stream/nifi
• <install_dir>/stream/streamsets_processor
The build process for the Java-based Datastream Developer Templates is managed as a multi-
module Maven project. Under this model, the Project Object Model (POM) file pom.xml in the
top-level directory, which contains the Maven build directives, is defined as the parent POM of
the pom.xml files in each of these sub-directories.
Use the Datastream Developer Templates source code and build infrastructure to generate the
JAR files to be copied to your runtime environment for the StreamSets, NiFi, Kafka Connect, and
Kafka-Storm Developer Templates. The build instructions in this section are divided into the
following categories, each addressed in its own sub-section:
• Editing the Properties in the Parent Maven POM File (page 2-20)
This section provides information about the prerequisites for building the Datastream
Developer Templates, defining properties as required before initiating the Maven build, and
then using Maven to build one or more of these Developer Templates.
Build Prerequisites
The following software packages must be installed and available at compile-time in order to
build the Kafka Connect Developer Template:
NOTE: The NiFi Developer Template requires the use of Maven 3.1.0 or later.
Property: repo.id
This property provides a descriptive name of the remote repository, such as hortonworks,
depending on the distribution source you are using.
If you provide an alternative Maven repository using the repo.url property, provide a
descriptive name for that repository as the value of this property.
Property: repo.url
This property provides the full URL of the remote repository from which the relevant
dependency JAR files will be pulled. For example:
https://repo.hortonworks.com/content/groups/public
https://repository.cloudera.com/artifactory/cloudera-repos
If no value is provided for this property (its value is left as placeHolderValue), the default
Maven repository will be used. To use a different Maven repository for the Datastream
Developer Template you are building, provide the URL of the alternative repository as the
value of this property.
Property: kafka.artifact.id
This property provides the artifact ID for Kafka. Use the artifact id in the following JAR
filename in the libs directory of the Kafka installation location:
<kafka-install-location>/libs/<kafka-artifact-id>-<version>.jar
The property value is the <kafka-artifact-id> part of the filename above, which starts
with the name kafka_. For example:
kafka_2.10
Provide a value for this property only when you are building the Kafka Connect Developer
Template and/or the Kafka-Storm Developer Template. Otherwise, leave this property set
to placeHolderValue.
Property: kafka.version
This property provides the version number of the relevant Kafka dependency JAR files. Use
the version number in the following JAR filename in the libs directory of the Kafka
installation location:
<kafka-install-location>/libs/<kafka-artifact-id>-<version>.jar
The property value is the <version> part of the filename above. For example:
0.10.0.2.5.3.0-37
Provide a value for this property only when you are building the Kafka Connect Developer
Template and/or the Kafka-Storm Developer Template. Otherwise, leave this property set
to placeHolderValue.
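For example, both values can be read directly from the Kafka JAR filename (a sketch):

ls <kafka-install-location>/libs/kafka_*.jar

A filename such as kafka_2.10-0.10.0.2.5.3.0-37.jar yields the artifact ID kafka_2.10 and the version 0.10.0.2.5.3.0-37.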
Property: storm.version
This property provides the version number of the relevant Storm dependencies and JAR
files. Use the version number in the following JAR filename in the lib directory of the Storm
installation location:
<storm-install-location>/lib/storm-core-<version>.jar
The property value is the <version> part of the filename above. For example:
1.0.1.2.5.3.0-37
Provide a value for this property only when you are building the Kafka-Storm Developer
Template. Otherwise, leave this property set to placeHolderValue.
Property: storm.kafka.client.version
This property provides the version number of the KafkaSpout dependencies and JAR files,
which is usually the same as the storm.version value. If you cannot find a JAR file with a
name of the form storm-kafka-client-<version>.jar in your Storm installation, use
the same value as for the property storm.version, described above.
The property value is the <version> part of the filename above. For example:
1.0.1.2.5.3.0-37
If using the same value as for the property storm.version does not work, use the
following links to look up the correct value:
• HDP: http://repo.hortonworks.com/content/groups/public/org/apache/storm/storm-kafka-client/
• CDH: https://mvnrepository.com/artifact/org.apache.storm/storm-kafka-client
Provide a value for this property only when you are building the Kafka-Storm Developer
Template. Otherwise, leave this property set to placeHolderValue.
Property: storm.hdfs.version
This property provides the version number of the HdfsBolt dependencies and JAR files,
which is usually the same as the storm.version value. Use the version number in the
following JAR filename in the contrib (HDP) or external (CDH) directory of the Storm
installation location:
<storm-install-location>/contrib/storm-hdfs/
storm-hdfs-<version>.jar
- or -
<storm-install-location>/external/storm-hdfs/
storm-hdfs-<version>.jar
The property value is the <version> part of the filename above. For example:
1.0.1.2.5.3.0-37
Provide a value for this property only when you are building the Kafka-Storm Developer
Template. Otherwise, leave this property set to placeHolderValue.
Property: nifi.api.version
This property provides the version number of the NiFi API dependencies and JAR files. Use
the version number in the JAR filename nifi-api-<version>.jar within the NiFi
installation location. For example:
Provide a value for this property only when you are building the NiFi Developer Template.
Otherwise, leave this property set to placeHolderValue.
Property: nifi.utils.version
This property provides the version number of the NiFi Utilities dependencies and JAR files.
Use the version number in the JAR filename nifi-utils-<version>.jar within the
NiFi installation location. For example:
Provide a value for this property only when you are building the NiFi Developer Template.
Otherwise, leave this property set to placeHolderValue.
NOTE: You must make sure to specify the appropriate versions of these dependencies, from
the appropriate repository, depending on your specific environment. If you specify a
repository URL and/or set of dependency versions that do not match your environment, the
Maven build may still succeed but, for example, the generated Storm topology JAR file may
later fail to run properly. If you encounter unexpected runtime failures, make sure you built
the relevant Datastream Developer Template(s) using the correct repository and
dependencies for your environment.
Build Steps
After installing the Voltage SecureData Developer Templates for Hadoop package, confirming
that build prerequisites are met, and correctly editing the property values at the top of the
parent POM file pom.xml, build one or more of the Datastream Developer Templates by
following these steps:
NOTE: In the following steps, <install-dir> is the directory in which you installed the
Developer Templates.
1. Install the Simple API on the computer(s) on which you are running one or more of the
Datastream Developer Templates. If you are running a cluster, such as a cluster of Storm
workers, install the same version of the Simple API in the same location on all nodes in
the cluster.
2. Copy the file vibesimplejava.jar from a Simple API installation to the Developer
Templates installation on each computer on which it has been installed:
From: <simpleapi-install-dir>/lib
To: <install-dir>/common-crypto/simpleapi
4. Confirm that you have correctly edited the properties at the top of the parent Maven
build file pom.xml in this directory to specify the relevant values for your environment
and the Datastream Developer Templates you intend to build and use. For more
information, see "Editing the Properties in the Parent Maven POM File" (page 2-20).
5. Run the script build_project, which simplifies the Maven command line, once for each
Datastream Developer Template that you want to build, or run it once to build them all
at once:
NOTE: This parameter has a leading colon (:) character because it identifies
an artifactId in a POM.xml file two levels down. The parameters for
building the other Datastream Developer Templates identify modules defined
in the top-level POM.xml file.
These build commands always rebuild the Voltage SecureData code on which they all
depend, starting with the shared code in <install-dir>/common-crypto, then the
common code for the Datastream Developer Templates (<install_dir>/stream/
stream_common), and finally the code for the individual Datastream Developer
Template(s). The results, including the dependency JAR files for each Datastream
Developer Template, are packaged as shown for that Developer Template (usually as a
single uber JAR file):
Build Notes
• By default, when building the StreamSets Developer Template, several tests are run
using built-in data and the dataprotection Key Server hosted by Micro Focus Data
Security. These tests also assume that the Simple API has been installed in the directory
/opt/voltage/simpleapi. If this Key Server is not available, or if the Simple API is
installed in a different directory, the build (and test) process will fail.
To build the StreamSets Developer Template without running the associated tests,
include the extra command line parameter -DskipTests in the call to the
build_project script:
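For illustration only, such a call might look like the following, where <streamsets-module> is a placeholder for the StreamSets module identifier defined in the parent pom.xml (not a value documented here):

./build_project <streamsets-module> -DskipTests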
NOTE: Any extra parameters specified when calling the build_project script will
be passed, as is, to the Maven command line within it.
• Storm topologies are generally run using a single JAR file. Because of this, the Maven
build directives file pom.xml uses the Maven Shade Plugin to produce an uber-jar file
that contains all of the relevant dependencies packaged together. This includes the
Kafka and Storm libraries as well as all their dependencies and sub-dependencies,
transitively. For this reason, the resulting uber-jar file vs-kafka-storm-1.0.jar is
relatively large: approximately 60 MB. For reference and comparison, the classes for the
actual Kafka-Storm Developer Template integration code are originally packaged into
the JAR file original-vs-kafka-storm-1.0.jar, which at about 15 KB, is much
smaller. The required dependencies account for the large difference in these file sizes.
In contrast, the uber JAR file for the Kafka Connect Developer Template (vs-kafka-
connect-1.0.jar) is considerably smaller due to fewer downloaded dependencies
(and sub-dependencies).
• During the build process, Maven downloads dependencies from remote repositories on
the Internet. This means that the computer on which you are running the Maven build
process must have Internet access. If your target environment does not have Internet
access, you can run the Maven build on a different computer that does have Internet
access and then transfer the resulting files to your target environment.
• When the Simple API is called from multiple processors packaged in different NAR files,
JNI ClassLoader issues arise. In NiFi, each individual NAR file is loaded by a different
child ClassLoader, causing errors related to an attempt to load the Simple API more
than once.
NOTE: The Simple API for Java library uses Java Native Interface (JNI) for optimal
performance, with the core cryptographic operations written in the C language. While
this maximizes cryptographic processing speed, it requires that the native library be
loaded from a single ClassLoader, for use by all calling code running in the JVM.
The first NAR file to run will successfully load the Simple API native library, but a
subsequent processor to run, if it is in a different NAR file built as described above, will
fail with an UnsatisfiedLinkError.
The solution for this issue is to build your NAR files such that the supporting JARs,
including the Simple API JAR file vibesimplejava.jar, are not packaged within the
NAR file itself. Instead, the supporting JAR files will be copied to the bin folder along
with the less comprehensive NAR file. All of these files can then be copied to the
directory <nifi-install-location>/lib. This works because any JAR files in this
directory are loaded once by a shared ClassLoader in the main NiFi parent classpath.
To build the NiFi Developer Template NAR file without including the supporting JAR
files within it, run the build command with the parameters -P and external:
./build_project :vs-nifi-processors -P external
After this build command completes, copy the multiple NAR files (such as vs-nifi-
processors.nar) and all of the supporting JAR files (including
vibesimplejava.jar) from the NiFi Developer Template directory
<install-dir>/stream/nifi/bin to the directory <nifi-install-
location>/lib on the NiFi server. After you start (or restart) your NiFi server, the
supporting JAR files, and the Simple API native library, will be loaded once for shared
use.
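For example (a sketch; it assumes all of the NAR and supporting JAR files in the bin directory are needed):

cp <install-dir>/stream/nifi/bin/*.nar <nifi-install-location>/lib/
cp <install-dir>/stream/nifi/bin/*.jar <nifi-install-location>/lib/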
NOTE: Even if you build the NAR file with the supporting JAR files packaged
internally in the bundled-dependencies directory, you can still copy the individual
supporting JAR files to the NiFi server’s lib directory so that they will be loaded
properly for shared use by SecureData processors in different NAR files. This will
avoid the UnsatisfiedLinkError problem, but will inefficiently include redundant
JAR files in multiple NAR files, making them larger than necessary.
3 Common Infrastructure
The Voltage SecureData for Hadoop Developer Templates share common infrastructure. Some
of this sharing is across all of the Developer Templates, including those that are explicitly
related to Hadoop (MapReduce, Hive, Sqoop, and Spark), NiFi, and Kafka-Storm. Other aspects
of the shared infrastructure are designed only for the templates that run in the context of a
Hadoop cluster; for example, the code that accesses the configuration information in the XML
configuration files shared by the MapReduce, Hive, Sqoop, and Spark templates.
• Shared Sample Data for the Hadoop Developer Templates (page 3-73)
• Logging and Error Handling in the Hadoop Developer Templates (page 3-87)
The Hadoop Developer Templates support four methods of authentication, three of which also
provide authorization (to derive a cryptographic key using the specified identity):
• LDAP Plus Shared Secret - This method provides authentication using a shared secret
and authorization using LDAP (generally through LDAP group membership).
• Shared Secret - This method provides authentication using a shared secret and does
not provide user-level authorization.
Corresponding configuration on the Voltage SecureData Server for your chosen method of
authentication (and authorization, if any) is required.
NOTE: Kerberos authentication requires several special configuration steps, including steps
on the computer running the Kerberos Key Distribution Center (KDC). For more information,
see the following subsection, "Configuration Step Summary for Kerberos Authentication"
(page 3-3).
The Developer Templates that run in the context of a Hadoop cluster (MapReduce, Hive,
Sqoop, and Spark) use the following configuration settings, specified in the XML configuration
file vsauth.xml, to provide default and cryptId-specific authentication/authorization
information:
The StreamSets Developer Template uses the following configuration settings, specified in the
XML configuration file vsauth.xml, to provide default and cryptId-specific authentication/
authorization information:
The Kafka Connect Developer Template and the Kafka-Storm Developer Template use the
following configuration settings, specified in the XML configuration file vsauth.xml, to
provide default and cryptId-specific authentication/authorization information:
The NiFi Developer Template uses the following configuration settings on the Properties tab
of the Configure Processor dialog box of the NiFi processor SecureDataProcessor to
provide authentication/authorization information for that processor:
NOTE: The username cannot contain a colon character (:). If it does, authentication will
likely fail. This is because the username and password are combined when communicating
with the Voltage SecureData Key Server, using a colon character as a separator.
When using Kerberos authentication, the shared Hadoop Developer Template code will use the
calling user’s Kerberos ticket granting ticket (TGT) to request a delegation token from the
Voltage SecureData Server, storing the returned token in this location in HDFS with file
permissions set appropriately. The Hadoop job tasks (for MapReduce, Hive, Sqoop, and so on)
running on the individual data nodes in the cluster read that token from this location in HDFS
and use it to request cryptographic keys for local (Simple API) protect or access operations or
to make remote (REST API) protect or access requests to the Voltage SecureData Server.
1. On the computer running the KDC, use the kadmin.local command to add the
Kerberos service principal for the Voltage SecureData Server cluster hostname and
export the service principal to a file named kms.keytab, as follows:
kadmin.local
kadmin.local: addprinc HTTP/voltage-pp-0000.<district-domain>@<Kerberos-realm>
kadmin.local: ktadd -k kms.keytab HTTP/voltage-pp-0000.<district-domain>@<Kerberos-realm>
kadmin.local: exit
NOTE: The service principal name of the Voltage SecureData Server, used in the two
kadmin.local commands above, is not the full URL for the Voltage SecureData
Server. The service principal name must match the form used above AND must
exactly match the value entered into the Principal Name field in the System >
Kerberos page in the Voltage SecureData Management Console. For example:
HTTP/[email protected]
2. As the root user, securely copy the files kms.keytab and /etc/krb5.conf from the
KDC computer to a location from which you can use the Management Console to upload
them.
IMPORTANT: The first of these files, kms.keytab, contains sensitive information that
must be kept secure. The second file, /etc/krb5.conf, defines the Kerberos realm,
which is not sensitive.
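For example, from the KDC computer (a sketch; it assumes kms.keytab was written to the current working directory, and the destination host and path are placeholders):

scp kms.keytab /etc/krb5.conf <user>@<admin-workstation>:<upload-directory>/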
1. On the System > Kerberos page, upload the two files from the previous step to the
Management Console.
2. On the System > Kerberos page, enter the service principal name in the Principal
Name field exactly as it was specified in the kadmin.local commands above, and
then click the Save Settings button.
3. On the Key Management > Authentication page, configure one or more LDAP
Authentication Methods for the Voltage SecureData Key Server. This is what will be
used to authorize users for specific identities for key requests from the Simple API.
4. Also on the Key Management > Authentication page, check the Enable Kerberos
Authentication for Key Requests checkbox to enable Kerberos authentication for key
requests from the Simple API (after which they will be authorized using the methods
configured in the previous step).
6. Also on the Web Service > Identity Authorization page, check the Enable Kerberos
Authentication for Web Service REST Requests checkbox to enable Kerberos
authentication for REST API calls (after which they will be authorized using the rules
configured in the previous step).
7. Deploy the settings from the Management Console to the host(s) in the Voltage
SecureData Server cluster.
IMPORTANT: The Kerberos protocol uses tickets with timestamps. In order for Kerberos
authentication to function properly, the Hadoop nodes, the KDC and the Voltage SecureData
Server must have synchronized system clocks. If the clock times on these computers drift too
far apart, Kerberos authentication may fail, resulting in error messages being logged in the
Voltage SecureData Key Server debug.log file.
Micro Focus Data Security strongly recommends that you use the Network Time Protocol
(NTP) to keep the clock times synchronized on the relevant computers.
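For example, on systemd-based systems, the synchronization status can be checked on each node with a command such as the following (a sketch; tooling varies by operating system):

timedatectl status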
Configuration Settings
This section provides detailed information about the configuration settings available for the
Hadoop Developer Templates through the use of two different types of configuration files:
• The XML configuration files used by the Hadoop Developer Templates that run in the
context of a Hadoop cluster (MapReduce, Hive, Sqoop, and Spark). For more
information, see "XML Configuration Files" (page 3-32).
• The Java Properties configuration files used by the NiFi Developer Template and the
Kafka-Storm Developer Template. For more information, see "Java Properties
Configuration Files" (page 3-45).
This section provides information about the relevant configuration settings in a generic way,
while also identifying how this information is provided for the different Hadoop Developer
Templates. The configuration settings fall into two classes, as follows:
• Sometimes relevant to the XML configuration files (MapReduce, Hive, Sqoop, and
Spark), the Java Properties configuration files (NiFi and Kafka-Storm), and even to NiFi
user interface choices, the direct settings that identify Voltage SecureData resources
and that provide relevant information needed during cryptographic operations. An
example of the former is the Voltage SecureData Server that you are using. Examples of
the latter include the data protection format and the identity for cryptographic key
generation required for a particular protect or access operation.
• Relevant only to the XML configuration files (MapReduce, Hive, Sqoop, and Spark), the
XML infrastructure settings that are used in the structuring of your XML configuration
files and which are referenced elsewhere within the XML itself and sometimes as UDF
parameter values that identify groupings of configuration settings. Examples of this type
of configuration setting include the names given to sets of cryptographic settings
(known as cryptIds) and names given to sets of authentication/authorization settings
(known as authIds).
Domain Name
Use the Domain Name (direct) setting to specify the security district domain name of the
relevant Voltage SecureData Server. This is the part of the Voltage SecureData Server
hostname without the voltage-pp-0000 prefix.
This value, when specified, is used to construct three other optional configuration settings of
which the domain name is a part:
• REST Hostname:
voltage-pp-0000.<domainName>
If any of these optional configuration settings are specified individually, those settings will be
used instead of settings constructed from the domain name setting.
In the XML configuration file vsconfig.xml, this setting is specified using the domainName
attribute (and its value) of the secureDataServer element. The default version of the
configuration file vsconfig.xml provided with the Hadoop Developer Templates specifies
the following demonstration security district domain, hosted by Micro Focus:
dataprotection.voltage.com
NOTE: In the XML configuration file vsconfig.xml, you must specify either a domain name
(as described here) or a hostname (as described in the following section), but not both. If
both are specified, the hostname will be used and the domain name will be ignored.
To see this setting in the context of the XML configuration file vsconfig.xml, see “Attribute
Values in vsconfig.xml” (page 3-37).
Hostname
Use the Hostname (direct) setting to specify the full server hostname of the relevant Voltage
SecureData Server. This typically includes the standard voltage-pp-0000 prefix, required
when making requests to the Voltage SecureData Key Server.
This value, when specified, is used to construct three other optional configuration settings of
which the hostname is a part:
• REST Hostname:
<hostName>
If any of these optional configuration settings are specified individually, those settings will be
used instead of settings constructed from the hostname setting.
In the XML configuration file vsconfig.xml, this setting is specified using the hostName
attribute (and its value) of the secureDataServer element. For example, you could change
the default version of the configuration file vsconfig.xml provided with the Hadoop
Developer Templates to use this alternate approach to specify the full hostname of the
demonstration security district domain hosted by Micro Focus:
voltage-pp-0000.dataprotection.voltage.com
NOTE: In the XML configuration file vsconfig.xml, you must specify either a hostname (as
described here) or a domain name (as described in the previous section), but not both. If
both are specified, the hostname will be used and the domain name will be ignored.
To see this setting in the context of the XML configuration file vsconfig.xml, see “Attribute
Values in vsconfig.xml” (page 3-37).
In the XML configuration file vsconfig.xml, this setting is specified using the installPath
attribute (and its value) of the simpleAPI element.
In the Java Properties configuration file vsnifi.properties (NiFi), this setting is specified
using the property name simpleapi.install.path (and its value).
In both types of configuration files, the default value for this setting is:
/opt/voltage/simpleapi
If you have installed the Simple API in a different directory on the data nodes in your Hadoop
cluster, you must change this configuration setting accordingly.
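For example, one way to make this change in place is with a command such as the following (a sketch; it assumes the attribute appears exactly as shipped and that /custom/simpleapi is your installation directory):

sed -i 's#installPath="/opt/voltage/simpleapi"#installPath="/custom/simpleapi"#' <install-dir>/config/vsconfig.xml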
NOTE: Calls to the Web Service API (the REST API), at least when using the provided Web
Service clients, such as the calls to the REST API in the various Developer Templates, also
use the trustStore directory from the Simple API install location to initialize the trusted
root certificates used when connecting to the remote Web Service server. Therefore, you
must specify this attribute value even if you are not using the Simple API.
To see this setting in the context of the XML configuration file vsconfig.xml, see “Attribute
Values in vsconfig.xml” (page 3-37).
To see this setting in the context of the Java Properties configuration file
vsnifi.properties, see “vsnifi.properties” (page 3-45).
In the XML configuration file vsconfig.xml, this setting is optional and may be specified
using the policyUrl attribute (and its value) of the simpleAPI element. When this setting is
not specified, the client policy file URL is constructed using either the domainName or the
hostName attribute value of the secureDataServer element, as follows:
https://voltage-pp-0000.<domainName>/policy/clientPolicy.xml
or
https://<hostName>/policy/clientPolicy.xml
In the XML configuration file vsconfig.xml, as shipped for use with the Hadoop Developer
Templates, a default Simple API Policy URL is constructed from the specified domain name:
dataprotection.voltage.com
Use this setting if your client policy file URL cannot be constructed from the domainName or
the hostName attribute value, whichever one you supply.
In the Java Properties configuration file vsnifi.properties, this setting is specified using
the name simpleapi.policy.url (and its value). In this Java Properties configuration file, as
shipped, the value for this setting is (shown on two lines for improved readability):
https://voltage-pp-0000.dataprotection.voltage
.com/policy/clientPolicy.xml
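For example, if your client policy file were served from a host that does not follow either constructed form (the host shown below is hypothetical), the setting might be sketched as follows:

In vsconfig.xml:
    <simpleAPI installPath="/opt/voltage/simpleapi"
               policyUrl="https://policy.example.com/policy/clientPolicy.xml" />

In vsnifi.properties:
    simpleapi.policy.url=https://policy.example.com/policy/clientPolicy.xml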
To see this setting in the context of the XML configuration file vsconfig.xml, see “Attribute
Values in vsconfig.xml” (page 3-37).
To see this setting in the context of the Java Properties configuration file
vsnifi.properties, see “vsnifi.properties” (page 3-45).
• none - Do not cache downloaded items. This setting is not recommended for
production environments.
Nevertheless, using file-based caching on the local file system of each data
node may still result in fewer network interactions with the Voltage
SecureData Server. New processes launched on a given data node will be
able to use (file-based) cached cryptographic keys, which is not the case
when in-memory caching is used.
In the XML configuration file vsconfig.xml, this setting is optional and may be specified
using the cacheType attribute (and its value) of the simpleAPI element.
In the Java Properties configuration file vsnifi.properties, this setting is specified using
the name simpleapi.cache.type (and its value).
In both types of configuration files, the default value for this setting is memory (note that
although it is not required, in the provided XML configuration file vsconfig.xml, the default
value of memory is specified explicitly). For more information about the in-memory and file-
based caching modes used by the Simple API, see the Voltage SecureData Simple API
Developer Guide.
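For example, the shipped default of in-memory caching can be made explicit as follows (substituting the value none would disable caching, which is not recommended for production environments):

In vsconfig.xml:
    <simpleAPI installPath="/opt/voltage/simpleapi" cacheType="memory" />

In vsnifi.properties:
    simpleapi.cache.type=memory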
To see this setting in the context of the XML configuration file vsconfig.xml, see “Attribute
Values in vsconfig.xml” (page 3-37).
To see this setting in the context of the Java Properties configuration file
vsnifi.properties, see “vsnifi.properties” (page 3-45).
In the XML configuration file vsconfig.xml, this setting is optional and may be specified
using the fileCachePath attribute (and its value) of the simpleAPI element.
In the Java Properties configuration file vsnifi.properties, this setting is optional and
may be specified using the name simpleapi.file.cache.path (and its value).
In both types of configuration files, if this setting is not specified when file-based caching is
being used, the default location for caching is a subdirectory named cache directly subordinate
to the Simple API’s installation location (<installDir>/cache). If this setting is specified
when file-based caching is not being used, its value is ignored. For more information about this
setting’s path value when file-based caching is being used, see the Voltage SecureData Simple
API Developer Guide.
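For example, assuming file-based caching has been enabled using the Simple API Cache Type setting, a cache directory (the path shown is hypothetical) might be specified as follows:

In vsconfig.xml:
    <simpleAPI installPath="/opt/voltage/simpleapi"
               fileCachePath="/var/opt/voltage/cache" />

In vsnifi.properties:
    simpleapi.file.cache.path=/var/opt/voltage/cache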
CAUTION: If file-based caching is configured for the Simple API, the user account under
which the job is running must have permissions to write to (and create, if necessary) the
specified caching directory. Make sure to set the directory permissions appropriately, based
on the user account(s) under which your Hadoop jobs will be run. Incorrect permissions can
cause errors, such as VE_ERROR_FILE_CREATE_DIR, when the job attempts to create the
Simple API LibraryContext object.
To see this setting in the context of the XML configuration file vsconfig.xml, see “Attribute
Values in vsconfig.xml” (page 3-37).
To see this setting in the context of the Java Properties configuration file
vsnifi.properties, see “vsnifi.properties” (page 3-45).
considered the lower limit of being cryptographically secure are allowed to be protected and
accessed anyway. For more information about this behavior, see the Voltage SecureData
Simple API Developer Guide.
In the XML configuration file vsconfig.xml, this setting is optional and may be specified
using the shortFPE attribute (and its value) of the simpleAPI element.
In the Java Properties configuration file vsnifi.properties, this setting is optional and may
be specified using the name simpleapi.shortfpe (and its value).
In both types of configuration files, if this setting is not specified, its default value is false.
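For example, to explicitly allow the protection and access of short FPE values (matching the behavior of earlier releases described in the important note that follows), the setting might be sketched as follows:

In vsconfig.xml:
    <simpleAPI installPath="/opt/voltage/simpleapi" shortFPE="true" />

In vsnifi.properties:
    simpleapi.shortfpe=true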
IMPORTANT: In previous versions of the Hadoop Developer Templates, this value was not
configurable without changing the Developer Templates source code. Further, its default
value in the source code allowed the protection and access of short FPE values (equivalent
to a setting of true in version 4.1 and later). Using these different default settings on the
same data could cause protection and access operations that previously succeeded to fail
after upgrading.
To see this setting in the context of the XML configuration file vsconfig.xml, see “Attribute
Values in vsconfig.xml” (page 3-37).
To see this setting in the context of the Java Properties configuration file
vsnifi.properties, see “vsnifi.properties” (page 3-45).
In the XML configuration file vsconfig.xml, this setting is optional and may be specified
using the batchSize attribute (and its value) of the webService element. If this setting is not
specified, its default value is 2000.
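For example, a larger batch size (the value 5000 is purely illustrative) might be specified as follows:

    <webService batchSize="5000" />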
NOTE: Not all Hadoop Developer Templates support batching. And even for those
templates that do support batching, this configuration setting is currently only used by the
Sqoop Developer Template and the RDD and Dataset variants of the Spark Developer
Template. Although other Hadoop Developer Templates, such as MapReduce and NiFi, also
perform batching, their batch sizes are specified directly in the source code.
To see this setting in the context of the XML configuration file vsconfig.xml, see “Attribute
Values in vsconfig.xml” (page 3-37).
REST Hostname
Use the REST Hostname (direct) setting to specify the full hostname of the REST server that
will be used for REST API operations.
In the XML configuration file vsconfig.xml, this setting is optional and may be specified
using the restHostName attribute (and its value) of the webService element. When this
setting is not specified, the REST hostname is either constructed from the domainName
attribute value or taken directly from the hostName attribute value of the
secureDataServer element, as follows:
voltage-pp-0000.<domainName>
or
<hostName>
In the XML configuration file vsconfig.xml, as shipped for use with the Hadoop Developer
Templates, a default REST hostname is constructed from the specified domain name:
voltage-pp-0000.dataprotection.voltage.com
Use this setting if your REST hostname cannot be constructed from the domainName attribute
value or is not the same as the hostName attribute value, whichever one you supply.
In the Java Properties configuration file vsnifi.properties, this setting is required when
the REST API is used and is specified using the name rest.hostname (and its value). Its
default setting in this Java Properties file is:
voltage-pp-0000.dataprotection.voltage.com
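For example, a REST hostname that cannot be constructed from the other settings (the host shown is hypothetical) might be specified as follows:

In vsconfig.xml:
    <webService restHostName="rest-gateway.example.com" />

In vsnifi.properties:
    rest.hostname=rest-gateway.example.com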
To see this setting in the context of the XML configuration file vsconfig.xml, see “Attribute
Values in vsconfig.xml” (page 3-37).
To see this setting in the context of the Java Properties configuration file
vsnifi.properties, see “vsnifi.properties” (page 3-45).
Depending on your purposes, transparently returning the protected value (rather than an
error) in this case may still allow analytics to be performed. When this behavior is enabled and
an authentication/authorization error is detected during access, the operation returns the
protected value instead of throwing an exception. Also, the trapped and ignored exception is
logged as a warning in the Hadoop job logs for auditing/debugging purposes. This logging
allows you to see whether this behavior was triggered for the cryptographic API call(s).
When this behavior is enabled, how the Developer Template code determines when an
applicable authentication/authorization failure has occurred depends on the API being used, as
follows:
• REST API - For authentication failures, when the HTTP status code 401
(UNAUTHORIZED) is returned. For authorization failures, when the HTTP status code
403 (FORBIDDEN) is returned, except when the error code in the JSON response is
30635 (TOKENIZATION_IDENTITY_MISMATCH).
When any of these error conditions occur during access, the setting of this configuration value
is checked to determine the resulting behavior:
• When set to true, the error is trapped and logged as a warning, and the protected
value is returned as the result of the access operation.
• When set to false, the authentication/authorization error is logged and the Hadoop
job fails with an exception.
In the XML configuration file vsconfig.xml, this setting is optional and may be specified
using the returnProtectedValueOnAccessAuthFailure attribute (and its value) of the
general element.
In the Java Properties configuration file vsnifi.properties, this setting is optional and may
be specified using the name return.protected.value.on.access.auth.failure (and
its value).
In both types of configuration files, if this setting is not specified, its default value is false.
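For example, to enable the return-protected-value-on-auth-failure behavior, the setting might be sketched as follows:

In vsconfig.xml:
    <general returnProtectedValueOnAccessAuthFailure="true" />

In vsnifi.properties:
    return.protected.value.on.access.auth.failure=true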
CAUTION: Sometimes an internal server error, which can occur when an LDAP server is
unavailable, will be represented to the client as an authentication/authorization failure that is
indistinguishable from a failure to actually authenticate/authorize the provided username/
password/identity. When the return-protected-value-on-auth-failure behavior is enabled,
the protected value is returned as a successful result without any indication of the true
nature of the internal server error, even in the Hadoop job logs. You will need to determine
whether this behavior is acceptable for your expected use of the data and the likelihood of
this type of error based on your LDAP redundancy and network stability.
To see this setting in the context of the XML configuration file vsconfig.xml, see “Attribute
Values in vsconfig.xml” (page 3-37).
To see this setting in the context of the Java Properties configuration file
vsnifi.properties, see “vsnifi.properties” (page 3-45).
Product Name
Use the Product Name (direct) setting to specify a custom product name for one or more
Hadoop Developer Template client types, depending upon the context in which it is specified.
The specified custom product name, if any, or the default product name, is included in requests
to the Voltage SecureData Server when using the Simple API to request cryptographic keys
and when making the REST API requests. This information, whether the default value or
(especially) a customized value, can be useful for tracking and reporting purposes because it is
logged for these types of requests. Details for the relevant APIs are as follows:
• Simple API:
Registered with the Simple API as the Product field of the Client Identifier, and
included in cryptographic key requests.
NOTE: The Simple API has the following character restrictions on the product name:
Uppercase and lowercase letters, digits, and the following additional characters:
. ( ) { } [ ] - _
• REST API:
Concatenated with the product version with a slash character (/) between them and
included as the User-Agent field in the REST request header:
User-Agent=<product_name>/<product_version>
In the XML configuration file vsconfig.xml, this setting is optional. It may be specified at two
levels:
• Globally, for all Hadoop components, using the product attribute of the clientIds
element.
• For a specific Hadoop component, using the product attribute of a clientId element.
Any setting specified for a specific Hadoop component will override the global setting, if
present.
In the XML configuration files with names of the form vs<component>.xml, this setting is
optional and may be specified for the Hadoop component indicated by the filename using the
product attribute of the clientId element. Any setting specified in this way will override any
corresponding settings in the XML configuration file vsconfig.xml.
Its default setting at the global level in the XML configuration file vsconfig.xml, as shipped,
is the same as its default setting when no value is specified (either globally or for a specific
Hadoop component):
SecureData-Hadoop-Dev-Templates
In the Java Properties configuration file vsnifi.properties, this setting is optional and may
be specified using the name product.name (and its value). This setting is commented out in
this Java Properties file, resulting in its respective default value (NiFi Developer
Template) being used in Simple API and REST API requests.
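For example, a customized global product name (the name shown is hypothetical and uses only the allowed characters) might be sketched as follows:

In vsconfig.xml:
    <clientIds product="Acme-Claims-Pipeline" />

In vsnifi.properties:
    product.name=Acme-Claims-Pipeline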
These event logs are subsequently processed by the event aggregator on the Voltage
SecureData Server. For more information about event logging and aggregation, see the
Voltage SecureData Server product documentation.
To see this setting in the context of the XML configuration files vsconfig.xml and
vs<component>.xml, see “Attribute Values in vsconfig.xml” (page 3-37) and “Attribute
Values in vs<component>.xml” (page 3-43), respectively.
To see this setting in the context of the Java Properties configuration file
vsnifi.properties, see “vsnifi.properties” (page 3-45).
Product Version
Use the Product Version (direct) setting to specify a custom product version for one or more
Hadoop Developer Template client types, depending upon the context in which it is specified.
The specified custom product version, if any, or the default product version, is included in
requests to the Voltage SecureData Server when using the Simple API to request cryptographic
keys and when making the REST API requests. This information, whether the default value or
(especially) a customized value, can be useful for tracking and reporting purposes because it is
logged for these types of requests. Details for the relevant APIs are as follows:
• Simple API:
Registered with the Simple API as the Version field of the Client Identifier, and
included in cryptographic key requests.
NOTE: The Simple API has the following character restrictions on the product
version:
Uppercase and lowercase letters, digits, and the following additional characters:
. ( ) { } [ ] - _
• REST API:
Concatenated with the product name with a slash character (/) between them and
included as the User-Agent field in the REST request header:
User-Agent=<product_name>/<product_version>
In the XML configuration file vsconfig.xml, this setting is optional. It may be specified at two
levels:
• Globally, for all Hadoop components, using the version attribute of the clientIds
element.
• For a specific Hadoop component, using the version attribute of a clientId element.
Any setting specified for a specific Hadoop component will override the global setting, if
present.
In the XML configuration files with names of the form vs<component>.xml, this setting is
optional and may be specified for the Hadoop component indicated by the filename using the
version attribute of the clientId element. Any setting specified in this way will override any
corresponding settings in the XML configuration file vsconfig.xml.
In the Java Properties configuration file vsnifi.properties, this setting is optional and may
be specified using the name product.version (and its value). This setting is commented out
in this Java Properties file, resulting in no value being provided as the version in
Simple API and REST API requests.
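For example, a customized product version (the value shown is hypothetical) might be sketched as follows:

In vsconfig.xml:
    <clientIds product="Acme-Claims-Pipeline" version="2.3.1" />

In vsnifi.properties:
    product.version=2.3.1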
NOTE: These event logs are subsequently processed by the event aggregator on the
Voltage SecureData Server. For more information about event logging and aggregation, see
the Voltage SecureData Server product documentation.
To see this setting in the context of the XML configuration files vsconfig.xml and
vs<component>.xml, see “Attribute Values in vsconfig.xml” (page 3-37) and “Attribute
Values in vs<component>.xml” (page 3-43), respectively.
To see this setting in the context of the Java Properties configuration file
vsnifi.properties, see “vsnifi.properties” (page 3-45).
• mr - Define one or both custom client ID values for the MapReduce Developer
Template
• hive - Define one or both custom client ID values for the Hive Developer Template
• sqoop - Define one or both custom client ID values for the Sqoop Developer Template
• spark - Define one or both custom client ID values for the Spark Developer Template
If you set a custom component-level product name and/or product version for a particular
Hadoop component, the value(s) you set will override both A) any global product name and/or
product version you set using the clientIds element, and B) the default values that are
automatically provided by the Hadoop Developer Templates (SecureData-Hadoop-Dev-
Templates and <product_version>, respectively).
In the XML configuration file vsconfig.xml, this setting is required in all component-specific
client identifiers and must be specified using the component attribute (and its value) of each
clientId element. Each such clientId element must have one of the four unique names
provided above.
In the XML configuration file vsconfig.xml, as shipped, no individual client identifiers are
defined.
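If you did want to define one, a hypothetical component-specific client identifier for the Hive Developer Template, overriding any global values, might be sketched as follows:

    <clientIds>
        <clientId component="hive"
                  product="Acme-Hive-Protect"
                  version="2.3.1" />
    </clientIds>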
To see this setting in the context of the XML configuration file vsconfig.xml, see “Attribute
Values in vsconfig.xml” (page 3-37).
Identity
Use the Identity (direct) setting to specify an identity for use when deriving cryptographic
keys, including for authorization purposes where appropriate.
In the context of cryptIds, the specified identity can be used with one or more cryptIds,
depending upon whether it is specified as the default identity or a cryptId-specific identity. In
the XML configuration file vsconfig.xml, this cryptId setting is optional at the global level
and at the level of individual cryptIds, but it must be specified at one or both of the
following levels:
• For all cryptIds, using the defaultIdentity attribute of the cryptIds element.
• For an individual cryptId, using the identity attribute of a cryptId element. Any
setting specified for an individual cryptId will override the global setting, if present.
At the global level in the XML configuration file vsconfig.xml, as shipped, this configuration
setting is set as [email protected], which allows the Hadoop Developer Templates to run
successfully using the public-facing Voltage SecureData Server dataprotection, hosted by
Micro Focus.
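For example, a global default identity with a per-cryptId override (the override identity shown is hypothetical) might be sketched as follows:

    <cryptIds defaultIdentity="[email protected]">
        <cryptId name="ssn" format="ssn"
                 identity="[email protected]" />
    </cryptIds>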
In the context of the NiFi Developer Template, the identity is specified interactively on the
Properties tab of the Configure Processor dialog box for the SecureDataProcessor, and
serves as the identity for cryptographic operations performed by that NiFi processor.
In the context of the Kafka-Storm Developer Template, the identity is specified in the Java
Properties configuration file vsauth.properties and serves as a global identity, used for all
cryptographic key derivations for that template. The name portion of the name/value pair is
auth.identity and the default identity provided in this configuration file is:
auth.identity = [email protected]
To see this setting in the context of the XML configuration file vsconfig.xml, see “Attribute
Values in vsconfig.xml” (page 3-37).
To see this setting in the context of the Java Properties configuration file
vsauth.properties, see “Specifying the Location of the XML Configuration Files” (page 3-
47).
API
Use the API (direct) setting to optionally specify the Voltage SecureData API that you want to
perform one or more cryptographic operations. The choices are:
• simpleapi - Use the Simple API for the relevant cryptographic operation(s)
• rest - Use the REST API for the relevant cryptographic operation(s)
In the context of cryptIds, the specified API can be used with one or more cryptIds, depending
upon whether it is specified as the default API or at the level of individual cryptIds:
• For all cryptIds, using the defaultApi attribute of the cryptIds element.
• For an individual cryptId, using the api attribute of a cryptId element. Any setting
specified for an individual cryptId will override the global setting, if present.
If this optional setting is not specified at either level, cryptIds will default to using the Simple
API.
In the XML configuration file vsconfig.xml, as shipped, this configuration setting is set as
simpleapi at the global level.
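For example, a configuration that defaults to the Simple API but uses the REST API for a cryptId associated with an SST format (which can only be used with the REST API) might be sketched as follows:

    <cryptIds defaultIdentity="[email protected]" defaultApi="simpleapi">
        <cryptId name="cc" format="cc-sst-6-4" api="rest" />
    </cryptIds>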
In the context of the NiFi Developer Template, the API is specified interactively (using a drop-
down box) on the Properties tab of the Configure Processor dialog box for the
SecureDataProcessor, and serves as the API choice for cryptographic operations performed
by that NiFi processor.
In the context of the Kafka-Storm Developer Template, the API defaults to the Simple API, but
can be changed by specifying the REST API as an optional fourth parameter (rest) to the
Kafka-Storm script run-storm-topology.
To see this setting in the context of the XML configuration file vsconfig.xml, see “Attribute
Values in vsconfig.xml” (page 3-37).
CryptId Name
Use the CryptId Name (XML infrastructure) setting to specify a name for an individual cryptId.
The provided name will generally appear in the following contexts, thereby identifying the
cryptId to be used when protecting or accessing an item of sensitive data:
• The value of the cryptId attribute for one or more field elements in the XML
configuration file vsconfig.xml.
• The value of the cryptId attribute for one or more field elements in XML
configuration files with names of the form vs<component>.xml.
• A parameter to a UDF in the Hive Developer Template or a UDF variant of the Spark
Developer Template.
In the XML configuration files vsconfig.xml, this setting is required in all individually
specified cryptIds and must be specified using the name attribute (and its value) of each
cryptId element. Each such cryptId element must have a unique name.
In the XML configuration file vsconfig.xml, as shipped, several individual cryptIds are
defined with names that include alpha, date, cc, and ssn.
This setting is not relevant to either the NiFi Developer Template or the Kafka-Storm Developer
Template.
To see this setting in the context of the XML configuration file vsconfig.xml, see “Attribute
Values in vsconfig.xml” (page 3-37).
Format
Use the Format (direct) setting to specify the name of a data protection format to be used
when performing the relevant cryptographic operation. The format name identifies the name of
an FPE or SST format that has been centrally configured using the Voltage SecureData
Management Console.
NOTE: Only the REST API can be used in conjunction with SST formats.
In the context of cryptIds, the specified format is used to protect or access any field associated
with that cryptId. In the XML configuration file vsconfig.xml, this setting is required in all
individually specified cryptIds and must be specified using the format attribute (and its value)
of each cryptId element. As shipped, the XML configuration file vsconfig.xml specifies
cryptIds with several different formats, including Alphanumeric, DATE-ISO-8601, cc-sst-
6-4, and ssn.
NOTE: When specifying a format for a cryptId, note that the format name AES is reserved for
IBSE/AES encryption using the binary Hive UDFs (and, potentially, future AES support in
other Hadoop Developer Templates) and must not be used by your Voltage SecureData
administrator when defining formats using the Management Console. This applies across all
of the Hadoop Developer Templates, including the NiFi Developer Template and the Kafka-
Storm Developer Template.
In the context of the NiFi Developer Template, the format is specified interactively on the
Properties tab of the Configure Processor dialog box for the SecureDataProcessor, and
specifies the data protection format for cryptographic operations performed by that NiFi
processor.
In the context of the Kafka-Storm Developer Template, the format is specified as the required
third parameter to the Kafka-Storm script run-storm-topology.
To see this setting in the context of the XML configuration file vsconfig.xml, see “Attribute
Values in vsconfig.xml” (page 3-37).
In the XML configuration file vsconfig.xml, this setting is optional for individually specified
cryptIds and may be specified using the authId attribute (and its value) of each cryptId
element. If not provided, the authentication/authorization performed in association with this
cryptId will be according to the authentication method and credentials provided in the
authDefault element of the configuration file vsauth.xml.
In the XML configuration file vsconfig.xml, as shipped, several individual cryptIds are
defined without authId attributes, therefore using the default authentication/authorization
information in the configuration file vsauth.xml.
For more information about using the configuration file vsauth.xml to configure
authentication methods and credentials for the Hadoop Developer Templates, see "vsauth.xml"
(page 3-38).
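For example, a cryptId that references a named authId defined in the configuration file vsauth.xml (the name ssn-auth is a hypothetical placeholder), rather than the default authentication/authorization information, might be sketched as follows:

    <cryptId name="ssn" format="ssn" authId="ssn-auth" />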
This setting is not relevant to either the NiFi Developer Template or the Kafka-Storm Developer
Template.
To see this setting in the context of the XML configuration file vsconfig.xml, see “Attribute
Values in vsconfig.xml” (page 3-37).
Translator Class
Use the Translator Class (direct) setting to specify the name of a custom Java translator class
that you can use to translate the data before and/or after the Voltage SecureData
cryptographic processing is performed. You can write and configure your own custom
translator class to perform specific pre- and/or post-processing or you can use one of the
translator classes provided with the Hadoop Developer Templates.
In the XML configuration file vsconfig.xml, this setting is optional for individually specified
cryptIds and may be specified using the translatorClass attribute (and its value) of each
cryptId element. If not provided, no pre- and/or post-processing is performed.
In the XML configuration file vsconfig.xml, as shipped, several individual cryptIds are
defined without translatorClass attributes, therefore not performing any pre- and/or post-
processing.
This setting is not relevant to either the NiFi Developer Template or the Kafka-Storm Developer
Template.
NOTE: If you are using a version of the Simple API prior to 4.3 to protect and access date
formats, you will need to use the included translator class
LegacySimpleAPIDateTranslator for the relevant cryptIds. This is necessary because
older versions of the Simple API did not handle date formats directly, and required pre- and
post-processing using a special internal date/time syntax. By default, this version of the
Hadoop Developer Templates assumes a newer version of the Simple API and does not use
this translator class. For more information about translator classes, see "Data Translation"
(page 3-67).
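For example, assuming an older Simple API version as described in the note above, a date cryptId might reference the included translator class (shown here by its simple class name; the exact class name to specify is covered in the Data Translation section):

    <cryptId name="date" format="DATE-ISO-8601"
             translatorClass="LegacySimpleAPIDateTranslator" />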
To see this setting in the context of the XML configuration file vsconfig.xml, see “Attribute
Values in vsconfig.xml” (page 3-37).
In the XML configuration file vsconfig.xml, this setting is optional for individually specified
cryptIds and may be specified using the translatorInitData attribute (and its value) of
each cryptId element. If not provided, no initialization data will be provided to the specified
translator class, if any.
NOTE: If this value is specified, it will be passed to the init(String initData) method
in the constructed instance of the class specified using the corresponding
translatorClass attribute. This value can be used to initialize the translator instance with
additional advanced configuration settings. For more information about how you would use
it in your translator class implementations, see the definition of method init(String
initData) in the interface StringTranslator and the abstract class
BaseStringTranslator.
In the XML configuration file vsconfig.xml, as shipped, several individual cryptIds are
defined without translatorClass or translatorInitData attributes, therefore not
performing any pre- and/or post-processing.
This setting is not relevant to either the NiFi Developer Template or the Kafka-Storm Developer
Template.
To see this setting in the context of the XML configuration file vsconfig.xml, see “Attribute
Values in vsconfig.xml” (page 3-37).
CryptId Description
Use the CryptId Description (XML infrastructure) setting to specify a description of the
purpose of, or use case for, this cryptId. The description value is for informational purposes only
and has no effect on protect or access behavior.
In the XML configuration file vsconfig.xml, this setting is optional for individually specified
cryptIds and may be specified using the description attribute (and its value) of each
cryptId element.
In the XML configuration file vsconfig.xml, as shipped, several individual cryptIds are
defined without description attributes.
This setting is not relevant to either the NiFi Developer Template or the Kafka-Storm Developer
Template.
To see this setting in the context of the XML configuration file vsconfig.xml, see “Attribute
Values in vsconfig.xml” (page 3-37).
• mr - Define field mappings by CSV column index for the MapReduce Developer
Template
• sqoop - Define field mappings by database column name for the Sqoop Developer
Template
• spark - Define field mappings by index for the Resilient Distributed Dataset (RDD)
and Dataset variants of the Spark Developer Template
The Hadoop Developer Template components that use User Defined Functions (UDFs)
explicitly pass the name of the relevant cryptId as a UDF parameter, thereby eliminating the
need to supply field mappings using a fields element in either the global XML configuration
file vsconfig.xml or in a component-specific XML configuration file (such as vsspark-
rdd.xml). As of this release, such components include the Hive Developer Template and the
DataFrame, Spark SQL, and HiveUDF variants of the Spark Developer Template.
In the XML configuration file vsconfig.xml, this setting is required in all individually specified
sets of fields and must be specified using the component attribute (and its value) of each
fields element. Each such fields element must have one of the three unique names
defined above.
In the XML configuration file vsconfig.xml, as shipped, sets of fields are defined for the
MapReduce (mr) and Sqoop (sqoop) Developer Templates.
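For example, an index-based field mapping for the MapReduce Developer Template, referencing cryptIds defined elsewhere in the same file (the indexes shown are hypothetical), might be sketched as follows:

    <fieldMappings>
        <fields component="mr">
            <field index="0" cryptId="ssn" />
            <field index="3" cryptId="cc" />
        </fields>
    </fieldMappings>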
To see this setting in the context of the XML configuration file vsconfig.xml, see “Attribute
Values in vsconfig.xml” (page 3-37).
Field Index
Use the Field Index (direct) setting to specify the index of a field/column that is subject to a
protect or access operation when the corresponding Hadoop Developer Template is run.
Field index values are integers and are zero-based (0 is the index of the first column or field).
Every field element in the XML configuration file vsconfig.xml must include a field index
or field name (see "Field Name" on page 3-24), but not both.
Field indexes are used for the MapReduce Developer Template and the RDD and Dataset
variants of the Spark Developer Template.
In the XML configuration file vsconfig.xml, this setting is conditionally required in all
individually specified fields for the relevant Hadoop components and can be specified using the
index attribute (and its value) of each field element.
In the relevant component-specific XML configuration files with names of the form
vs<component>.xml, this setting is required and must be specified for the Hadoop
component indicated by the filename using the index attribute (and its value) of each field
element.
Do not specify field mappings (including field indexes) for the same Hadoop component using
both the XML configuration file vsconfig.xml and a component-specific XML configuration
file with a name of the form vs<component>.xml (if field mappings are specified in both files,
the field mappings in the component-specific XML configuration file will be used).
In the XML configuration file vsconfig.xml, as shipped, a set of fields is defined for the
MapReduce (mr) Developer Template that uses field indexes.
In the XML configuration file vsspark-rdd.xml, as shipped, a set of fields is defined for the
RDD and Dataset variants of the Spark Developer Template that uses field indexes.
To see this setting in the context of the XML configuration files vsconfig.xml and
vs<component>.xml, see “Attribute Values in vsconfig.xml” (page 3-37) and “Attribute
Values in vs<component>.xml” (page 3-43), respectively.
Field Name
Use the Field Name (direct) setting to specify the name of a database column that is subject to
protect operations during Sqoop Developer Template import processing (such as when the
component attribute of the enclosing fields element in the XML configuration file
vsconfig.xml is set to sqoop).
Every field element in the XML configuration file vsconfig.xml must include a field name
or field index (see "Field Index" on page 3-23), but not both.
In the XML configuration file vsconfig.xml, this setting is conditionally required in all
individually specified fields for the Sqoop Developer Template and can be specified using the
name attribute (and its value) of each field element.
In the component-specific XML configuration file vssqoop.xml, if used, this setting is required
and must be specified for the Sqoop Developer Template using the name attribute (and its
value) of each field element.
Do not specify field mappings (including field names) for the same Hadoop component using
both the XML configuration file vsconfig.xml and the XML configuration file vssqoop.xml
(if the same field mapping is specified in both files, the field mapping in the component-specific
XML configuration file will be used).
In the XML configuration file vsconfig.xml, as shipped, the set of fields defined for the Sqoop
Developer Template uses field names to specify the database columns subject to protect
operations during Sqoop import processing.
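For example, a name-based field mapping for the Sqoop Developer Template (the column names shown are placeholders) might be sketched as follows:

    <fields component="sqoop">
        <field name="SSN" cryptId="ssn" />
        <field name="CREDIT_CARD" cryptId="cc" />
    </fields>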
To see this setting in the context of the XML configuration files vsconfig.xml and
vs<component>.xml, see “Attribute Values in vsconfig.xml” (page 3-37) and “Attribute
Values in vs<component>.xml” (page 3-43), respectively.
In the XML configuration file vsconfig.xml, this setting is required for each field to be
cryptographically processed and can be specified using the cryptId attribute (and its value)
of each field element.
Do not specify field mappings (which include a cryptId name) for the same Hadoop component
using both the XML configuration file vsconfig.xml and a component-specific XML
configuration file.
As shipped, the XML configuration file vsconfig.xml defines the fields subject to
cryptographic processing, including the relevant cryptIds defined in this same configuration file,
for the MapReduce Developer Template and the Sqoop Developer Template.
As shipped, the XML configuration file vsspark-rdd.xml defines the fields subject to
cryptographic processing, including the relevant cryptIds defined in the XML configuration file
vsconfig.xml, for the RDD and Dataset variants of the Spark Developer Template.
To see this setting in the context of the XML configuration files vsconfig.xml and
vs<component>.xml, see “Attribute Values in vsconfig.xml” (page 3-37) and “Attribute
Values in vs<component>.xml” (page 3-43), respectively.
In the XML configuration file vsauth.xml, as shipped, this configuration setting specifies the
relative HDFS path voltage/config, which resolves to the absolute path /user/
<username>/voltage/config. The relevant user (the user running the Hadoop jobs and
the vsk* scripts) must have appropriate write permissions to create that directory, if necessary,
and to write files into that directory. Due to the sensitive nature of the Kerberos delegation
tokens stored in this directory, your security considerations may dictate that you restrict
permissions on this directory and the files it contains to as few users as possible.
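For example, the shipped relative HDFS path is specified as follows:

    <kerberos delegationTokenHdfsPath="voltage/config" />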
To see this setting in the context of the XML configuration file vsauth.xml, see “Element and
Attribute Values in vsauth.xml” (page 3-40).
Authentication/Authorization Method
Use the Authentication/Authorization Method (direct) setting to specify the method by
which authentication and authorization will be performed with the Voltage SecureData Server
during cryptographic operations. Valid values for this setting are:
• Kerberos
• SharedSecret
• UserPassword
vsconfig.xml and vsauth.xml for all Developer Templates other than NiFi
In the context of the cryptIds defined in the configuration file vsconfig.xml and authIds
defined in the configuration file vsauth.xml, use the Authentication/Authorization
Method (direct) setting to specify:
- and/or -
When Kerberos is used anywhere in your configuration, you must also specify an HDFS
directory path using the delegationTokenHdfsPath attribute of the kerberos element
in the configuration file vsauth.xml. Wherever Kerberos is specified as the value of the
authMethod attribute, you must not include the sharedSecret, username, or password
subordinate elements.
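For example, a sketch of the relevant entries in vsauth.xml when Kerberos is the default method, assuming the shipped delegation token path, might look like this:

    <kerberos delegationTokenHdfsPath="voltage/config" />
    <authDefault authMethod="Kerberos" />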
Wherever SharedSecret is specified as the value of the authMethod attribute, you must
include only the sharedSecret subordinate element (do not include the username or
password subordinate elements).
Wherever UserPassword is specified as the value of the authMethod attribute, you must
include only the username and password subordinate elements (do not include the
sharedSecret subordinate element), the values of which specify, respectively, either:
If you choose SharedSecret, you must also provide the shared secret as the value of the
SharedSecret property on the Properties tab.
If you choose UserPassword, you must also provide the username and password as the
values of the Username and Password properties, respectively, on the Properties tab.
To see this setting in the context of the XML configuration file vsauth.xml, see “Element and
Attribute Values in vsauth.xml” (page 3-40).
Shared Secret
Use the Shared Secret (direct) setting to specify, when applicable, the shared secret to be used
to authenticate cryptographic operations with the Voltage SecureData Server.
In the context of the authIds defined in the configuration file vsauth.xml, include the
sharedSecret element and its value when the authMethod attribute of the containing
authDefault and/or authId element is set to SharedSecret:
<authDefault authMethod="SharedSecret">
<sharedSecret>shared_secret</sharedSecret>
</authDefault>
and/or
<authIds>
<authId name="authId_name" authMethod="SharedSecret">
<sharedSecret>shared_secret</sharedSecret>
</authId>
</authIds>
In the context of the Kafka-Storm Developer Template’s Java Properties configuration file
vsauth.properties, include the auth.sharedSecret property when the auth.method
property is set to SharedSecret:
auth.method = SharedSecret
auth.sharedSecret = <actual shared secret>
In the context of the NiFi Developer Template, on the Properties tab of the Configure
Processor dialog box for the SecureDataProcessor, interactively set the shared secret as
the value of the SharedSecret property on the Properties tab when SharedSecret is set in the
Auth Method drop-down box.
To see this setting in the context of the XML configuration file vsauth.xml, see “Element and
Attribute Values in vsauth.xml” (page 3-40).
To see this setting in the context of the Java Properties configuration file
vsauth.properties, see “Specifying the Location of the XML Configuration Files” (page 3-
47).
Username
Use the Username (direct) setting to specify, when applicable, the username to be used to
authenticate cryptographic operations with the Voltage SecureData Server. When the
username is set to {LOCALUSER}, LDAP + Shared Secret authentication is being used and the
accompanying password must specify a valid shared secret. When the username is set to
anything else, Username and Password authentication is being used and the accompanying
password must specify the (usually LDAP) password for the specified user.
In the context of the authIds defined in the configuration file vsauth.xml, include the
username element and its value when the authMethod attribute of the containing
authDefault and/or authId element is set to UserPassword:
<authDefault authMethod="UserPassword">
<username>username_or_{LOCALUSER}</username>
<password>password_or_sharedsecret</password>
</authDefault>
and/or
<authIds>
<authId name="authId_name_here" authMethod="UserPassword">
<username>username_or_{LOCALUSER}</username>
<password>password_or_sharedsecret</password>
</authId>
</authIds>
In the context of the Kafka-Storm Developer Template’s Java Properties configuration file
vsauth.properties, include the auth.username property when the auth.method
property is set to UserPassword:
auth.method = UserPassword
auth.username = <username_or_{LOCALUSER}>
auth.password = <password_or_sharedsecret>
In the context of the NiFi Developer Template, on the Properties tab of the Configure
Processor dialog box for the SecureDataProcessor, interactively set the username or
{LOCALUSER} as the value of the Username property on the Properties tab when
UserPassword is set in the Auth Method drop-down box.
To see this setting in the context of the XML configuration file vsauth.xml, see “Element and
Attribute Values in vsauth.xml” (page 3-40).
To see this setting in the context of the Java Properties configuration file
vsauth.properties, see “Specifying the Location of the XML Configuration Files” (page 3-
47).
Password
Use the Password (direct) setting to specify, when applicable, the password or shared secret to
be used to authenticate cryptographic operations with the Voltage SecureData Server. When
the username is set to {LOCALUSER}, LDAP + Shared Secret authentication is being used and
this password setting must specify a valid shared secret. When the username is set to anything
else, Username and Password authentication is being used and this password setting must
specify the (usually LDAP) password for the specified user.
In the context of the authIds defined in the configuration file vsauth.xml, include the
password element and its value when the authMethod attribute of the containing
authDefault and/or authId element is set to UserPassword:
<authDefault authMethod="UserPassword">
<username>username_or_{LOCALUSER}</username>
<password>password_or_sharedsecret</password>
</authDefault>
and/or
<authIds>
<authId name="authId_name_here" authMethod="UserPassword">
<username>username_or_{LOCALUSER}</username>
<password>password_or_sharedsecret</password>
</authId>
</authIds>
In the context of the Kafka-Storm Developer Template’s Java Properties configuration file
vsauth.properties, include the auth.password property when the auth.method
property is set to UserPassword:
auth.method = UserPassword
auth.username = <username_or_{LOCALUSER}>
auth.password = <password_or_sharedsecret>
In the context of the NiFi Developer Template, on the Properties tab of the Configure
Processor dialog box for the SecureDataProcessor, interactively set the password or
shared secret as the value of the Password property on the Properties tab when
UserPassword is set in the Auth Method drop-down box.
To see this setting in the context of the XML configuration file vsauth.xml, see “Element and
Attribute Values in vsauth.xml” (page 3-40).
To see this setting in the context of the Java Properties configuration file
vsauth.properties, see “Specifying the Location of the XML Configuration Files” (page 3-
47).
AuthId Name
Use the required AuthId Name (XML infrastructure) setting to specify a name for an authId, as
specified using the authId elements defined within the authIds element. Specify a unique
name for each authId element using its required name attribute (and its value), thereby
creating a named authentication/authorization method that can be referenced within a cryptId
specification using the authId attribute of the cryptId element. This mechanism is used to
optionally associate an individually defined and named authentication/authorization method
and its associated credentials with one or more cryptIds.
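For example, a named authId (the name ssn-auth and the credentials are placeholders) and a cryptId in the configuration file vsconfig.xml that references it might be sketched as follows:

In vsauth.xml:
    <authIds>
        <authId name="ssn-auth" authMethod="SharedSecret">
            <sharedSecret>shared_secret</sharedSecret>
        </authId>
    </authIds>

In vsconfig.xml:
    <cryptId name="ssn" format="ssn" authId="ssn-auth" />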
To see this setting in the context of the XML configuration file vsauth.xml, see “Element and
Attribute Values in vsauth.xml” (page 3-40).
Beginning with version 4.1, the Voltage SecureData for Hadoop Developer Templates that run
in the context of a Hadoop cluster (MapReduce, Hive, Sqoop, and Spark) use the following XML
configuration files instead of the Java Properties configuration files used in previous releases:
• vsconfig.xml
• vsauth.xml
• vs<component>.xml
The information in these configuration files corresponds to the information provided in the
corresponding Java Properties files (vsconfig.properties, vsauth.properties, and
vs<component>.properties) used by these Hadoop Developer Templates in previous
releases, with several important differences, summarized here:
• Information used by specific types of protect and access operations, such as a data
protection format, an identity for cryptographic key generation, and so on, are now
grouped together with an identifying name as a cryptId. This is similar to the alias
grouping concept used in previous versions of the Hive Developer Template and UDF
variants of the Spark Developer Template. CryptIds provide a mechanism for grouping
this type of information, required for all protect and access operations, for use by all of
the relevant Hadoop components.
• Fields subject to protect and access operations are identified using field mappings,
either by:
• Index, such as for a CSV file in the MapReduce Developer Template, or the non-
UDF variants of the Spark Developer Template, or by:
• Name, such as for a database column name in the Sqoop Developer Template.
A field mapping associates the index or name with a corresponding cryptId, which
provides the information necessary to protect or access the type of data in that field.
The XML configuration files used by the Developer Templates are always validated using their
corresponding XSD schema files: vsconfig.xsd, vsauth.xsd, and vscomponent.xsd. The
schema files are usually located in a sub-directory config/schema within the directory for a
particular Developer Template. For example, the XSD schema files for the Kafka Connect
Developer Template are located in the following directory:
<install_dir>/stream/kafka_connect/config/schema
The remainder of this section provides detailed information about these three types of XML
configuration files, in separate sub-sections:
vsconfig.xml
The XML configuration file vsconfig.xml is capable of providing all of the non-
authentication/authorization configuration information for the Hadoop Developer Templates
that run in the context of a Hadoop cluster (MapReduce, Hive, Sqoop, and Spark). In addition to
providing global and component-specific configuration information needed by these templates,
it also introduces the concept of a cryptId (pronounced “cryp-tid”), which encapsulates the
information required to perform protect and access operations on a particular field of data.
Finally, this XML configuration file captures the information needed to associate data fields for
each of the different types of templates with a corresponding cryptId, providing information
about the format of that data, the identity to use when retrieving a cryptographic key for
protecting or accessing that data, the Voltage SecureData API to use, and so on.
The following two sub-sections, High-Level Elements in vsconfig.xml and Attribute Values in
vsconfig.xml, provide detailed information about the XML structure used in this configuration
file and how the configuration information is provided as attribute values, respectively.
NOTE: The schema definition for the configuration file vsconfig.xml is in the following file:
<install_dir>/config/schema/vsconfig.xsd
<vs:configuration schemaVersion="1"
    xmlns:vs="https://www.voltage.com/sd/config"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <secureDataServer attributes only />
    <simpleAPI attributes only />
    <webService attributes only />
    <general attributes only />
    <clientIds optional attributes >
        <clientId attributes only />
    </clientIds>
    <cryptIds optional attributes >
        <cryptId attributes only />
    </cryptIds>
    <fieldMappings>
        <fields one required attribute>
            <field attributes only />
        </fields>
    </fieldMappings>
</vs:configuration>
The remainder of this section provides a summary of the configuration information in each of
the elements of the high-level XML structure of the configuration file vsconfig.xml.
secureDataServer Element
Use the secureDataServer element to specify the relevant Voltage SecureData Server
using one or the other (but not both) of its two attributes: domainName or hostName.
NOTE: If both domainName and hostName are specified, the hostname will be used and
the domain name will be ignored.
If you provide all three of the following configuration settings, which are otherwise built
using the attribute setting(s) of this element, this element is not needed and may be left out
of the XML configuration file vsconfig.xml:
• Simple API Policy URL (page 3-8) - The value of the policyUrl attribute of the
simpleAPI element.
• REST Hostname (page 3-11) - The value of the restHostName attribute of the
webService element.
simpleAPI Element
Use the simpleAPI element to provide configuration information related to the use of the
Simple API: the required attribute installPath and several optional attributes for
controlling Simple API behavior: policyUrl, cacheType, fileCachePath, and
shortFPE.
NOTE: Because the Hadoop Developer Templates use a CryptoFactory class (and
other associated classes) to cache the Simple API LibraryContext and Crypto
instances for re-use, any changes to settings that affect the LibraryContext instance,
such as the install path and the cache and short FPE settings, require that the
CryptoFactory instance be reinitialized.
In the case of Hadoop jobs that launch a new JVM every time they run, such as
MapReduce, Sqoop, and Spark, no additional steps are needed: starting those jobs will
create a new CryptoFactory instance using the new settings in the configuration files
(XML or Java Properties). However, for other Hadoop Developer Templates that use long-
running services that have already initialized the CryptoFactory instance (and thus
the underlying LibraryContext instance), those services will need to be restarted if
you change these Simple API settings in the configuration. Specifically, this includes the
following integrations/jobs:
webService Element
Use the webService element to provide optional configuration information related to the
use of the REST API: the attribute batchSize and the REST API hostname attribute
restHostName.
general Element
Use the general element to provide optional configuration information that applies to all
of the APIs, which is presently the single attribute that controls behavior when an access
operation fails its authentication/authorization step:
returnProtectedValueOnAccessAuthFailure
Use the clientIds and clientId elements to provide optional configuration information
for sending a customized product name and version in requests to the Voltage SecureData
Server when using the Simple API to request cryptographic keys and when making the
REST API requests.
<clientIds product="global product description"
version="global product version">
<clientId component="component designator"
product="component-specific product description"
version="component-specific product version" />
<clientId ... />
</clientIds>
Use the cryptIds and cryptId elements to provide named sets of format/identity/
authentication information, for reference (either directly or through field mappings) when
performing cryptographic processing on specific data values.
The enclosing cryptIds element allows for the definition of default values for the identity
and the API, while the subordinate cryptId elements allow those choices to be overridden,
as well as allowing for the definition of an identifying name for the cryptId, the associated
data protection format, a reference to the associated authentication information, the name
and initialization data of the associated translator Java class, if any, and an optional
informational description.
<cryptIds defaultIdentity="default identity for key derivation"
defaultApi="default API to use">
<cryptId name="cryptId name"
format="data protection format"
identity="identity for key derivation"
authId="authId name"
api="API to use"
translatorClass="translator class name"
translatorInitData="translator initialization data"
description="informational description" />
<cryptId ... />
</cryptIds>
Use the fieldMappings, fields, and field elements to provide information about
which fields/columns are subject to cryptographic operations for a subset of the Hadoop
Developer Template components (MapReduce, Sqoop, and the non-UDF variants of Spark).
The enclosing fieldMappings element contains one or more fields elements, the
component attribute(s) of which identify the relevant Hadoop Developer Template
component for which the enclosed field elements define the fields/columns subject to
cryptographic processing.
Each fields element contains one or more field elements, each of which identifies a
field/column, either by index (using its index attribute and value) or by name (using its
name attribute and value). In both cases, the corresponding cryptId, which provides
information about how to protect and access the specified field/column, is identified using
the cryptId attribute and its value (which maps to the value of the name attribute of the
relevant cryptId).
<fieldMappings>
<fields component="component designator">
<field index="field index, when appropriate"
name="field name, when appropriate"
cryptId="cryptId name" />
<field ... />
</fields>
</fieldMappings>
<cryptIds defaultIdentity="Identity"
          defaultApi="API">
   <cryptId name="CryptId Name"
            format="Format"
            identity="Identity"
            authId="CryptId AuthId Name"
            api="API"
            translatorClass="Translator Class"
            translatorInitData="Translator Initialization Data"
            description="CryptId Description" />
   <cryptId ... />
</cryptIds>
<fieldMappings>
<fields component="Component-Specific Designator for Fields">
<field index="Field Index"
or
name="Field Name"
cryptId="CryptId Name for Field" />
<field ... />
</fields>
<fields ... />
</fieldMappings>
</vs:configuration>
vsauth.xml
The configuration file vsauth.xml provides the authentication/authorization configuration
information for the Hadoop Developer Templates that run in the context of a Hadoop cluster
(MapReduce, Hive, Sqoop, and Spark). It can provide individually named sets of authentication/
authorization information (the method and its associated credentials) that can be referenced
by cryptIds in the XML configuration file vsconfig.xml as well as a default set of
authentication/authorization information for use by cryptIds that do not reference an
individually named set. Finally, for when Kerberos authentication/authorization is used, it
provides a way to specify the HDFS directory where Kerberos delegation tokens will be stored.
The following two sub-sections, High-Level Elements in vsauth.xml and Element and Attribute
Values in vsauth.xml, provide detailed information about the XML structure used in this
configuration file and how the configuration information is provided as element and attribute
values, respectively.
NOTE: The schema definition for the configuration file vsauth.xml is in the following file:
<install_dir>/config/schema/vsauth.xsd
<kerberos delegationTokenHdfsPath="HDFS path" />
The remainder of this section provides a summary of the configuration information in each of
the elements of the high-level XML structure of the configuration file vsauth.xml.
kerberos Element
Use the kerberos element to specify the HDFS directory in which Kerberos delegation
tokens issued by the Voltage SecureData Server for Kerberos-authenticated users will be
stored. This element is required whenever any of the authMethod attributes (default or
otherwise) is set to Kerberos; otherwise it is ignored.
Use the authDefault element and its credential subordinate elements to provide a default
authentication/authorization method to be used when a particular cryptId does not include
an authId attribute. Depending on the chosen authentication/authorization method, there
will either be zero (Kerberos), one (SharedSecret), or two (UserPassword)
subordinate elements expected.
<authDefault authMethod="authentication/authorization method">
   No subordinate elements
   or
   <sharedSecret>shared secret</sharedSecret>
   or
   <username>username</username>
   <password>password</password>
</authDefault>
Use the authIds element and its subordinate (one or more) authId elements (and their
credential subordinate elements, when applicable) to provide a set of named
authentication/authorization method/credential pairings. Each authId element defines a
name attribute, an authMethod attribute, and depending on the value of the latter
attribute, zero (Kerberos), one (SharedSecret), or two (UserPassword) subordinate
elements. The value of the name attribute may be provided as the value of the authId
attribute of one or more cryptId elements in the configuration file vsconfig.xml.
<authIds>
<authId name="authId name"
authMethod="authentication/authorization method">
No subordinate elements
or
<sharedSecret>shared secret</sharedSecret>
or
<username>username</username>
<password>password</password>
</authId>
<authId ... />
</authIds>
</authDefault>
<authIds>
<authId name="AuthId Name"
authMethod="Authentication/Authorization Method"/>
No subordinate elements
or
<sharedSecret>Shared Secret</sharedSecret>
or
<username>Username</username>
<password>Password</password>
</authId>
<authId ... />
</authIds>
</vs:authentication>
vs<component>.xml
The set of XML configuration files with names of the form vs<component>.xml provide an
alternative mechanism for providing certain types of component-specific configuration
information for the Hadoop Developer Templates that run in the context of a Hadoop cluster
(MapReduce, Hive, Sqoop, and Spark). Valid values for <component> are the following:
• hive - Provide the clientId element for the Hive Developer Template in the
configuration file vshive.xml. Note that because the relevant field and
cryptId are specified as UDF parameters, the fields element is not
relevant to the Hive Developer Template and should not be specified in
the configuration file vshive.xml.
• sqoop - Provide the clientId and/or fields elements for the Sqoop
Developer Template in the configuration file vssqoop.xml.
• spark-rdd - Provide the clientId and/or fields elements for the RDD and
Dataset variants of the Spark Developer Template in the configuration
file vsspark-rdd.xml.
• spark-udf - Provide the clientId element for the UDF variants of the Spark
Developer Template in the configuration file vsspark-udf.xml. Note
that because the relevant field and cryptId are specified as UDF
parameters, the fields element is not relevant to the UDF variants of
the Spark Developer Template and should not be specified in the
configuration file vsspark-udf.xml.
Use this alternative mechanism when you prefer to provide component-specific settings in the
smaller component-specific XML configuration files instead of providing them together in the
larger shared XML configuration file vsconfig.xml. If you do provide one or more
component-specific XML configuration files for relevant Hadoop components, do not also
provide clientId and/or fields elements for those components in the shared XML
configuration file vsconfig.xml. For example, if you provide a fields element in the
component-specific XML configuration file vsmr.xml, do not provide a fields element in the
shared XML configuration file with its component attribute set to mr. If you do so by mistake,
the settings in component-specific XML configuration files will override the same settings in the
shared XML configuration file vsconfig.xml.
NOTE: The schema definition for XML configuration files with names of the form
vs<component>.xml is in the following file:
<install_dir>/config/schema/vscomponent.xsd
<clientId attributes only />
<fields>
   <field attributes only />
</fields>
</vs:componentConfiguration>
The remainder of this section provides a summary of the configuration information in each of
the elements of the high-level XML structure of configuration files with names of the form
vs<component>.xml.
clientId Element
Use the clientId element to provide optional configuration information for sending a
customized product name and/or version in requests to the Voltage SecureData Server
when using the Simple API to request cryptographic keys and when making the REST API
requests.
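As an illustrative sketch only, assuming the same product and version attributes used by the clientId elements in the shared configuration file vsconfig.xml (the component is implied by the component-specific file itself), the element might look as follows:

<clientId product="component-specific product description"
          version="component-specific product version" />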
Use the fields and field elements to provide optional configuration information about
which fields/columns are subject to cryptographic operations for a subset of the relevant
Hadoop Developer Template components (MapReduce, Sqoop, and the non-UDF variants
of Spark).
The relevant component-specific XML configuration files may contain a single fields
element that defines the fields/columns subject to cryptographic processing for the
component associated with the component-specific XML configuration file in which it
appears.
The fields element contains one or more field elements, each of which identifies a field/
column, either by index (using its index attribute and value) or by name (using its name
attribute and value). In both cases, a corresponding cryptId in the shared configuration file
vsconfig.xml, which provides information about how to protect and access the specified
field/column, is identified using the cryptId attribute and its value (which maps to the
value of the name attribute of the relevant cryptId).
<fields>
<field index="field index, when appropriate"
or
name="field name, when appropriate"
cryptId="cryptId name" />
<field ... />
</fields>
As delivered, the component-specific approach for XML configuration files is illustrated for the
RDD and Dataset variants of the Spark Developer Template in the XML configuration file
vsspark-rdd.xml. This file defines fields 7, 8, 9, and 10 as subject to cryptographic
processing using the cryptIds alpha, date, cc, and ssn, respectively (defined in the shared
configuration file vsconfig.xml):
<fields>
<field index = "7" cryptId = "alpha"/>
<field index = "8" cryptId = "date"/>
<field index = "9" cryptId = "cc"/>
<field index = "10" cryptId = "ssn"/>
</fields>
NOTE: In previous releases of the Hadoop Developer Templates, there were two different
component-specific Java Properties configuration files for the Spark Developer Template:
vsspark-rdd.properties and vsspark-udf.properties. The latter of these was
used to define aliases (previously, aliases served the same role that cryptIds do now: named
bundles of format/identity/auth information.) UDF calls in both the Hive Developer
Template and the UDF variants of the Spark Developer Template now specify cryptId names
as a UDF parameter instead of specifying a component-specific alias name. This direct
specification of the cryptId (always configured in the shared configuration file
vsconfig.xml for use by all Hadoop components) as a UDF parameter eliminates the
need for ever specifying fields for these UDF-based components, either in the shared
configuration file vsconfig.xml or in the relevant component-specific configuration files
(vshive.xml or vsspark-udf.xml).
Java Properties Configuration Files
The NiFi Developer Template uses a single Java Properties configuration file:
NOTE: The information required for protect and access operations, including cryptographic
settings such as the data protection format and the identity, as well as the authentication
method and credentials, is entered in the NiFi user interface for the
SecureDataProcessor.
The Java Properties configuration file used by the NiFi Developer Template must conform to
the following requirements:
• Any parameter name and value pairs spanning multiple lines must use the backslash
character (\) to indicate line continuation.
• Lines beginning with a hash character (#) are treated as comments and not processed.
• The first parameter value in most of the configuration files provides a version number
for the configuration file. For example, the first parameter in the configuration file
vsauth.properties is auth.config.version, with its value set to 1:
auth.config.version = 1
For more information about configuring the NiFi Developer Template, see "Processor Classes
for the NiFi Developer Template" (page 8-8) and "Configuring the Properties of the NiFi
SecureDataProcessor" (page 8-12).
The remainder of this section provides detailed information about this Java Properties
configuration file.
vsnifi.properties
The Java Properties configuration file vsnifi.properties, used by the NiFi Developer
Template, contains the same general configuration values as the XML configuration file
vsconfig.xml used by the other Developer Templates. It contains extensive comments that
explain each value. Without those comments, the available configuration values are shown
below, with blue links to the generic explanation of each value in the section "Configuration
Settings" (page 3-5).
config.version = 1
simpleapi.policy.url = Simple API Policy URL
simpleapi.install.path = Simple API Install Path
simpleapi.cache.type = Simple API Cache Type
simpleapi.file.cache.path = Simple API File Cache Path
simpleapi.shortfpe = Simple API Short FPE Behavior
rest.hostname = REST Hostname
product.name = Product Name
product.version = Product Version
return.protected.value.on.access.auth.failure =
Authentication/Authorization Failure on Access Behavior
The remaining configuration values, related to protect or access operations, such as identifying
the cryptographic operation as either protect or access, the data protection format, the identity,
and so on, and including authentication/authorization information, are provided for the NiFi
Developer Template on the Properties tab of the Configure Processor dialog box for the
relevant SecureDataProcessor. For more information, see "Configuration Settings for the NiFi
Developer Template" (page 8-10).
NOTE: Do not change the value of the config.version property. Leave it set to 1.
Specifying the Location of the XML Configuration Files
The Hadoop Developer Templates support three methods for using the XML configuration
files vsconfig.xml, vsauth.xml, and Hadoop-component-specific configuration files with
names of the form vs<component>.xml:
• For any of the Hadoop components (MapReduce, Hive, Sqoop, and Spark), modify these
XML configuration files as necessary in the local file system directory <install_dir>/
config and then use the script copy-sample-data-to-hdfs (and possibly the
script run-spark-prepare-job) to copy those files to the HDFS directory /user/
<user>/voltage/config.
• Alternatively, specify one or more XML configuration file locations using well-known
property names with the -D generic option for Hadoop commands. This alternative
method is recommended only for the yarn command used to start the MapReduce
Developer Template. For more information, see "-D Generic Option to Specify a Property
Value" (page 3-47).
• Alternatively, specify one or more configuration file locations using well-known property
names in the Java Properties file config-locator.properties, packaged into the
JAR file vsconfig.jar. This alternative method can be used for any of the Hadoop
components. For more information, see "Config-Locator Properties File Packaged as a
JAR File" (page 3-48).
CAUTION: One of the important steps that occurs in the script copy-sample-data-to-
hdfs is the setting of permissions for the primary Hadoop Developer Template XML
configuration files after they are copied to HDFS. The permissions are set such that only the
relevant Hadoop user can read these files. Because it contains sensitive credentials as
plaintext, this is particularly important for the authentication/authorization configuration file,
normally named vsauth.xml. If you choose to use XML configuration files at a different
location, you must take similar measures to ensure that only authorized users can read them.
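For example, if you copy the configuration files to a different HDFS location yourself instead of using the script, commands along the following lines (the target path is illustrative, matching the example path used later in this section) copy the files and restrict read access to the owning user:

hdfs dfs -mkdir -p /apps/mapred/voltage/config
hdfs dfs -put vsconfig.xml vsauth.xml /apps/mapred/voltage/config/
hdfs dfs -chmod 600 /apps/mapred/voltage/config/vsauth.xml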
-DVOLTAGE_CONFIG_FILE=/apps/mapred/voltage/config/vsconfig.xml
The code in class HDFSConfigLoader will look for the following three names, case-sensitive,
each associated with one of the three primary configuration files, shown here with their default
names:
Expected Name            Value Specifies:

VOLTAGE_CONFIG_FILE      Full path to the general XML configuration file.
                         Normally: vsconfig.xml

VOLTAGE_AUTH_FILE        Full path to the authentication/authorization XML
                         configuration file.
                         Normally: vsauth.xml

VOLTAGE_COMP_FILE        Full path to the component-specific XML
                         configuration file, if any.
                         Normally: vs<component>.xml
To use this method of specifying alternate HDFS locations for the configuration files used for
the MapReduce Developer Template, provide two extra -D parameters to the yarn command
in script files such as run-mr-protect-job and run-mr-access-job. For example:
yarn jar ... \
-DVOLTAGE_CONFIG_FILE=/apps/mapred/voltage/config/vsconfig.xml \
-DVOLTAGE_AUTH_FILE=/apps/mapred/voltage/config/vsauth.xml \
-libjars ...
For information about the precedence when checking for various methods of specifying
configuration file locations, see "Precedence When Checking for XML Configuration File
Locations" (page 3-50).
The configuration loader code in class HDFSConfigLoader automatically looks for this
optional properties file in the job classpath, and if found, reads the alternate configuration file
locations from that file.
3. Set values for one or more of the following property names in order to specify an
alternate location for the corresponding primary Hadoop Developer Template XML
configuration file:
Expected Property Name     Property Value Specifies:

config.hdfs.location       Full path to the general XML configuration file.
                           Normally: vsconfig.xml

auth.hdfs.location         Full path to the authentication/authorization
                           XML configuration file.
                           Normally: vsauth.xml

comp.hdfs.location         Full path to the component-specific XML
                           configuration file, if any.
                           Normally: vs<component>.xml
4. Save your changes and close the Java Properties file config-locator.properties.
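As an illustration only, using the property names from the table above and HDFS paths patterned on the example path used earlier in this section, a completed config-locator.properties file might look as follows:

# Alternate HDFS locations for the primary XML configuration files
config.hdfs.location = /apps/mapred/voltage/config/vsconfig.xml
auth.hdfs.location = /apps/mapred/voltage/config/vsauth.xml
# Uncomment if a component-specific configuration file is also used:
# comp.hdfs.location = /apps/mapred/voltage/config/vsspark-rdd.xml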
This Maven POM file builds the JAR file vsconfig.jar for reference by the Hadoop
jobs in the directory <install_dir>/configlocator/target. It also copies this
JAR file to the directories <install_dir>/bin and <install_dir>/spark/lib.
7. Reference the JAR file vsconfig.jar when running a Hadoop job or when defining a
UDF. Specifically, depending on the type of the Hadoop job, either update the
-libjars line to include this JAR file or add this JAR file to the using line, as follows:
MapReduce:
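The exact yarn command depends on your script; as a sketch patterned on the yarn example shown earlier in this chapter, the change amounts to appending vsconfig.jar to the -libjars list:

yarn jar ... \
    -libjars ...,vsconfig.jar \
    ...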
Sqoop:
sqoop import \
-libjars ...,vsconfig.jar \
--username $DATABASE_USERNAME \
-P \
--connect jdbc:mysql://$DATABASE_HOST/$DATABASE_NAME \
--table $TABLE_NAME \
--jar-file voltage-hadoop.jar \
--class-name com.voltage.securedata.hadoop.sqoop.Sqoo... \
--target-dir voltage/protected-sqoop-import
Hive:
You will need to copy the JAR file vsconfig.jar to the specified location in HDFS.
You should also include the JAR file vsconfig.jar with the other two JAR files
(vibesimplejava.jar and voltage-hadoop.jar) that you must manually copy
into the appropriate hive/lib directory (the parent classpath for the Hive service) on
all data nodes of your Hadoop cluster. For more information, see “Failure to Copy JAR
Files to the hive/lib Directory on All Data Nodes” (page 12-4).
1. The HDFS location specified in the Hadoop job configuration as one of the following
variables:
2. The HDFS location specified as a -D system property with one of the following names:
3. The HDFS location specified as an environment variable with one of the following
names:
4. The HDFS location specified as a property in the Java Properties file config-
locator.properties, packaged in the JAR file vsconfig.jar, with one of the
following names:
Other Approaches to Providing Configuration Settings
This approach is similar to the approach taken in the samples, but using a local file
system rather than HDFS. The disadvantage is that you will likely need a configuration
management tool to manage the copying of this file to all of the data nodes.
You will need a way to distribute the required authentication/authorization credentials to the
data nodes running jobs that use SecureData operations. You might decide to use some of the
examples described in this section, or you might decide to use a completely different approach,
depending on your specific integration use-case and Hadoop environment.
For example, even if you use a configuration file approach, this does not necessarily have to be
formatted as a Java Properties file. You might decide to use XML or some other syntax for
specifying configuration settings in one or more files.
Shared Integration Architecture
This section describes the integration architecture shared by the Hadoop Developer
Templates. Some of the Java code that implements this architecture is shared by all of the
Hadoop Developer Templates and some of it is used only by the Hadoop Developer Templates
that operate within the context of a Hadoop cluster.
The following Java packages are part of this shared integration architecture:
These classes provide some general purpose utility and support classes that can be
useful in any of the templates.
These classes provide functionality for accessing configuration information that is
common across the templates.
These classes provide functionality for converting data between non-string data types
and the string formats expected by the SecureData APIs.
These classes provide functionality for translating data to and from the string formats
expected by the SecureData APIs.
These classes provide an abstraction layer for the Voltage SecureData APIs (the Simple
API and the REST API) used by the Hadoop Developer Templates for cryptographic
processing.
Package Contents
This package defines the following set of classes in .java source files of the same name:
• Base64 - This class provides methods for Base64 encoding and decoding.
• FileUtils - This class provides methods for reading from a text file, which is used
when reading from configuration files.
• Sanitizer - This class provides a method for sanitizing log messages, which is useful
for mitigating Log Forging security vulnerabilities.
NOTE: This class is used to scrub all log messages to remove any newline/tab
characters that may have come from user-provided input. While this functionality
mitigates the forging of illegitimate log messages by ending the current log line
and starting a new one, it does not provide any protection against other types of
logging attacks. These other types of logging attacks, which include cross-site
scripting (XSS), SQL injection, and so on, must be mitigated, as necessary, by any
downstream processes that read and/or display the logs. For example, the
Hadoop log viewer Web UI automatically performs protection against XSS by
escaping any HTML/Javascript characters in the log messages before rendering
them in HTML responses.
If your job logs are being processed by one or more custom downstream systems,
make sure those systems perform the necessary mitigation (escaping, scrubbing,
and so on) as appropriate to their context(s).
On Linux platforms, the Simple API uses OpenSSL for secure connections to the Key
Server. OpenSSL uses a trustStore directory that contains trusted root certificates as
individual .pem files, each of which has a hash-named symbolic link that allows for fast
lookups.
Clients using the REST API must configure their local TLS transport mechanism to trust
the required root certificates. Exactly how this is done depends on which REST library
you are using. As delivered, the template code for the REST API uses the Apache HTTP
Client library, which uses the JVM truststore for establishing root certificate trust.
Using this approach, adding a new trusted root certificate for use in the Developer
Templates is exactly the same as just adding a new trusted root certificate for the
Simple API. After the new certificate is added to the Simple API trustStore directory
(and that directory is re-hashed for use by the Simple API), the new certificate is
automatically added to the JVM truststore whenever any of the Developer Template
samples are executed.
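For example, on a Linux data node you might add a new root certificate and regenerate the hash-named links with commands along these lines (the trustStore path is illustrative, and older OpenSSL releases provide the c_rehash script instead of the openssl rehash subcommand):

cp MyRootCA.pem /opt/voltage/simpleapi/trustStore/
openssl rehash /opt/voltage/simpleapi/trustStore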
Authentication
The following Java package and its associated Java source code provides a thin wrapper class
for requesting a Kerberos authentication/delegation token from the Voltage SecureData Server:
Package Contents
This package defines the following class in a .java source file of the same name:
Common Configuration
The following Java package and its associated Java source code provide classes to read and
parse the configuration settings from several different types of configuration files used by the
Developer Templates:
Package Contents
This package defines the following set of classes and enumerations in .java source files of the
same name:
• FilepathBuilder - This utility class provides a method for constructing file paths.
• UserInfo - This interface defines a uniform mechanism for retrieving the current user
in different operating contexts (generic Hadoop versus Hive).
Package Usage
This package provides functionality to read a set of configuration settings that are typically
required by the Voltage SecureData APIs regardless of whether they are used in the context of
Hadoop or not. These settings can come from a variety of sources, such as XML configuration
files on HDFS, as used by the Hadoop Developer Templates, or from a Java Properties file on
the local file system, as used by the NiFi and Kafka-Storm templates.
The XML and Java Properties configuration files processed by the classes in this package must
conform to the relevant XSD files and the requirements specified in "Java Properties
Configuration Files" (page 3-45), respectively.
When first running the sample jobs in the Hadoop Developer Templates, you can leave all the
settings at their default values in the configuration files, except for possibly customizing the
location where you installed the Simple API on the data nodes, as follows:
XML: <simpleAPI installPath="/path/to/simpleapi/location" />
or
Java Properties: simpleapi.install.path = /path/to/simpleapi/location
You may also want to change the configuration settings for your own Voltage SecureData
Server, after first trying out the jobs against the public-facing Voltage SecureData Server
dataprotection, hosted by Micro Focus.
This package will read and populate in-memory versions of the following configuration settings,
required by all of the Developer Templates:
• Configuration Version
• REST Hostname
XML configuration file (vsconfig.xml) and Java Properties configuration file
(vsnifi.properties).
• Product Name
XML configuration files (vsconfig.xml and/or vs<component>.xml) and Java
Properties configuration file (vsnifi.properties).
• Product Version
XML configuration files (vsconfig.xml and/or vs<component>.xml) and Java
Properties configuration file (vsnifi.properties).
• Identity
XML configuration file (vsconfig.xml) and the Properties tab of the Configure
Processor dialog box of the SecureDataProcessor NiFi Processor.
• API
XML configuration file (vsconfig.xml) and the Properties tab of the Configure
Processor dialog box of the SecureDataProcessor NiFi Processor.
• Format
XML configuration file (vsconfig.xml) and the Properties tab of the Configure
Processor dialog box of the SecureDataProcessor NiFi Processor.
• Translator Class
XML configuration files only (vsconfig.xml and/or vs<component>.xml).
• Field Index
XML configuration files only (vsconfig.xml and/or vs<component>.xml).
• Field Name
XML configuration files only (vsconfig.xml and/or vs<component>.xml).
• Authentication/Authorization Method
XML configuration file (vsauth.xml) and the Properties tab of the Configure
Processor dialog box of the SecureDataProcessor NiFi Processor.
• Shared Secret
XML configuration file (vsauth.xml) and the Properties tab of the Configure
Processor dialog box of the SecureDataProcessor NiFi Processor.
• Username
XML configuration file (vsauth.xml) and the Properties tab of the Configure
Processor dialog box of the SecureDataProcessor NiFi Processor.
• Password
XML configuration file (vsauth.xml) and the Properties tab of the Configure
Processor dialog box of the SecureDataProcessor NiFi Processor.
NOTE: The configuration settings described in section "Configuration Settings" on page 3-5
also include a number of XML infrastructure settings that are used within the XML files only
to categorize that configuration information and link it together in useful ways. These
settings include: Component-Specific Designator for Client ID, CryptId Name, CryptId AuthId
Name, CryptId Description, Component-Specific Designator for Fields, CryptId Name for
Field, and AuthId Name.
Hadoop Configuration
The following Java package and its associated Java source code provide classes to extend
several common configuration classes for reading and storing configuration properties that are
specific to the shared configuration files used by the Hadoop Developer Templates
(MapReduce, Hive, Sqoop, and Spark), as well as for reading those configuration files from
HDFS:
NOTE: The NiFi Developer Template extends the same common configuration classes with
its own specific configuration classes. For more information, see "Processor Classes for the
NiFi Developer Template" (page 8-8).
Package Contents
This package defines the following set of classes in .java source files of the same name:
This class wraps the call to the populate method of the HadoopConfigPopulator
class in code that reads several Java Properties files from a specified or default location
in HDFS. This approach allows for the addition of other types of configuration loaders
that load configuration data from different input sources; reading from HDFS is just one
example approach demonstrated in the Hadoop Developer Templates.
The class HDFSConfigLoader provides alternative ways to specify the locations of the
Hadoop configuration files:
• Using another configuration file that contains the locations of the primary
configuration files in HDFS, packaged into a well-known JAR file.
For more information about these alternative methods, see "Specifying the Location of
the XML Configuration Files" (page 3-47).
• ReleaseInfo - This class provides methods for retrieving release information from
the Java Properties file vsrelease.properties. This information is logged when
configuration information is loaded and used in the REST API’s User Agent field.
• vsmr.xml - Optionally used to specify clientId and field configuration information
for the MapReduce Developer Template that can also be specified in the shared
configuration file vsconfig.xml.
• vshive.xml - Optionally used to specify clientId configuration information for the
Hive Developer Template that can also be specified in the shared configuration file
vsconfig.xml.
NOTE: In version 4.2, the field configuration information for the RDD and Dataset variants of
the Spark Developer Template uses a component-specific configuration file. The field
configuration information for the MapReduce Developer Template and the Sqoop Developer
Template is provided in the shared configuration file vsconfig.xml. Beginning with
version 4.1, the UDF-based Developer Templates, including the Hive Developer Template
and the UDF variants of the Spark Developer Template, no longer require field configuration
information in a configuration file. Instead, they specify a cryptId to be used when protecting
or accessing the field to which they are applied.
Package Usage
The Hadoop Developer Templates (MapReduce, Hive, Sqoop, and Spark) share a pair of XML
configuration files and optionally use individual configuration files for each individual Hadoop
component (vsspark-rdd.xml being the only example included):
The implementation that uses these configuration files provides an example of the types of
configuration information that your data nodes will need in order to be able to use one or more
of the three Voltage SecureData APIs while running their jobs. It also serves as an example of
one possible approach to making this configuration information available to your data nodes.
This package will read and populate the common configuration settings (see "Common
Configuration" (page 3-57)) and also an in-memory version of the following configuration
setting, required when Kerberos authentication is used with the Hadoop-specific Developer
Templates:
All of these Developer Templates are located together in the following directory:
<install-dir>/stream
Package Contents
This package defines the following set of classes in .java source files of the same name:
• StreamConfigLoader - This class provides methods for constructing the full paths
to the configuration files (and for legacy situations, for getting the configuration settings
from a Java Properties file into the in-memory container classes).
Package Usage
This package provides functionality common to the DataStream Developer Templates:
StreamSets, Kafka Connect, and Kafka-Storm. At this time, this functionality is used to read the
set of configuration settings that are required by the underlying Voltage SecureData APIs. As
shipped, these settings come from XML configuration files on the local file system but, to support
legacy Kafka-Storm installations, they can also come from Java Properties configuration files on
the local file system.
The XML and Java Properties configuration files processed by the classes in this package must
conform to the relevant XSD files and the requirements specified in "Java Properties
Configuration Files" (page 3-45), respectively.
When first running the sample jobs and pipelines in the DataStream Developer Templates, you
can leave all the settings at their default values in the configuration files, except for possibly
customizing the location where you installed the Simple API on the data nodes, as follows:
XML: <simpleAPI installPath="/path/to/simpleapi/location" />
or
Java Properties: simpleapi.install.path = /path/to/simpleapi/location
You may also want to change the configuration settings for your own Voltage SecureData
Server, after first trying out the jobs and pipelines against the public-facing Voltage SecureData
Server dataprotection, hosted by Micro Focus.
This package will read the specified configuration files on the local file system and populate in-
memory versions of the configuration settings described in "Common Configuration" (page 3-
57).
Data Conversion
The following Java package and its associated Java source code provide classes to convert
between specific data type objects (such as dates and doubles) and generic strings:
Package Contents
This package defines the following set of classes and interfaces in .java source files of the
same name:
• BigDecimalConverter, DoubleConverter,
FloatingPointNumberConverter, IntegerConverter, LongConverter,
and NumberConverter - These classes provide a class hierarchy for the classes used
to convert various Java numeric data types to and from the string data expected by the
SecureData APIs.
• LegacySimpleAPIDateConverter,
LegacySimpleAPIDateTimeConverter, and
LegacySimpleAPITimeOnlyConverter - These classes provide functionality to
convert between Java date/time objects and the date/time string format expected by
pre-4.3 versions of the Simple API.
NOTE: The DataConverterFactory class, as provided, does not use these classes.
They are only present to support legacy pre-4.3 Simple API date/time operations. If
needed, you will need to change the code in the DataConverterFactory class to
use one or more of the appropriate Legacy* classes when using a pre-4.3 version of
the Simple API.
Package Usage
Because the SecureData APIs accept input only as strings, non-string input data sometimes
needs to be changed to a string format before passing it to a SecureData API. The classes in
this package are used to convert from non-string data types to strings prior to data protection
processing, and then to convert the data protection results from strings back to the specific
data types.
These classes are used for the Sqoop Developer Template integration, where the fields in the
database table have specific non-string data types. See the Developer Template code and
Javadocs for the DataConverter interface for more details.
When the Sqoop import integration runs, it automatically determines the data type of the fields
being protected, using the data types of the corresponding getter methods in the generated
object-relational mapping (ORM) class. For more information about how a generated ORM
class is used in the Sqoop template integration, see "Integration Architecture of the Sqoop
Template" (page 6-2). After the data types are determined, the Sqoop template integration
maps any non-string data types to an appropriate DataConverter implementation class,
which performs the specific conversions to and from the corresponding string formats.
For example, a date field would have a specific converter to convert the Java Date object to a
formatted string value, which is needed before calling either the Simple API or the REST API.
After the string result is returned by the relevant API, it is converted back into a Java Date
object for Sqoop to write as output by the ORM class.
NOTE: In most cases, the data type conversion is straightforward. However, there are
situations where it gets more complicated, such as date processing in pre-4.3 versions of
the Simple API. Versions of the Simple API prior to 4.3 do not work on formatted date/time
input values directly, and require translation into an internal syntax of the following form:
<year>:<month>:<day>:<hour>:<minute>:<second>:::
For more information about the stricter date formatting requirements in older versions of the
Simple API, see the Simple API Release Notes for version 4.2 or earlier.
The mapping from the API in use (Simple API or REST API) and a specific data type (such as
Date) to the corresponding DataConverter implementation class is encapsulated in the
DataConverterFactory class. Specific integrations such as the Sqoop template
integration can call this factory to request the appropriate converter implementation for the
field being processed.
The concrete implementation classes (with class names of the form <datatype>Converter)
implement the methods convertToString and convertFromString to convert between
the specific object type and its corresponding string representation.
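The following Java sketch illustrates that general shape. The interface and method signatures shown here are assumptions for illustration only; consult the DataConverter interface and its Javadocs in the Developer Template code for the actual contract.

// Hypothetical shape of the DataConverter contract described above; the
// interface shipped with the Developer Templates may differ.
interface DataConverter {
    String convertToString(Object value);
    Object convertFromString(String value);
}

// Example converter following the <datatype>Converter naming pattern.
class ExampleIntegerConverter implements DataConverter {

    // Convert the typed value to the string form expected by the SecureData APIs.
    @Override
    public String convertToString(Object value) {
        return value == null ? null : value.toString();
    }

    // Convert the protected/accessed string result back to the typed value.
    @Override
    public Object convertFromString(String value) {
        return value == null ? null : Integer.valueOf(value);
    }
}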
Because the Sqoop integration knows the data types of the input fields/columns, this
conversion is performed automatically, without requiring any custom settings in the Hadoop
configuration file vsconfig.xml.
CAUTION: Some of the converter classes provided by the converter package (such as the
converter classes for converting from and back to different numeric data types) are not
exercised by the Sqoop integration, as delivered. If you choose to use them, be sure to test
them thoroughly before you deploy them.
Data Translation
The following Java package and its associated Java source code provide classes to translate
between an input string format and the string format expected by the SecureData APIs:
Package Contents
This package defines the following set of classes and interfaces in .java source files of the
same name:
Package Usage
The translator package provides classes for translating between different string
representations of data before and after processing by a Voltage SecureData API. An input
value to be processed may need to be pre-processed into a different string format before a
Voltage SecureData API is invoked, and the data protection results post-processed back into
the original string format.
NOTE: For converting between different data types and their corresponding string
representations, as required by the Sqoop template import integration, see "Data
Conversion" (page 3-65).
When a Hadoop Developer Template integration cannot automatically determine the data
types of individual values, such as the string inputs in HDFS processed by the MapReduce
Developer Template, they may require translation before and/or after they are protected or
accessed. In such cases, you must configure the appropriate translator implementation class in
the Hadoop configuration file vsconfig.xml so that the correct translation will be performed.
The Developer Template code also shows an advanced option to initialize the translator with
optional custom settings. While this advanced option is not needed for the sample data and
configuration provided with the Developer Templates, the Javadocs and the Developer
Template code itself show how this is implemented. An example of how this custom
initialization could be used would be to update the configuration settings to allow full date/time
(with a time granularity of one second) processing when using the Simple API. For details, see
the method init in the class LegacySimpleAPIDateTranslator.
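For example, a cryptId that attaches such a translator might be sketched as follows, using the same placeholder style as the vsconfig.xml examples earlier in this chapter (the class name would be the fully qualified name of the translator implementation, and the initialization data is optional):

<cryptId name="cryptId name"
         format="data protection format"
         identity="identity for key derivation"
         authId="authId name"
         api="API to use"
         translatorClass="fully qualified translator class name"
         translatorInitData="optional translator initialization data"
         description="informational description" />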
NOTE: In the case of Sqoop, an explicit translator is often not required because the data type
of the table field/column automatically determines the data conversion to perform, as
described in "Data Conversion" (page 3-65). However, if you are storing date values as a
string (VARCHAR) column in your database table, you might need to configure an explicit
translator to perform pre- and post-processing in the course of protecting or accessing that
field/column.
Cryptographic Abstraction
The following Java package and its associated Java source code provide classes to create a
cryptographic abstraction layer, hiding the details of the calls to the different SecureData APIs
behind a single generic data protection API:
Package Contents
This package defines the following set of interfaces, abstract classes, and classes in .java
source files of the same name:
• CryptoFactory - This class provides methods for caching and re-using crypto
instances for a given set of format information for each of the Voltage SecureData data
protection APIs.
• SimpleApiTester - This class provides a mechanism for testing the Simple API
outside the context of the abstraction layer.
Package Usage
The protect and access APIs provided by this layer are defined in a general Crypto interface
and implemented in the LocalCrypto and RestCrypto classes (mostly implemented in their
shared abstract superclass BaseCrypto).
In the Hadoop Developer Templates, the MapReduce, Hive, and Sqoop template code calls the
getCrypto method of the class CryptoFactory, which returns either a new or recycled
instance of either the LocalCrypto or RestCrypto class, depending on either the global or
column-specific configuration settings that specify which Voltage SecureData data protection
API to use in each case.
In the NiFi Developer Template, the method that processes the NiFi processor’s input stream
calls the getCrypto method of the class CryptoFactory, which returns either a new or
recycled instance of either the LocalCrypto or RestCrypto class, depending on the API
type configured for that processor.
In the Kafka-Storm Developer Template, the Storm bolt template code calls the getCrypto
method of the class CryptoFactory, which returns either a new or recycled instance of the
LocalCrypto class by default, or RestCrypto class if rest is provided as an optional fourth
command line parameter to the script run-storm-topology.
Using this approach, the calls to the Voltage SecureData APIs are isolated in a specific section
of the code, and not called directly by the Hadoop job, Storm bolt, NiFi processor code, and so
on. In other words, instead of the relevant Developer Template Java classes calling the Simple API or
the REST API directly, they request a Crypto object from the CryptoFactory class, and then
use the returned Crypto instance to perform the data protection operations, without
knowledge of whether these operations are being performed locally by the Simple API or
remotely by the REST API.
The data protection abstraction layer, along with some configuration settings, hides which of
the two available SecureData APIs is actually performing the data protection operations on
behalf of the code that is calling the classes in the crypto package. It serves as a good
example of best practices template code that is ready for use in a production environment after
appropriate testing.
• For most of the Developer Templates, from the XML configuration files vsauth.xml
and vsconfig.xml. If you change how configuration is performed for your production
Developer Templates solution, you will need to make corresponding changes in the
crypto package.
• For the NiFi Developer Template, from the configuration file vsnifi.properties and
from the properties configured for the template’s sample processor. Likewise, if you
change how configuration is performed for your production NiFi workflow, you will need
to make corresponding changes to the crypto package.
This factory/interface approach follows the recommended practice of loose coupling and
programming to an interface, not to an implementation.
// Get Crypto instance for specified API type and format info.
Crypto crypto = CryptoFactory.getCrypto(apiType, formatInfo);
NOTE: Using a data protection abstraction layer is a recommended best practice, but not a
requirement when calling the SecureData APIs. The Developer Templates show an example
of using this approach, but you can integrate the SecureData APIs into your Hadoop jobs
and NiFi workflows in different ways.
Using Old Versions of Other Voltage SecureData Software
By default, the MapReduce Developer Template attempts to protect the data in the name
column in the sample data file plaintext.csv using the Simple API. The first 20 rows of this
sample data file use characters outside the ASCII range (such as the accent characters used in
many European languages) in the name column, as highlighted in the following names:
• Fabien Baillairgé
• Adélaïde Clérisseau
• Jean-Noël Emmanuelli
To protect more than the ASCII-range characters in these names, the XML configuration file
vsconfig.xml defines a cryptId named extended that uses the FPE2 format
AlphaExtendedTest1. In order to protect this field as expected, you must be using version
5.0 or later of the Simple API (because version 5.0 of the Simple API was the first version to
include support for FPE2 formats). If you are using an older version of the Simple API, there are
a few alternatives to make the MapReduce Developer Template run successfully (a configuration
sketch illustrating these alternatives follows the list):
• In the XML configuration file vsconfig.xml, you could change the cryptId extended
to use the REST API:
In order for this solution to work, you must be using version 6.0 or later of the Voltage
SecureData Server (because version 6.0 of the Voltage SecureData Server was the first
version to include REST API support for FPE2 formats).
• In the XML configuration file vsconfig.xml, change the name field (index 1) to use a
cryptId that does not specify a FPE2 format. For example, use the cryptId named alpha
instead:
Note that the extended characters (outside the ASCII range), such as those highlighted
above, will not be protected and will pass through to the ciphertext unchanged.
• In the XML configuration file vsconfig.xml, comment out the field specification for
the name field, leaving it unprotected:
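The following sketch illustrates these alternatives. The entries are placeholders patterned on the cryptId and field examples earlier in this chapter; the exact entries shipped in vsconfig.xml, including the designator used for the REST API in the api attribute, may differ:

<!-- Alternative 1: keep the FPE2 format but direct the cryptId to the REST API -->
<cryptId name="extended"
         format="AlphaExtendedTest1"
         api="REST API designator" />

<!-- Alternative 2: map the name field (index 1) to the cryptId alpha instead -->
<field index="1" cryptId="alpha" />

<!-- Alternative 3: comment out the field specification for the name field -->
<!-- <field index="1" cryptId="extended" /> -->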
In contrast to the associated Voltage SecureData software version assumed by the MapReduce
Developer Template, the Hive Developer Template, as shipped, does not assume that the
Simple API and the REST API support FPE2 extended characters. In the scripts
run-hive-join-query.hql and run-impala-join-query.sql, the following lines,
respectively, are commented out:
-- accessdata(s.name, 'extended') AS name_decrypted,
To access the name field using the cryptId extended, remove the comment designators (--)
from the beginning of these lines (and to avoid retrieving the ciphertext version as well, remove
the s.name, from the line above).
If you are going to use your own Voltage SecureData Server of version 6.0 or later, you will need
to define an appropriate format named AlphaExtendedTest1, as shown above, or an
equivalent FPE format that includes the required extended characters in its protection alphabet
with a corresponding change to the format.name field(s) above. For information about how to
define this format, see "AlphaExtendedTest1" (page A-3).
For more information about the extended character support provided by the REST API and the
Simple API, see the Voltage SecureData REST API Developer Guide and the Voltage
SecureData Simple API Developer Guide (version 5.0 or later), respectively.
Shared Sample Data for the Hadoop Developer Templates
The Hadoop Developer Templates share two sample data files, located in the sampledata
subdirectory. The sample data files are CSV files that contain randomly-generated data that
has been tested with the software. The scripts that implement the sample jobs use this sample
data to demonstrate that data can be protected and accessed in your environment. The sample
data files are described in the following sub-sections.
plaintext.csv
This sample data file includes 10,000 rows of plaintext data consisting of the following
columns:
5 - Zip code (mix of 5-digit numbers and 9-digit numbers with a delimiter)
8 - Date of birth (mix of digits with delimiters separating the year/month/day, in the
pattern YYYY-MM-DD)
For the MapReduce Developer Template, the configuration file vsconfig.xml includes
settings needed to protect and access the data in columns 1, 7, 8, 9, and 10 using the Simple
API for all cryptographic operations except for the credit card numbers in column 9, which are
processed by the REST API using an SST format.
For the Sqoop Developer Template, the configuration file vsconfig.xml includes settings
needed to protect and access the data in columns 7, 8, 9, and 10 using the Simple API for all
cryptographic operations except for the credit card numbers in column 9, which are processed
by the REST API using an SST format.
The data that is specified for protect and access operations by these settings in the
configuration file vsconfig.xml includes:
NOTE: The names in column 1 of the first 20 rows in the sample data file
plaintext.csv include characters outside the normal ASCII range.
• The email addresses in column 7 are protected with a format named Alphanumeric.
This is a Variable-Length String (VLS) format that uses FPE to encrypt PII.
• The date of birth values in column 8 are protected using a format named DATE-ISO-
8601. This is a date format that uses FPE to protect PII.
• The credit card numbers in column 9 are protected using a format named cc-sst-6-4.
This is a credit card format that uses Secure Stateless Tokenization™ (SST) protection
to tokenize Payment Card Industry (PCI) data. In this format, the first six digits and the
last four digits remain in the clear, and the middle digits are tokenized.
• The US Social Security numbers in column 10 are protected using a format named SSN.
This is a US Social Security Number format that uses FPE to protect PII.
creditscore.csv
This file includes 10,000 rows of plaintext data consisting of the following two columns:
• US Social Security number (with values identical to those in column 10 of the file
plaintext.csv)
Common Procedures for the Hadoop Developer Templates
This section provides instructions for performing procedures that are common to all of the
Hadoop Developer Templates, including common procedures for setting up HDFS as expected
by the templates and common procedures for working with Kerberos authentication.
If this directory exists, the command prompt returns a message showing the items found, if any.
If this directory does not exist, a message indicates that there is no such file or directory. In this
case you must create the directory and set the owner using commands similar to the following:
sudo -u hdfs hdfs dfs -mkdir /user/<user>
sudo -u hdfs hdfs dfs -chown <user>:<user> /user/<user>
If you see an error that the hdfs command is not found, you can add the hdfs script
location into the PATH variable, using a command similar to the following:
export PATH=$PATH:/opt/mapr/hadoop/hadoop-<version>/bin
Note that if this user directory does not exist, the sample scripts fail with the following Hadoop
security exception:
org.apache.hadoop.security.AccessControlException:
Permission denied: user=<user>, access=WRITE, inode="/user"...
Subsequent commands in this chapter refer to relative paths for configuration and input and
output files in HDFS, and must be run as the user account for the home directory specified in
this section.
For example, the configuration files are specified as relative paths in HDFS:
voltage/config/vsauth.xml
voltage/config/vsconfig.xml
In Hadoop, the full absolute paths to these files are resolved relative to your home directory in
HDFS, as follows:
/user/<user>/voltage/config/vsauth.xml
/user/<user>/voltage/config/vsconfig.xml
• Copies the updated configuration files vsauth.xml and vsconfig.xml to the HDFS
config directory created above.
NOTE: This script will copy these two default configuration files to their default
location in HDFS regardless of whether you are using either of the two alternative
methods for specifying configuration files for the Hadoop Developer Templates. For
more information about these alternative methods, see "Specifying the Location of
the XML Configuration Files" (page 3-47).
• Copies the sample data files plaintext.csv and creditscore.csv to HDFS. The
former sample data file is copied to the directory expected by the MapReduce protect
job (the HDFS mr-sample-data directory created above). The latter sample data file is
copied to the directory expected by the Hive table creation script (the HDFS
hive-sample-data directory created above).
• To support the creation of permanent Hive UDFs, copies the required JAR files
(voltage-hadoop.jar and vibesimplejava.jar) to the HDFS hiveudf directory
created above.
NOTE: If you are using the JAR-based alternative configuration file location
approach, as described in "Config-Locator Properties File Packaged as a JAR File"
(page 3-48), you can uncomment a line in this script to also copy the JAR file
vsconfig.jar to the HDFS hiveudf directory created above.
Because this script does nothing other than copy two default configuration files to their default
location in HDFS, it is not useful (as is) if you are using either of the two alternative methods for
specifying different locations for the Hadoop Developer Template configuration files. For more
information about these alternative methods, see "Specifying the Location of the XML
Configuration Files" (page 3-47).
Depending on your scenario, the following two modifications to this script may be useful:
• If you changed the MapReduce, Hive, and/or Sqoop Developer Templates so that they
use one or more component-specific configuration files, you can modify this script to
also copy the relevant component-specific configuration files to the appropriate
directory in HDFS and set their file access attributes, as sketched after the following note.
NOTE: In this regard, you could also follow the model used by the Spark Developer
Template, which provides the auxiliary script update-spark-config-files-in-
hdfs for updating its component-specific configuration file (vsspark-rdd.xml).
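As a hedged illustration of that kind of modification, the following lines copy a hypothetical
MapReduce-specific configuration file (the name vsconfig-mr.xml and its local location are
assumptions, not files shipped with the templates) and restrict its permissions in the same
spirit as the default files:
hdfs dfs -put -f vsconfig-mr.xml voltage/config/
hdfs dfs -chmod 640 voltage/config/vsconfig-mr.xml        # limit access to the owner and the file's group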
When you opt to use Kerberos authentication with the Hadoop Developer Templates, there are
some extra steps that you must take, including the use of several Kerberos-specific scripts
provided with the templates. This section provides instructions for these additional steps. It
begins with a short description of the prerequisites for using Kerberos authentication with the
Hadoop Developer Templates, followed by a detailed description of the Kerberos-specific
scripts, their parameters, and other operational details.
NOTE: Kerberos authentication is not supported for the NiFi Developer Template and the
Kafka-Storm Developer Template.
• A functional Kerberos Key Distribution Center (KDC), which the Hadoop cluster is
configured to use.
• Version 6.5 or higher of the Voltage SecureData Server, required for Kerberos server-
side authentication configuration and REST API support.
• Version 5.20 or higher of the Simple API, required for Kerberos authentication support
when requesting cryptographic keys for local protect and access operations.
IMPORTANT: In order to build the Kerberos service ticket on the Hadoop client, the nodes in
the Hadoop cluster must be running Java 8 (Update 151 or higher). Earlier versions of Java
are not able to build the required service tickets.
• vskinit - This script parallels the Kerberos kinit command and can be run as follows
from the directory <install_dir>/bin:
> ./vskinit <optional parameters>
Use the vskinit command to initialize and store a Kerberos delegation token for the
current user that can be used for authentication of subsequent interactions with the
Voltage SecureData Server. This delegation token is short-lived, and automatically
expires after 24 hours.
• vsklist - This script parallels the Kerberos klist command and can be run as follows
from the directory <install_dir>/bin:
> ./vsklist <optional parameters>
Use the vsklist command to list information about the current user's delegation
token, if they have one. This command will also warn you if the delegation token has
possibly expired, based on the timestamp of the token file.
• vskdestroy - This script parallels the Kerberos kdestroy command and can be run
as follows from the directory <install_dir>/bin:
> ./vskdestroy <optional parameters>
Use the vskdestroy command to destroy the current user's delegation token (delete
the delegation token file), if they have one. This explicit step is useful for immediately
preventing subsequent jobs from using the token to authenticate with the Voltage
SecureData Server before it expires on its own.
For more information about this optional step, see "Optional Destruction of the
Delegation Token" (page 3-84).
NOTE: Make sure that the users running the Hadoop jobs and these vsk* scripts have
sufficient permission to read and write files in the HDFS directory specified by the
Delegation Token HDFS Path configuration setting (page 3-26), including permission to
create the directory if it does not already exist. Otherwise, Hadoop will throw a “Permission
denied” exception when it attempts to write the user's delegation token file when running
the vskinit script.
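Putting these pieces together, a typical interactive sequence might look like the following
sketch; the Kerberos principal and the job script name are placeholders:
kinit alice@EXAMPLE.COM        # obtain (or renew) the Kerberos TGT
cd <install_dir>/bin
./vskinit                      # request and store a delegation token for the current user in HDFS
./vsklist                      # optional: confirm that the token exists and has not expired
./run-mr-protect-job           # run a template job that authenticates using the stored token
./vskdestroy                   # optional: remove the token as soon as it is no longer needed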
All of the Kerberos-specific scripts provided with the Hadoop Developer Templates share
the same set of optional parameters:
• --config <HDFS-path> Specify a custom location (the path and filename) for
the Hadoop Developer Template configuration file in
HDFS (normally vsconfig.xml). This can be an
absolute or relative (to the user's home directory)
HDFS path. You can use -c instead of --config.
• --auth <HDFS-path> Specify a custom location (the path and filename) for
the Hadoop Developer Template authentication file in
HDFS (normally vsauth.xml). This can be an
absolute or relative (to the user's home directory)
HDFS path. You can use -a instead of --auth.
NOTE: There are comments at the top of each of these Kerberos-specific scripts with
more information about these optional command-line arguments. You can also use the
--help argument interactively to remind yourself about the available optional
parameters.
By default, all of the Kerberos-specific scripts will look for the configuration files
vsconfig.xml and vsauth.xml in the following HDFS directory:
/user/<user>/voltage/config
The vskinit script requests a delegation token from the Voltage SecureData Server. The
hostname used to connect to the Voltage SecureData Server is read from the XML
configuration file vsconfig.xml from either the optional webService element or the
required secureDataServer element. Make sure this setting is specified correctly in
whichever vsconfig.xml file you use with the vskinit command (in the default
location or, as described below, in a custom location), even if you are not using the REST
API for cryptographic operations.
The only information that the Kerberos-specific scripts use from the XML configuration file
vsauth.xml is the value of the delegationTokenHdfsPath attribute of the kerberos
element, which determines the HDFS location in which to store the delegation token
downloaded from the Voltage SecureData Server. No other settings from this configuration
file are relevant in the context of running the Kerberos-specific scripts. In particular, note
that all authMethod attribute settings are ignored, so it does not have to be explicitly set to
Kerberos to run these scripts. The scripts will initialize, list, or destroy the user's delegation
token using Kerberos authentication regardless of the authMethod attribute settings
configured in the instance of this configuration file used by each Kerberos-specific script. In
other words, you could run the vskinit script against a given vsauth.xml file and generate a
perfectly valid delegation token that never gets used, because a Hadoop job reading that same
vsauth.xml file will not attempt Kerberos authentication unless one of its authMethod
attribute settings specifies it.
As mentioned above, these scripts provide optional parameters for specifying custom
locations for these configuration files (--config and --auth). Because different Hadoop
jobs may use different instances of these configuration files, these parameters are
particularly useful in that scenario, when you are calling one or more of these Kerberos-
specific scripts from within another script that runs a specific Hadoop job. For example,
within a script that uses a Hive Developer Template UDF (that uses Kerberos
authentication and specifies a custom location for both configuration files) to perform a
query, it could be useful to include the following invocation of the vskinit script in that
same script to specify the same custom locations (using absolute HDFS paths) for both the
general and authentication XML configuration files:
./vskinit \
--config /apps/hive/voltage/config/vsconfig.xml \
--auth /apps/hive/voltage/config/vsauth.xml
NOTE: Sharing configuration files from a common location, as shown above, for multiple
users can be useful when their queries are all being run as the system user hive.
Likewise, you may also want to specify a shared location for the delegation token files
stored in HDFS, as described in the previous section. To do so, in the shared version of
the XML configuration file vsauth.xml, set the delegationTokenHdfsPath attribute of the
kerberos element to an appropriate absolute path. For example:
delegationTokenHdfsPath="/apps/hive/voltage/config"
If you are running these Kerberos-specific scripts interactively and you want to set a
different default location for the configuration files, you can do so by using the -D generic
option for specifying a property value on the yarn command line within these scripts. For
example, you could edit the script vskinit and add the following (highlighted) parameters
to the yarn command line:
yarn jar "$script_dir"/voltage-hadoop.jar \
com.voltage.securedata.hadoop.auth.VSKInit \
-DVOLTAGE_CONFIG_FILE=<custom_vsconfig_location> \
-DVOLTAGE_AUTH_FILE=<custom_vsauth_location> \
"$@"
For more information about this method of specifying custom locations for Hadoop
Developer Template configuration files, see "-D Generic Option to Specify a Property Value"
(page 3-47).
CAUTION: The other method for specifying an alternate location for the configuration
files, which involves packaging the alternate location of the configuration files within a
properties file within a JAR file, is not supported when using Kerberos authentication.
This is because the -libjars option used for this approach does not work with yarn
commands that do not launch MapReduce jobs, such as the Hadoop Developer
Templates Kerberos-specific scripts.
All of the Kerberos-specific scripts provided with the Hadoop Developer Templates
(vskinit, vsklist, and vskdestroy) require the current user to have a valid (not
expired) Kerberos ticket granting ticket (TGT) to succeed. You cannot initialize, list, or
destroy a delegation token without the TGT on which it is based. If you do not have the
required TGT (or it has expired), you will get the standard Kerberos security exception, and
the script will fail. For example:
java.io.IOException: Failed on local exception: java.io.
IOException: javax.security.sasl.SaslException: GSS initiate
failed [Caused by GSSException: No valid credentials provided
(Mechanism level: Failed to find any Kerberos tgt)]
If this occurs, check your Kerberos TGT, and when necessary, renew it by running the
Kerberos kinit command.
Sometimes your Kerberos TGT may be old, but not yet expired. In this case, Hadoop may
attempt to renew it when you run one of the Kerberos-specific scripts. If the Kerberos TGT
renewal fails for some reason, Hadoop will log a WARN message such as the following:
WARN security.UserGroupInformation: Exception encountered while
running the renewal command for <user>@<realm>. (TGT end time:
<timestamp>, renewalFailures: org.apache.hadoop.metrics2.lib.
MutableGaugeInt@<memory_address>,renewalFailuresTotal: org.
apache.hadoop.metrics2.lib.MutableGaugeLong@<memory_address>)
Because the Kerberos TGT is still usable (not actually expired), the relevant delegation
token will still be initialized, listed, or destroyed, as requested. In other words, keep in mind
that the message shown above is just a warning, and does not necessarily indicate a failure
with respect to the delegation token processing. If the script ends with a successful INFO
message, then it succeeded.
NOTE: You may see this same Kerberos TGT renewal WARN message from Hadoop in
situations that have nothing to do with Hadoop Developer Template delegation token
processing. For example, you can see this same WARN message when you list files in
HDFS using the hdfs dfs command. You can avoid seeing this warning repeatedly by
running the Kerberos kinit command to get a fresh Kerberos TGT from your KDC.
NOTE: This required step (whether performed as part of the login procedure or done
explicitly on the command line) is always necessary when using Kerberos and is not specific
to the Hadoop Developer Templates integration.
> ./vskinit
This script, which parallels the Kerberos kinit command, is used to initialize and store a
Kerberos delegation token for the current user that can be used for authentication of
subsequent interactions with the Voltage SecureData Server. This delegation token is short-
lived, and automatically expires after 24 hours.
NOTE: The user’s previously acquired Kerberos ticket granting ticket is used to construct a
Kerberos service ticket for the Voltage SecureData Server hostname voltage-pp-
0000.<district-domain>, with the service principal name HTTP/voltage-pp-
0000.<district-domain>@<Kerberos-realm>. The service ticket is sent to the
Voltage SecureData Key Server in an HTTP request header and authenticated using its
configured keytab file. The returned delegation token is stored in an HDFS file with a name
of the following form:
<delegation.token.hdfs.path_config_value>/<hashed-and-encoded-username>.token
The filename (other than the .token extension) is constructed by hashing the username
using SHA-256 and then Base64-encoding it using a standard variant of that encoding
called “modified Base64 for filename” in which slashes are replaced with dashes.
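As a rough sketch of how that filename could be reproduced from a shell prompt (the handling
of Base64 padding and of any '+' characters is not described here, so treat the result as an
approximation rather than a specification):
user=alice                     # illustrative username
hash=$(printf '%s' "$user" | openssl dgst -sha256 -binary | base64 | tr '/' '-')
echo "expected token file: <delegation-token-hdfs-path>/${hash}.token"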
The permissions on the delegation token file are set to -rw-r----- so that it is limited to
read/write by the user and read by the file’s group. This setting limits access to the token,
while still supporting the case when HiveServer2 doAs impersonation is turned off. For more
information about the Beeline/HiveServer2 scenario in which impersonation (doAs) is
disabled, see "Kerberos Authentication When Beeline/HiveServer2 Impersonation is
Disabled" (page 3-84).
These jobs and queries will either run directly as the current user launching the job, or as a
system user (such as hive), which has been added to the delegation token file’s assigned
group. For more information about running jobs and queries as a different user in the
delegation token file’s assigned group, see "Kerberos Authentication When Beeline/
HiveServer2 Impersonation is Disabled" (page 3-84).
On the Voltage SecureData Server side, the provided delegation token is used to perform
authentication, after which authorization is performed using the specified identity with the
configured Key Server Authentication Methods and Web Service Identity Authorization Rules.
The authorization step typically uses LDAP group membership.
> ./vskdestroy
In the context of Kerberos authentication, this also involves some additional configuration
steps, to allow the system user hive to load the configuration files and to locate and read the
delegation token file for the end-user running the query. When Hive UDFs are going to use
Kerberos authentication for multiple users, you would typically use a single set of configuration
files for all users. This is because there is no need for user-specific XML configuration files due
to the fact that, in the XML configuration file vsauth.xml, all authMethod attribute values are
set to Kerberos and no end-user credentials are specified. Similarly, the delegation token file
may be stored in a common location, specified as an absolute path instead of a path that is
relative to the current user.
1. Optionally customize the HDFS directory where the delegation token files created for
the users running the UDFs are stored. Do so by updating the value of the
delegationTokenHdfsPath attribute of the kerberos element in the XML
configuration file vsauth.xml. For more information, see "Kerberos Delegation Token
HDFS Location" (page 3-3).
2. Optionally customize the location of the XML configuration files (vsconfig.xml and
vsauth.xml) to use a common path instead of a user-specific one. For more
information, see "Specifying Custom Configuration File Locations for the Kerberos-
Specific Scripts" (page 3-81).
3. Add the system user hive to the delegation token file’s assigned group. Because
HiveServer2 with impersonation (doAs) disabled runs the UDF as the system user
hive, that system user needs to be able to read the delegation token file created for the
user running the query. The Kerberos-specific script vskinit ends by locking down the
permissions on the delegation token file such that read/write access is allowed by the
current user, and read access by any user in the file’s group (-rw-r-----). Therefore,
in order for the system user hive to be able to read the delegation token file, that
system user must be added to the file’s group, as follows:
> usermod -aG <group-name> hive
IMPORTANT: Because the delegation token file permissions include read access by
the file's group (to support the case when HiveServer2 doAs impersonation is turned
off), it is very important to limit this file group’s membership to privileged users only.
If the group that has been assigned to the token file includes non-privileged
members, then those members will be able to read the user's sensitive delegation
token from the file, representing a significant security issue. It is therefore very
important that you understand and control the token file's group assignment
carefully, and limit access as appropriate.
After that is done, you can verify that the system user hive belongs to the file’s group
(in HDFS), by running the following HDFS command:
> hdfs groups hive
This command lists all the groups to which the system user hive belongs so that you
can verify that the groups assigned to the token files for the end-users who will run the
UDFs are included in this list of groups.
NOTE: In some environments, the end-user may belong to a default group that has
the same name as the user, with the delegation token file assigned to that default
user group. In this case, you can add the system user hive to that username group
directly, as follows:
> usermod -aG <username> hive
To verify that the system user hive now belongs to that username group in HDFS,
run the same hdfs groups hive HDFS command, as described above.
After you perform the optional (1 and 2) and required (3) steps above, you can run the Generic
UDFs through Beeline/HiveServer2, even with impersonation (doAs) disabled. The system user
hive will detect the user running the query and read that user's delegation token file in order to
perform Kerberos-based authentication with the Voltage SecureData Server.
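For example, such a query might be issued through Beeline as sketched below; the JDBC URL,
table name, and column name are placeholders, and the UDF name follows the lowercase
naming convention used by the template scripts (here, the generic access UDF created from the
class AccessDataGeneric):
beeline -u "jdbc:hive2://<hiveserver2-host>:10000/default;principal=hive/_HOST@<REALM>" \
  -e "SELECT accessdatageneric(ssn, 'ssn') FROM customer LIMIT 10;"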
Logging and Error Handling in the Hadoop Developer Templates
The Hadoop Developer Templates log informational and error messages using the Apache
Commons Logging library. These log messages are written to the Hadoop job log files, which
you can view by using the Hadoop job history web UI or by using the hadoop job -history
command line.
NOTE: In general, the Hadoop Developer Templates are generous with respect to logging,
logging successful operations as well as failures. For performance reasons, the former may
not be appropriate in benchmarking and production environments. Change the logging
behavior as required by modifying the source code and rebuilding.
One change of note from releases prior to 3.0 is in the Hive UDF, in which logging of
successful operations is now commented out.
The logs contain general informational and debugging messages which can be helpful when
troubleshooting issues. Make sure to examine these logs if you encounter any errors when
executing the Hadoop Developer Templates.
When you adapt the Hadoop Developer Templates code to your own purposes, you can use
this logging facility for troubleshooting and debugging.
Beyond this logging, the Hadoop Developer Template clients perform no error handling other
than rethrowing the exceptions caught from the APIs. Proper handling of returned errors is left
to developers creating production-level solutions based on the Developer Templates, according
to the particular needs of their Hadoop jobs.
If any errors are returned by the REST API when it is called from the templates, they are logged
in the Hadoop job logs, just as in the case of the Simple API. The Developer Templates do not
attempt to handle the REST API errors in any way, other than just logging them. See the
Voltage SecureData REST API Developer Guide for this API’s full list of error codes and
associated messages.
The only exception to this rule is for the special case when the REST API call fails because the
version of the Voltage SecureData Server in use does not support the REST API. This would
happen if you attempt to use the REST API with a version of the Voltage SecureData Server
older than 6.0. This specific error condition is detected, an error is thrown causing the job to fail,
and the following message is written in the Hadoop job logs:
Unable to connect to REST service on specified SecureData server.
Check that SecureData server is at version 6.0 or higher.
Handling Empty and Net-Empty Values
All of the Hadoop Developer Templates return empty and net-empty input data as is,
unchanged:
Input data is empty when the input string upon which the cryptographic operation is being
performed contains zero (0) characters.
Input data is net-empty when, due to the nature of the relevant alphabet(s), no characters in
the input are subject to protection or access. How net-empty input data is handled depends on
the chosen FPE format, as explained below.
NOTE: For VLS formats, the net-empty concept is not relevant unless the Ignore
Characters not in Alphabet option is enabled for the format in question.
When this option is not enabled, valid input can be empty or it can be non-empty, containing
only valid characters in the specified relevant alphabet. If the input contains any characters
not in the specified relevant alphabet, an error is generated.
First, the general concept of net-empty only applies to some types of formats:
Net-Empty Relevant Format Types          Net-Empty Irrelevant Format Types
Credit Card (CC)                         Specified-Format String (SFS)
US Social Security Number (SSN)          Date
Variable-Length String (VLS)             Number
NOTE: The concept of net-empty also applies to the built-in format AUTO.
CC and SSN formats have implicit alphabets, which differ between FPE and eFPE formats:
• FPE: 0-9 for both plaintext and ciphertext
• eFPE: 0-9 for plaintext, and 0-9 and A-Z (uppercase) for ciphertext
VLS formats always have explicitly defined alphabets: a single alphabet for both plaintext and
ciphertext for FPE formats and separate plaintext and ciphertext alphabets for eFPE formats.
Also, for CC and SSN formats for which some leading and/or trailing characters are preserved,
net-empty determination becomes more complicated. A plaintext CC or SSN string is
considered net-empty if:
• No digits remain to be protected, either because the string contains no digits at all or
because every digit it contains is preserved by the format.
• The plaintext string meets all other requirements of the specified format.
With respect to meeting the other requirements of the specified format, it is valid, for example,
for a CC string to have no digits at all, but having fewer than 12 digits is not considered a valid
CC. Likewise, SSN strings can also have no digits at all, but if they have any digits, they must
have exactly nine of them. And even when a valid number of digits are present, if the format
specifies the preservation of some number of leading and/or trailing digits, there may not be
any digits remaining to be protected, resulting in a positive net-empty determination.
For example, consider a CC format with preserve leading six, preserve trailing six, and Luhn
ignore. If a 12-digit plaintext is provided for protection, it is considered net-empty and returned
as the ciphertext, as is, because the six preserved leading digits and six preserved trailing digits
leave no digits to protect.
Second, for the relevant formats, the rules for determining whether an input data string is net-
empty depend on whether the format is an FPE format or an eFPE format, the latter being more
restrictive:
• FPE Formats: Regular FPE formats specify a single alphabet for both plaintext and
ciphertext. Given this, an input data string is considered to be net-empty if it contains
only characters not in the format alphabet. Consider the following examples:
• For a credit card format, with an implicit input alphabet of digits only, the input
data string “----” is net-empty.
• For a Variable-Length String format with the explicit alphabet “A-Za-z”, the
input data string “01234” is net-empty, assuming that the “Ignore Characters not
in Alphabet” option, the default, was chosen when the format was created.
• eFPE Formats: Embedded FPE (eFPE) formats specify one alphabet for plaintext and a
different (and necessarily larger) alphabet for ciphertext. Given this, an input data string,
whether plaintext to a protection operation or ciphertext to an access operation, is
considered to be net-empty if it contains only characters that are in neither the plaintext
alphabet nor the ciphertext alphabet. Consider the following examples:
• For an eFPE US social security number format, with an implicit plaintext alphabet
of digits only, and an implicit ciphertext alphabet of digits and capital letters, the
input data string “--” is net-empty.
• For an eFPE credit card number format, the input data string “-ABCD--” is not
net-empty because the letters A, B, C, and D, are in the implicit ciphertext
alphabet for eFPE credit card formats.
When you are using the Hadoop Developer Templates, keep in mind that any empty or
net-empty input data that you provide is returned unchanged, without any notification that this
type of data was processed.
Known Limitations of the Developer Templates
This section reviews aspects of the Developer Template code that were intentionally kept
simple, both for easier comprehension and to avoid overshadowing more central aspects of the
code. As you develop your production-quality solution using one or more of the SecureData
APIs, keep in mind that these areas require additional improvement.
4 MapReduce Integration
The Hadoop Developer Templates demonstrate how to integrate Voltage SecureData data
protection technology in the context of MapReduce. This demonstration includes the use of the
Simple API (version 4.0 and greater) and the REST API.
This integration relies on a set of Java classes for MapReduce, as well as the classes in the
Hadoop Developer Templates common infrastructure. This chapter provides a description of
the former as well as instructions on how to run the MapReduce template using the provided
sample data. For more information about the common infrastructure used by the Hadoop
Developer Templates, see Chapter 3, “Common Infrastructure”.
• Integration Architecture of the MapReduce Template (page 4-1) - This section explains
the Java classes that are specific to the MapReduce template as well as information
about how batch processing is achieved for the different SecureData APIs that this
template can use.
• Configuration Settings for the MapReduce Template (page 4-4) - This section reviews
the configuration settings that are relevant to the MapReduce template and provides an
example of how and why you would want to change those settings as you adapt the
template to your own use.
• Running the MapReduce Template (page 4-6) - This section provides instructions for
running the MapReduce template using the provided sample data.
The following Java package and its associated Java source code provide classes that
implement the MapReduce integration:
The MapReduce integration performs data protection processing on an input CSV file and
produces an output CSV file, both within HDFS. Specific columns in the input file are either
protected or accessed, depending on which of the following two classes you specify on the
yarn command line:
com.voltage.securedata.hadoop.mapreduce.Protector
com.voltage.securedata.hadoop.mapreduce.Accessor
The integration includes an abstract base class called BaseMapReduce, which implements the
core functionality for this integration. An important aspect of this integration is the batching of
input plaintext or ciphertext for data protection processing, which is especially important to
minimize network overhead when using the REST API. For more information about batching in
the MapReduce template, see "Batch Processing for MapReduce" (page 4-3).
NOTE: The MapReduce template code does not perform any additional reducer processing.
The data protection processing is isolated in the crypto package abstraction layer. For more
information about this common infrastructure package, see "Cryptographic Abstraction" (page
3-68).
In the MapReduce template, the configuration settings are loaded using the Hadoop
configuration class HDFSConfigLoader and the class CryptoFactory is initialized in the
overridden method setup of the MapReduce Mapper class. For more information, see
"Hadoop Configuration" (page 3-60) and "Cryptographic Abstraction" (page 3-68),
respectively.
At runtime, the overall process, as implemented in the indicated shell scripts, involves the
following general steps:
1. Copy the sample data and configuration settings into HDFS, providing data upon which
the MapReduce template code can perform data protection processing and defining
which columns to protect or access and how to protect or access them.
2. Run the yarn command to launch the MapReduce job for the specific class
(Protector or Accessor), providing the Developer Template libraries (JAR files) and
the paths to the job input and output locations in HDFS.
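A minimal sketch of what such a yarn invocation could look like is shown below; the exact
arguments are defined in the provided run-mr-protect-job and run-mr-access-job
scripts, so the JAR locations, the use of -libjars, and the HDFS directories here are
illustrative only:
yarn jar ../lib/voltage-hadoop.jar \
  com.voltage.securedata.hadoop.mapreduce.Protector \
  -libjars ../lib/vibesimplejava.jar \
  voltage/mr-sample-data voltage/protected-sample-data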
• initConfiguration(Configuration config)
This method initializes the configuration by loading the settings from the configuration
files vsauth.xml and vsconfig.xml in HDFS using the class HDFSConfigLoader to
initialize a static instance of the class HadoopConfigSettings and then using that
instance to initialize the class CryptoFactory.
• initCryptoList()
This method initializes the list of Crypto instances to use to perform the protect/access
operations, based on the settings read from the configuration files vsauth.xml and
vsconfig.xml.
This method is the overridden method map of the MapReduce Mapper class, which the
yarn command expects to find in the class specified on its command line. This is also
where the batching logic can be found.
The batching of plaintext and ciphertext for efficient processing by the REST API is
accomplished within the MapReduce template code in the overridden method map of the
abstract static class BaseMapper. The BaseMapper class is nested within the abstract class
BaseMapReduce, which extends the Hadoop MapReduce class Mapper. This logic performs
the following steps:
1. Reads a set of lines from a CSV file in HDFS. The NLinesInputFormat and
NLinesRecordReader classes in the util package are used to retrieve multiple lines
at a time rather than the default of one line at a time.
4. Invokes the Simple API or the REST API, as specified, to protect or access each batch of
column values. When using the Simple API, the looping over the batch of plaintext or
ciphertext values is done in the methods protectFormattedDataList or
accessFormattedDataList of the template class LocalCrypto. When using the
REST API, the list processing is performed remotely as part of a single REST call.
5. Loops through the input lines again, and for the relevant columns, replaces plaintext
with ciphertext (protection operations) or replaces ciphertext with plaintext (access
operations), and writes the output line, one at a time, as output of the map method.
Configuration Settings for the MapReduce Template
There are three classes of configuration settings used by the MapReduce Developer Template:
<fieldMappings>
<fields component="mr">
<field index = "1" cryptId = "extended"/>
<field index = "7" cryptId = "alpha"/>
<field index = "8" cryptId = "date"/>
<field index = "9" cryptId = "cc"/>
<field index = "10" cryptId = "ssn"/>
</fields>
.
.
.
</fieldMappings>
For more information about these settings, see "Common Configuration" (page 3-57).
Before you begin to modify the MapReduce Developer Template XML configuration files for
your own purposes, such as using your own Voltage SecureData Server or different data
formats, Micro Focus Data Security recommends that you first run the MapReduce Developer
Template samples as provided, giving you assurance that your Hadoop cluster is configured
correctly and functioning as expected.
You will need to update many of the values in the XML configuration files vsauth.xml and
vsconfig.xml in order to protect your own data using your own Voltage SecureData Server.
In order to add protection and access of the phone number field when running the MapReduce
Developer Template, you would need to edit the XML configuration file vsconfig.xml to add
new configuration settings to the MapReduce fields element, as follows (addition
highlighted):
<fieldMappings>
<fields component="mr">
<field index = "1" cryptId = "extended"/>
<field index = "6" cryptId = "alpha"/>
<field index = "7" cryptId = "alpha"/>
<field index = "8" cryptId = "date"/>
<field index = "9" cryptId = "cc"/>
<field index = "10" cryptId = "ssn"/>
</fields>
.
.
.
</fieldMappings>
• The index attribute of the new field element for the phone number is set to 6
because the phone numbers appear as the seventh CSV column (remember, zero-
based) in the sample data file plaintext.csv.
• The cryptId attribute of the new field element for the phone number is set to
alpha, which is the name of a cryptId that uses the built-in format Alphanumeric and
the Simple API (the default API). This format will produce ciphertext phone numbers
with plaintext digits replaced with either digits or letters (upper and lowercase). Non-
alphanumeric characters such as +, (, and ) will be preserved, as is.
NOTE: Remember that if you edit the versions of the configuration files vsauth.xml and/or
vsconfig.xml on the local file system, you must copy the updated versions to HDFS,
which is where the Hadoop Developer Templates will access them when they are run. For
instructions, see "Loading Updated Configuration Files into HDFS" (page 3-77).
Running the MapReduce Template
First, remember that you must perform the common preparatory steps, as described in
"Common Procedures for the Hadoop Developer Templates" on page 3-75.
To protect data in the plaintext.csv file, navigate to the bin directory and run the
following script:
./run-mr-protect-job
This script uses YARN to initiate a MapReduce job that protects the sample data, then writes
the protected output files to the following directory in HDFS (relative to your HDFS home
directory):
voltage/protected-sample-data
NOTE: The output is also copied to your local sampledata directory, for later use when
creating the Hive table.
To decrypt or de-tokenize the protected data that is now located in the directory voltage/
protected-sample-data, navigate to the bin directory and run the following script:
./run-mr-access-job
This invokes a MapReduce job that accesses the protected output from the previous command,
then writes the accessed output files to the following directory in HDFS (relative to your HDFS
home directory):
voltage/accessed-sample-data
NOTE: The output is also copied to your local sampledata directory, where you can use it
to validate that your original data is restored.
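To spot-check either result directly in HDFS, you can list and sample the output files; the
part-file naming shown here is the usual MapReduce convention and may vary on your cluster:
hdfs dfs -ls voltage/protected-sample-data
hdfs dfs -cat voltage/protected-sample-data/part-* | head -n 5
hdfs dfs -cat voltage/accessed-sample-data/part-* | head -n 5    # after running the access job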
5 Hive Integration
The Hadoop Developer Templates demonstrate how to integrate Voltage SecureData data
protection technology in the context of Hive. This demonstration includes the use of the Simple
API (version 4.0 and greater) and the REST API.
To make use of the data stored in Hadoop, users typically run data analytics applications on
computers outside of the Hadoop cluster. Those applications connect to the Hadoop cluster
using connections such as Open Database Connectivity (ODBC) or Java Database Connectivity
(JDBC) to execute Hive queries. These queries can work seamlessly with data that has been
encrypted or tokenized using the Voltage SecureData APIs. However, if an application needs
access to the plaintext, you can configure the Hadoop cluster to access the protected data. This
eliminates the need to install additional software on the computer running the analytics
application.
This integration relies on a set of Java classes that implement several Hive user-defined
functions (UDFs) for protecting and accessing data using Voltage SecureData APIs, as well as
the Java packages in the common infrastructure. This chapter provides a description of the
former as well as instructions on how to run the Hive template using the provided sample data.
For more information about the common infrastructure used by the Hadoop Developer
Templates, see Chapter 3, “Common Infrastructure”.
• History of Hive Support in the Hive Developer Template on page 5-2 - This section
provides information about how the Hive Developer Template has changed over the
previous several releases. These changes have occurred in response to changes in Hive
itself, as well as with the addition of other Hadoop components designed to improve
upon Hive’s performance, such as Impala and LLAP.
• Different Types of Hive UDFs on page 5-4 - This section explains the Java classes that
are specific to the Hive template as well as information about the batch processing
limitations inherent to the Hive template.
• Integration Architecture of the Hive Template on page 5-12 - This section explains the
Java classes that are specific to the Hive template as well as information about the batch
processing limitations inherent to the Hive template.
• Configuration Settings for the Hive Template on page 5-21 - This section reviews the
configuration settings that are relevant to the Hive template and provides an example
of how and why you would want to change those settings as you adapt the template to
your own use.
• Running the Hive Developer Template on page 5-23 - This section provides
instructions for running the Hive template using the provided sample data.
As the supported Hadoop distributions have issued new releases, the newer versions have
included newer versions of Hive, necessitating changes to the Hive Developer Template as new
versions of the Voltage SecureData for Hadoop Developer Templates have been released.
Further, newer versions of the Hadoop Developer Templates have added support for other
Hadoop components related to Hive, such as Impala and LLAP (Live Long and Process). This
section provides a brief history of those changes and summarizes how the Hive Developer
Template supports these various components.
Using the Hive command line, you could create and use protect and access UDFs that extend
the Hive class UDF as either temporary or permanent UDFs.
Brief instructions related to using the Hive command line remain in this version of the Hive
Developer Template documentation:
• Creating and Calling Hive UDFs from the Hive Prompt (page 5-14)
Because jobs executed by HiveServer2 run as the hive system user, its doAs impersonation
property became very important, necessitating the introduction of a more complex, but more
capable, version of the protect and access UDFs that extend a different Hive class:
GenericUDF. When the doAs property is set to false, this new type of UDF is required in
order for Voltage SecureData authentication and authorization to work properly.
• Using the Generic Hive UDFs When Impersonation is Disabled (page 5-40)
Version 4.0 of the Hive Developer Template also introduced support for Apache Impala as a
way to create and run high-performance UDFs to query the Hive metastore database. This
scenario has running-as-user considerations similar to HiveServer2, but with the impala user
instead of the hive system user, and the added complication that you cannot use generic UDFs
with Impala due to incompatible method signatures.
For more information about these components and the related considerations, see the
following sections in this documentation:
• Major Changes in Hive 3.0 and the Script Changes They Required (page 5-18)
NOTE: Support for running Hive queries in the context of LLAP did not require any code
changes to the relevant Hive Developer Template Java classes (BaseHiveGenericUDF,
ProtectDataGeneric, and AccessDataGeneric). Therefore, although LLAP testing
began with version 4.1, any version of the Hive Developer Templates with these classes,
beginning with version 3.2, should work as is with LLAP.
For more information about executing Hive queries in the context of LLAP, see the following
section in this documentation:
For more information about using the Hive UDFs for unstructured binary data, see the
following section in this documentation:
As described in “Java Classes for the Hive UDFs” (page 5-13), the Hive Developer Template
provides Java source code that defines eight different classes that you can use to define
permanent and temporary UDFs with the create function and create temporary
function SQL commands, respectively. In the scripts provided with the Hive Developer
Template, such as create-hive-udf.hql and create-hive-perm-udf.hql, these SQL
commands are used to create UDFs with the same names as their corresponding classes, but
using all lowercase letters. For example:
create function
protectdata as 'com.voltage.securedata.hadoop.hive.ProtectData'
using
jar 'hdfs:///user/<username>/voltage/hiveudf/vibesimplejava.jar',
jar 'hdfs:///user/<username>/voltage/hiveudf/voltage-hadoop.jar';
Where <username> is the name of the user who will run the Hive queries.
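For comparison, a temporary (session-scoped) UDF can be created from the same class. The
following sketch runs the statements from the Hive prompt via hive -e, and assumes a Hive
version that accepts HDFS URIs with ADD JAR; the table name is a placeholder:
hive -e "
  ADD JAR hdfs:///user/<username>/voltage/hiveudf/vibesimplejava.jar;
  ADD JAR hdfs:///user/<username>/voltage/hiveudf/voltage-hadoop.jar;
  CREATE TEMPORARY FUNCTION protectdata AS 'com.voltage.securedata.hadoop.hive.ProtectData';
  SELECT protectdata(ssn, 'ssn') FROM <your_table> LIMIT 5;
"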
Continuing with that convention here, assume the creation of all eight possible Hive UDFs,
paired along three axes: A) protect operations versus access operations, B) formatted data
versus binary data, and C) extends the Hive class UDF versus extends the Hive class
GenericUDF:
Two of these axes divide the UDFs in ways that are more obvious (protect versus access) or are
explained in full elsewhere (extends the Hive class UDF versus extends the Hive class
GenericUDF; see "Using the Generic Hive UDFs When Impersonation is Disabled" on page 5-
40).
The third axis of differentiation (formatted data versus binary data) is worthy of additional
explanation in the remainder of this section:
NOTE: SST formats are only supported for the REST API.
Using the class-name-in-lowercase naming convention, the Hive Developer Template provides
four UDFs for working with formatted data:
These two pairs of UDFs have slightly different function signatures, but both pairs take the
same first two required parameters:
1. The formatted data to be protected or accessed, often in the form of a database column
name.
2. The name of a cryptId element in the configuration file vsconfig.xml that contains
information governing the protect or access operation to be performed.
CryptIds used for formatted data cryptographic operations must specify a format other
than AES, which is reserved for binary data cryptographic operations; otherwise the
following runtime exception will occur:
The UDFs in the first pair, protectdata and accessdata, also take an optional third
parameter, the API type, which, if present, will override the API specified for the cryptId (either
globally or with a cryptId-specific setting). Valid choices are: simpleapi and rest.
All four of these UDFs return the resulting ciphertext (protect operations) or plaintext (access
operations) as their return value.
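As an illustration of these signatures, the following Beeline sketch protects one column using
the API configured for its cryptId and a second column with the optional third parameter
overriding that API; the JDBC URL, table, and column names are placeholders, while the
cryptId names ssn and cc match the sample configuration:
beeline -u "jdbc:hive2://<hiveserver2-host>:10000/default" -e "
  SELECT protectdata(ssn, 'ssn'),
         protectdata(cc_number, 'cc', 'rest')    -- third parameter overrides the cryptId's API
  FROM customer LIMIT 10;
"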
Using the class-name-in-lowercase naming convention, the Hive Developer Template provides
six UDFs for working with binary data:
These three pairs of UDFs have slightly different function signatures, but all three pairs take
the same first two required parameters:
1. The binary data to be protected or accessed, often in the form of a database column
name.
2. The name of a cryptId element in the configuration file vsconfig.xml that contains
information governing the AES protect or access operation to be performed. IBSE/AES
operations use the special format keyword AES as the value of the format attribute of
all cryptId elements defined for use with the binary Hive UDFs.
CryptIds used for binary data cryptographic operations must specify the format AES;
otherwise the following runtime exception will occur:
AES (binary data) operations cannot be performed for regular
FPE (non-'AES') format: CryptId [<cryptId-details>]
NOTE: The identity attribute of the cryptId element is only relevant for protect
operations, which include the specified identity as part of the full identity packaged
with AES ciphertext to comprise the full IBSE ciphertext payload. This means that the
identity specified in a cryptId used for an access operation is ignored. Instead, the
required identity is retrieved from the IBSE/AES ciphertext itself.
Also note that the following attributes of the cryptId element are not relevant for
“AES” cryptIds and will be ignored if present:
• translatorClass
• translatorInitData
The UDFs in the first pair (protectbinarydata and accessbinarydata) and the third pair
(protectbinarydataimpala and accessbinarydataimpala) also take an optional third
parameter, the API type, which, if present, will override the API specified for the cryptId (either
globally or with a cryptId-specific setting). Valid choices are: simpleapi and rest.
All six of these UDFs return the resulting ciphertext (protect operations) or plaintext (access
operations) as their return value.
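A hedged usage sketch for the binary UDFs follows; it assumes a cryptId named aes whose
format attribute is AES, plus placeholder table and column names. For STRING input, the
returned ciphertext is Base64-encoded, as described later in this section:
beeline -u "jdbc:hive2://<hiveserver2-host>:10000/default" -e "
  SELECT protectbinarydata(comments, 'aes') FROM support_tickets LIMIT 10;
  SELECT accessbinarydata(protected_comments, 'aes', 'simpleapi') FROM support_tickets LIMIT 10;
"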
If you have chosen to use the REST API when using the Binary Hive UDFs, be aware that, by
default, the Voltage SecureData Server limits the size of its REST payload to 25 MB, which
includes the JSON syntax required to format the REST request. This is done to prevent
performance degradation.
NOTE: The Simple API, the default API, is strongly recommended over the REST API
when performing binary protect and access operations of large data, such as images and
video. It is not going to be efficient to send this type of plaintext and ciphertext data
across the network for processing.
If you exceed the Voltage SecureData Server’s size limit for Web Service data, you will see
the following generic socket exception on the client and message in the debug.log file for
the Web Service:
If you run into this limit, there are two ways to address it:
1. Switch this binary Hive UDF to use the Simple API by setting the api attribute of the
relevant cryptId element to simpleapi (or remove the api attribute from that
cryptId when the default API is set to simpleapi). This is the recommended
solution.
2. Get your Voltage SecureData administrator to increase the Web Service data size
limit for the Voltage SecureData Server.
While some Voltage SecureData IBSE APIs allow a choice of AES modes, the binary Hive
UDFs always use the Cipher Block Chaining (CBC) mode when protecting plaintext.
Interoperability with respect to the AES mode used to encrypt plaintext is achieved for
these APIs by recording the AES mode in the full identity that is included in the IBSE
envelope that accompanies the AES ciphertext. This allows the binary Hive UDFs to
properly decrypt IBSE ciphertext, even when it was encrypted using a different AES mode,
such as EMES.
The binary Hive UDFs operate with both the BINARY data type and the STRING data type.
If the input data to a binary Hive UDF is determined to be of type BINARY, such as for
images and audio/video clips, then the raw bytes in that data will be protected or accessed.
The input bytes are passed, as is, to the specified API (the Simple API or the REST API) and
the resulting output bytes are returned as follows:
• For protect operations, the resulting IBSE/AES ciphertext bytes are returned directly,
without Base64 encoding.
• For access operations, the resulting recovered plaintext, decrypted using the identity
and AES mode from the full identity packaged with the IBSE ciphertext input data, is
returned.
NOTE: When running the binary Hive UDFs in the context of Impala, note that the
BINARY data type is not supported. For more information about this Impala limitation,
see "Data Type Limitation" (page 5-21).
If the input data to a binary Hive UDF is determined to be of type STRING, such as for free-
form comments and SMS messages, then the input bytes from that input string data will be
protected or accessed with extra steps, including Base64 encoding and decoding to assure
that the ciphertext can be stored in that same (or a different) STRING column or variable, as
follows:
• For protect operations, the bytes from the input plaintext string are retrieved as
UTF-8. The bytes in this UTF-8 string are passed to the specified API (the Simple
API or the REST API) and the resulting output AES ciphertext bytes, which contain
the full identity constructed from the identity provided in the specified cryptId (the
second parameter to Hive UDFs) and the AES mode used for encryption (CBC), are
then Base64-encoded into the final output, a Java String.
• For access operations, the input ciphertext string is Base64-decoded to retrieve the
IBSE/AES ciphertext bytes, which are then passed to the specified API (the Simple
API or the REST API) for decryption using the identity and AES mode from the
contained full identity. The resulting bytes are a UTF-8 encoded version of the
original plaintext string, which is then used to build the output Java String to be
returned.
Ciphertext Expansion
When you use Voltage SecureData IBSE/AES encryption, the ciphertext is always larger
than the plaintext, based on the length of the identity specified for cryptographic key
derivation. This is because the full identity, which includes the specified identity and other
fields, such as the AES mode used for encryption (CBC), is included in the IBSE envelope
included with the ciphertext itself.
For BINARY plaintext data, this accompanying information will cause the ciphertext to be
from between 140 to 185 bytes larger than the plaintext, a modest increase when the data
is an image or an audio/video clip.
For STRING plaintext data, the IBSE envelope overhead remains the same (140-185
additional bytes), but the Base64 encoding of the IBSE ciphertext adds very close to an
additional 33% to the size of the final ciphertext (every 3 bytes of data is expressed as 4
bytes when Base64-encoded, plus possibly one or two added dummy bytes to make the
total number of Base64-encoded output bytes divisible by 4). You must take this into
account when storing the result of a protect operation in a database column or SQL variable
in order to avoid errors or worse, truncation of the ciphertext, making recovery of the
original plaintext impossible.
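To make the arithmetic concrete, the following sketch estimates the stored size for a
hypothetical 300-byte STRING plaintext, using a mid-range envelope size of 160 bytes (the
actual overhead falls between roughly 140 and 185 bytes):
plain=300; envelope=160
ibse=$((plain + envelope))            # raw IBSE/AES ciphertext bytes
b64=$(( (ibse + 2) / 3 * 4 ))         # length after Base64 encoding (4 output characters per 3 input bytes)
echo "$plain plaintext bytes -> $ibse IBSE bytes -> $b64 Base64 characters"    # prints 300 -> 460 -> 616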
Interoperability with other Voltage SecureData APIs that support the protection of binary
data depends on a couple of important factors:
First, while different Voltage SecureData APIs might or might not allow a choice of AES
modes when protecting binary data (the binary Hive UDFs do not allow a choice, always
using the CBC mode), interoperability between these clients is achieved in this regard by
including enough information as part of the IBSE ciphertext to make that ciphertext self-
describing with respect to the information needed to decrypt it. This includes the AES mode
used to encrypt the AES ciphertext as well as the identity required to derive the relevant
AES cryptographic key.
Second, due to the nature of their data storage mechanisms and/or message format, some
Voltage SecureData APIs perform Base64 encoding of their IBSE/AES ciphertext. For
example:
• The REST API always Base64 encodes the IBSE/AES ciphertext it returns for a
protect operation because it needs to be sure that ciphertext byte values are
represented using characters appropriate for the JSON syntax used in the HTTP
response. This API even requires its binary plaintext to be protected to be Base64-
encoded for the same reason (regardless of whether it is really just string data): so
that it can be safely transported in the JSON syntax of an HTTP request.
• Database-based APIs, such as the Teradata UDFs and the Hive UDFs (being
described here), selectively Base64 encode the IBSE/AES ciphertext they return,
based on the data type of the data they are protecting and the fact that the IBSE/
AES ciphertext will be returned as the same data type. In particular, some byte
values in the IBSE/AES ciphertext might not be acceptable in string data types, an
issue solved by Base64 encoding the IBSE/AES ciphertext.
Likewise, the Hive UDFs Base64 encode IBSE/AES ciphertext originating from
STRING data so that it can safely be returned as STRING data. IBSE/AES ciphertext
originating from BINARY data is not Base64-encoded because any byte value will be
acceptable in the BINARY data being returned.
• The Simple API, on the other hand, expects its plaintext as an array of bytes when
performing an IBSE/AES protect operation (unsigned char* in C and byte[] in
Java and C#). When the data to be protected is inherently binary, such as an image
or video clip, it will already be an array of bytes. When the data to be protected is a
string, such as a free-form text field, you must make sure that the characters in the
string are represented using the UTF-8 encoding when they are interpreted as a
sequence of bytes (UTF-8 is the character encoding used by all Voltage SecureData
APIs when retrieving bytes from a string to encrypt using IBSE/AES).
The Simple API does not do any Base64 encoding or decoding as an automatic part
of protect and access operations, respectively (although it does provide separate
APIs for performing Base64 encoding and decoding). Therefore, when using the
Simple API to create IBSE/AES ciphertext that will be accessed by other Voltage
SecureData APIs, you must consider whether those other APIs will expect that
ciphertext to also be Base64-encoded, and if so, which component is responsible for
that separate step, and the mechanism by which that component will know whether
the Base64 encoding step should be performed (such as yes for string data but no
for binary data).
Kerberos authentication is not (yet) available for the underlying IBSE/AES APIs used by the binary Hive UDFs. If you configure Kerberos as the authentication method for a cryptId that specifies AES as its format, you will get an exception at runtime.
NOTE: Updating to a newer version of the Simple API that supports Kerberos
authentication for IBSE/AES cryptographic operations will automatically allow the binary
Hive UDFs to use Kerberos authentication for that API. In order to use the 4.2 version of
the binary Hive UDFs with a newer version of the REST API that supports Kerberos
authentication for IBSE/AES cryptographic operations, you will need to make a minor
source code modification to the Hive Developer Template source code and rebuild. For
more information about the required change, contact Micro Focus Data Security support.
The Java package com.voltage.securedata.hadoop.hive and its associated Java source code provide the classes that implement the Hive integration.
The Hive Developer Template demonstrates the integration of Voltage SecureData data
protection processing through the use of Hive UDFs. UDFs provide a ready-to-use integration
mechanism that is implemented by extending the Hive base classes UDF and GenericUDF,
providing an implementation of their method evaluate.
NOTE: The more complex “generic” versions allow dynamic retrieval of the user running a query by using the Hive SessionState API. This is useful in contexts where the query is executed as the hive system user, such as when using Beeline/HiveServer2 with the HiveServer2 setting doAs set to false.
• Formatted text data (FPE and SST) when run as a particular user and when run as the
Hive system user and HiveServer2 impersonation is enabled (HiveServer2 setting
doAs is set to true):
Classes:
com.voltage.securedata.hadoop.hive.ProtectData
com.voltage.securedata.hadoop.hive.AccessData
• Formatted text data (FPE and SST) when run as the Hive system user and HiveServer2 impersonation is disabled (HiveServer2 setting doAs is set to false):
Classes:
com.voltage.securedata.hadoop.hive.ProtectDataGeneric
com.voltage.securedata.hadoop.hive.AccessDataGeneric
• Binary data (IBSE/AES) when run as a particular user and when run as the Hive system
user and HiveServer2 impersonation is enabled (HiveServer2 setting doAs is set to
true):
Classes:
com.voltage.securedata.hadoop.hive.ProtectBinaryData
com.voltage.securedata.hadoop.hive.AccessBinaryData
• Binary data (IBSE/AES) when run as the Hive system user and HiveServer2
impersonation is disabled (HiveServer2 setting doAs is set to false):
Classes:
com.voltage.securedata.hadoop.hive.ProtectBinaryDataGeneric
com.voltage.securedata.hadoop.hive.AccessBinaryDataGeneric
Much of the important, shared logic is implemented within the helper class HiveUDFHelper.
Code in this class supports the client/server architecture of Beeline/HiveServer2 by performing
per-user caching of configuration settings with a short (10 second) refresh interval. This
ensures that:
• Users are using their own configuration settings, whether this is controlled by the HiveServer2 setting doAs being set to true or by the generic variants of the Hive UDFs being used (see "Using the Generic Hive UDFs When Impersonation is Disabled" on page 5-40).
• Any (per-user) configuration changes are picked up quickly without forcing a refresh
during a multi-UDF query. Per-user caching combined with the regular and generic
forms of the Hive UDFs allows full support for multi-user/multi-session cases under
Beeline/HiveServer2, with each user loading and using their own current configuration
settings.
Like all of the Hadoop Developer Templates, the Hive Developer Template uses a number of
the packages in the shared integration architecture, including the crypto package abstraction
layer (com.voltage.securedata.crypto) that isolates calls to the Voltage SecureData
APIs that perform the actual cryptographic operations (the Simple API and the REST API). For
more information, see "Shared Integration Architecture" (page 3-54) and "Cryptographic
Abstraction" (page 3-68).
1. At the Hive prompt, add the JAR files required by the UDFs:
hive> add jar ../simpleapi/vibesimplejava.jar;
hive> add jar voltage-hadoop.jar;
NOTE: The JAR file voltage-hadoop.jar is built as an uber JAR file that contains
vsrestclient.jar and its JSON and HTTP Client library dependencies.
2. Specify the ProtectData and AccessData classes as UDFs (shown at two prompts to
enhance readability):
hive> create temporary function accessdata as
> 'com.voltage.securedata.hadoop.hive.AccessData';
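The shipped script create-hive-udf.hql contains the full set of these statements. For example, the corresponding statement for the protect UDF, using the ProtectData class and the protectdata function name used in the queries later in this chapter, is:
hive> create temporary function protectdata as
> 'com.voltage.securedata.hadoop.hive.ProtectData';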
3. Run HQL queries against the data in the Hive tables. For example:
hive> SELECT id, name, accessdata(cc, 'cc')
> FROM voltage_sample WHERE id <= 5;
NOTE: In this example, the second parameter, 'cc', is a cryptId name, which acts as a way to
look up a specific group of settings in the configuration file vsconfig.xml. It can be the
same as the column name, but it does not necessarily have to be the same.
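For example, in the sample data the birth_date column is accessed with the cryptId date, so a query that uses a cryptId different from the column name follows the same form as the query above:
hive> SELECT id, name, accessdata(birth_date, 'date')
> FROM voltage_sample WHERE id <= 5;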
The script run-hive-join-query.hql performs all three of these steps, including a table JOIN query at the end.
NOTE: The Hive Developer Template, as shipped, does not assume that the Simple API and
the REST API support FPE2 extended characters, as suggested by the commented out call
to the UDF accessdata with the cryptId extended, above. To access the name field using
the cryptId extended, remove the comment designators (--) from the beginning of that
line (and to avoid retrieving the ciphertext version as well, remove the s.name, from the
line above).
When UDF calls use the REST API, each accessed data row in the query requires a network round-trip. This will have a significant performance impact if the query processes a large number of rows. Try to limit any such UDF calls to a relatively small number of processed rows, such as by using a WHERE clause or other such filter.
For UDF calls that use the Simple API, there is no such performance impact because each Simple API operation is performed locally as an individual call. However, because the Simple API is loaded through JNI, UDF calls on literal values are subject to class loader issues.
For example, the following UDF call on a literal value (not a column), using the Simple API, will
fail:
hive> SELECT id, name, email FROM voltage_sample
> WHERE email = protectdata('[email protected]',
> 'alpha');
The error message indicates a failure to initialize the Simple API, with the following JNI error:
java.lang.UnsatisfiedLinkError: com.voltage.toolkit.
vtksimplejavaJNI.LibraryContext_LIB_CTX_NO_STORAGE_get()J
If you run into this situation, where you want to call the UDF on a literal input value instead of a
column, you have two options:
Option 1:
Use the REST API for the UDF calls on literal values. Since these Web Services calls do
not use JNI, they do not have any class loader issues. Note that the protect and access
UDFs have an advanced feature that allows you to pass an optional third argument to
the call, explicitly specifying the API type: either rest or simpleapi. If provided, this
explicit API type overrides the default one specified for the cryptId in the configuration
file vsconfig.xml. If you pass in rest for the third parameter in the UDF call, the data
protection processing will be performed using the REST API.
NOTE: Since this is a single protect or access operation on a literal value, there is no
significant performance impact for using the REST API for this operation. Only one
value is being processed in this case, so multiple remote calls are not required.
For example:
hive> SELECT id, name, email FROM voltage_sample
> WHERE email =
> protectdata('[email protected]', 'alpha', 'rest');
Alternatively, you can define a separate cryptId whose default API in the configuration file vsconfig.xml is the REST API. For example:
hive> SELECT id, name, email FROM voltage_sample
> WHERE email =
> protectdata('[email protected]', 'alpha-rest');
You may need to do this if you have a custom translator (such as the class SimpleAPIDateTranslator) configured for the cryptId when using the Simple API, but that is not appropriate when using the REST API. A good solution is to define duplicate cryptIds for use with a given column, each specifying a different API, with the one using the Simple API also specifying the required translator class.
Option 2:
Copy the two JAR files used by the UDFs (vibesimplejava.jar and voltage-hadoop.jar) into the hive/lib directory on the relevant nodes, so that the Simple API’s native library is loaded by the correct (parent) class loader. Once these two JAR files are copied to these hive/lib directories, you will probably
need to restart the Hive service in order for it to find and use the JAR files you have
added. This is true when using Beeline and when using some newer versions of the Hive
prompt, which operate using HiveServer2. For some older versions of the Hive prompt,
starting up the Hive prompt may be sufficient. In any event, continue by following the
usual steps to create and call the relevant Voltage SecureData UDF. This time, the call to
the UDF for literal values will work, even when using the Simple API, because the native
library is loaded from the correct (parent) class loader.
Major Changes in Hive 3.0 and the Script Changes They Required
Some newer Hadoop distributions, such as HortonWorks HDP 3.0.0, have upgraded to Hive
3.0, which includes important architectural changes that affect the Hive Developer Template.
For example, in previous versions of Hive, running the Hive command line would launch a local
session and perform basic operations relative to the node upon which it was started. In previous
versions of the Voltage SecureData for Hadoop Developer Templates, the Hive Developer
Template relied on this fact in its use of resources on the local node’s file system (such as JAR
files and input data files).
In Hive 3.0, running the Hive command line is equivalent to running Beeline: sessions automatically connect to the remote HiveServer2 service. This means that any references to local files would resolve relative to the remote node that is running the HiveServer2 service, not to the current local node.
There are also changes to how tables are created and managed in Hive 3.0. While the two table
types, managed and external, are not new in Hive 3.0, the access control of managed tables has
been restricted such that only the Hive service can freely access and manipulate the data in
managed tables. In addition, the default file format for managed tables is now Optimized Row
Columnar (ORC).
These changes in Hive 3.0 affect the Hadoop Developer Templates in the following ways:
• The JAR files and input data files used by the Hive Developer Template can no longer
reside on the local file system of the node on which the Hive command line is launched.
The solution is to copy these files to well-known locations in HDFS (this same solution is
used elsewhere in the Hadoop Developer Templates).
• Unless they are granted full permissions to the HDFS table directory (/warehouse/
tablespace/managed/hive), no user other than Hive can create managed tables
using the CREATE TABLE command.
• The Hive Developer Template has always used CSV files as the input data file type. However, unless you change the setting hive.default.fileformat.managed from the default file format ORC to the file format TEXTFILE, the LOAD DATA command will not be able to load the Hive Developer Template’s CSV input data files (see the example following this list).
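For example, assuming your distribution allows this setting to be changed at the session level (otherwise, change it cluster-wide, for example through Ambari or hive-site.xml), you could switch the default managed-table format before creating and loading the tables:
hive> SET hive.default.fileformat.managed=TEXTFILE;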
In order to accommodate the possibility that you are using a Hadoop distribution that uses
Hive 3.0, with the types of configurations described above, the script create-hive-
table.hql in the Hive Developer Template includes alternative, commented out, table
creation commands that create external tables with explicit HDFS locations specified for the
table data. The specification of explicit HDFS locations allows the cleanup script to find the
table data files for explicit deletion, something not done automatically when an external table is
dropped. For example, the first of the three commented-out table creation commands in this
HQL script is as follows, with the relevant comment-designators and keyword changes
highlighted:
-- CREATE EXTERNAL TABLE
-- voltage_sample
-- (id INT, name STRING, street STRING,
-- city STRING, state STRING, postcode STRING,
-- phone STRING, email STRING, birth_date STRING,
-- cc STRING, ssn STRING)
-- ROW FORMAT DELIMITED FIELDS TERMINATED BY ","
-- LOCATION '/user/<username>/voltage/hive-sample-tables/volt...
If you are using Hive 3.0, as will be the case if you are using HDP 3.0.0, when you are editing
this script to replace instances of the <username> placeholder, as described in "Edit the Hive
Scripts to Replace <username> Placeholder" (page 5-24), you will want to comment out the
original three CREATE TABLE commands and uncomment the three replacement CREATE
EXTERNAL TABLE commands (and replace their instances of the <username> placeholder).
NOTE: When the configuration setting doAs is set to false and all Hive queries are run as the hive system user, you must make sure that the hive user has full permission to access:
• The JAR files and input data files required by the Hive Developer Template, and,
• The directory specified by the LOCATION directive shown above, to which the table
data files will be moved when the tables are created.
With respect to the latter permissions, you can omit the LOCATION directive altogether, resulting in the table data files being moved to the default external table location: /warehouse/tablespace/external/hive. However, if you do this, the cleanup script will fail to find the table data files for explicit deletion.
Apache Impala is designed to:
• Avoid start-up overhead by utilizing daemon processes that are always running,
• Distribute work so that the daemon processes work on data that exists in the local file
system, avoiding network overhead, and,
• Avoid MapReduce jobs, such as those used during most Hive queries.
Single instances of two additional types of Impala daemons, the Impala Statestore daemon and
the Impala Catalog Service daemon, help manage Impala queries.
The Hadoop Developer Templates include scripts for creating and running Impala UDFs. For information about running these scripts to perform Impala queries, see "Running Queries Using Apache Impala" (page 5-44).
Authentication Limitation
The biggest limitation when using Impala concerns fine-grained, role-based authentication. Without also using Apache Sentry, all Impala jobs are run as the user impala. And if Sentry is enabled for Impala, it must also be enabled for Hive, which then prevents the HiveServer2 setting doAs from being set to true (the default), thereby preventing queries from being run as the current user rather than the hive user (for more information about this setting, see "Enable Impersonation for Beeline and Remote Hive Queries" on page 5-26).
This Impala limitation means that if Sentry is not used, regardless of which user is running
the Impala shell or another Impala command, Impala does all read and write operations with
the privileges of the impala user, who must therefore have the appropriate read and read/
write access to the related resources, such as JAR files and output directories, respectively.
If you intend to use Impala in a production environment, and given the security
requirements typically required, you will need to decide about the right compromises with
respect to the use of Sentry for Impala authorization versus the use of Hive impersonation
using the HiveServer2 doAs setting.
Also note that the Generic Hive UDFs, as described in "Using the Generic Hive UDFs When
Impersonation is Disabled" (page 5-40), cannot be used with Impala due to incompatible
function signatures.
For information about running the scripts associated with Impala, see "Running Queries
Using Apache Impala" (page 5-44).
The BINARY data type is not supported by Impala for table columns, UDF arguments, and
UDF return values. Since binary data cannot be stored in an Impala table, the Hive binary
UDFs cannot be used to protect binary data when using Impala.
However, Impala does support the STRING data type, which is the other input (and output)
data type supported by the Voltage SecureData binary Hive UDFs. This offers the
possibility of several types of work-arounds, one of which is demonstrated by the Impala scripts that create and populate an Impala table with protected binary data (create-impala-table.sql), and that access that protected binary data (run-impala-binary-query.sql). These scripts work with Base64-encoded binary data, as a STRING.
NOTE: If you choose to implement this approach to protecting and accessing binary data
using Impala, you must bear in mind several Impala limitations that apply:
• Both the STRING data type and entire rows of data are limited to 2 GB. If your
binary plaintext to be protected is Base64-encoded to be stored in an Impala table
as a STRING, it will have already grown by approximately 33%. As described
elsewhere in this guide (see "Ciphertext Expansion" on page 5-10), protecting a
STRING using the Voltage SecureData binary Hive UDFs will add an additional 33%
for Base64 encoding the STRING AES ciphertext (and its IBSE envelope) to the
starting size of the plaintext. This double Base64 encoding adds its 33% overhead
to the original binary data twice by the time it is protected. Given that the
ciphertext result is returned as an Impala STRING, this results in an effective
maximum size of the original binary data that is closer to 1 GB than to 2 GB.
• If your data is going to be stored in a Parquet file, the limit is even lower, 1 GB,
and you will need to make the same type of calculations with respect to ciphertext
expansion.
There are two classes of configuration settings used by the Hive Developer Template: the authentication settings in the XML configuration file vsauth.xml and the data protection settings (including cryptIds) in the XML configuration file vsconfig.xml.
NOTE: The third class of configuration settings required by the Hive Developer Template in releases prior to version 4.1 is no longer necessary. This third class was
previously used to define aliases, which served much the same purpose as cryptIds in
version 4.1 and higher. In versions of the Hive Developer Template prior to version 4.1, the
second parameter to Hive UDFs was the name of an alias defined in the configuration file
vsconfig.properties (or vshive.properties). Beginning with version 4.1, the
second parameter to Hive UDFs is the name of a cryptId, specified in the XML configuration
file vsconfig.xml (or vshive.xml).
Before you begin to modify the Hive Developer Template XML configuration files for your own
purposes, such as using your own Voltage SecureData Server or different data formats, Micro
Focus Data Security recommends that you first run the Hive Developer Template as provided,
giving you assurance that your Hadoop cluster is configured correctly and functioning as
expected.
You will need to update many of the values in the XML configuration files vsauth.xml and
vsconfig.xml in order to protect your own data using your own Voltage SecureData Server.
• The emphasis in the paragraph above, before you start protecting Social Security
numbers, highlights the fact that you cannot protect the SSN plaintext using the cryptId
ssn and then expect to access the SSN ciphertext using the cryptId ssn-sst.
• The new cryptId has been given a different name: ssn-sst. While not strictly required (you could have modified the existing cryptId), this was done to make the type of operation clearer within any scripts you might use to call a Hive Developer Template UDF with this cryptId (ssn-sst) as its second parameter (see the example following this list).
• The format attribute of the new cryptId is set to ssn-tokens, which is an SSN
tokenization format defined on the public-facing Voltage SecureData Server
dataprotection maintained by Micro Focus Data Security.
• The api attribute of the new cryptId is set to rest, indicating that the REST API should
be used to perform the tokenization.
NOTE: In order to use the REST API, your Voltage SecureData Server must be
version 6.0 or higher.
• Finally, it is important to note that now that you have switched SSN protection from local
FPE to remote SST, the same performance warning as given in the comments for credit
card data now applies to SSNs as well.
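For example, because the new cryptId specifies the REST API, it can even be used to tokenize a literal value directly (the SSN shown here is illustrative, and the protectdata UDF must already have been created as described later in this chapter):
hive> SELECT protectdata('123-45-6789', 'ssn-sst');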
NOTE: Remember that if you edit the versions of the configuration files vsauth.xml or
vsconfig.xml on the local file system, you must copy the updated versions to HDFS,
which is where the Hadoop Developer Templates will access them when they are run. For
instructions, see "Loading Updated Configuration Files into HDFS" (page 3-77).
The bin directory includes scripts that you can run to create Hive tables and run Hive queries,
including scripts that use Apache Impala. Also note that with the addition of support for
Beeline/HiveServer2 in the Hive Developer Template, the Hive command line has been de-
emphasized.
This section provides instructions for setting up to run, and then running, the Hive UDFs
provided by the Hive Developer Template in a variety of ways. This includes as temporary
UDFs, as permanent UDFs, from the Hive command line, using Beeline and HiveServer2 both
with and without impersonation, and running the Hive UDFs from a remote computer using
ODBC and JDBC.
• Performing some common HDFS setup steps required by all of the Hadoop Developer
Templates. For more information, see “Common HDFS Procedures for the Hadoop
Developer Templates” (page 3-75).
• Running the MapReduce Developer Template to create a CSV file with protected values
that can be loaded into the Hive metastore database. For more information, see
“Running the MapReduce Template” (page 4-6).
NOTE: By adopting this approach, rather than including a pre-packaged Hive input
file with the Voltage SecureData for Hadoop Developer Templates installation, the
Hive template will work just as well with your own Voltage SecureData Server as with
the public-facing Voltage SecureData Server dataprotection, the default Voltage
SecureData Server specified in the configuration file vsconfig.xml.
After completing these steps to create the input data expected by the Hive Developer
Template, several other Hive-specific setup steps are required, as explained in the following
sub-sections:
• Copy the Required JAR Files to Your Data Nodes (page 5-25)
• Create the Hive Tables for the Hive Developer Template (page 5-25)
• Enable Impersonation for Beeline and Remote Hive Queries (page 5-26)
The Hive scripts in the bin directory that require editing to replace the <username> placeholder are:
• create-hive-table.hql
• create-hive-udf.hql
• create-hive-perm-udf.hql
• run-hive-join-query.hql
• run-hive-binary-query.hql
Edit these Hive scripts using a text editor of your choice and change all non-comment
instances of the <username> placeholder to the actual username.
The hive/lib directory is where the Hive service is installed on the data nodes in your cluster.
For example, for Hortonworks HDP 2.6.4, the Hive service is installed in the directory /usr/
hdp/2.6.4/hive/lib.
NOTE: As is the case for all of the Hadoop Developer Templates, the full Simple API package
(including the file libvibesimplejava.so and the trustStore directory) must be
installed on all nodes in your Hadoop cluster that will be running the UDF. In the case of
HiveServer2 and simple (non-join) queries, that would usually be the node(s) in the cluster
running the HiveServer2 service itself. But because the UDFs may launch MapReduce jobs in
some Hadoop distributions, it is important to install the Simple API package on all data
nodes in your Hadoop cluster, not just these HiveServer2 nodes.
And, as usual, the Simple API installation location must be the same on all data nodes, as
specified using the installPath attribute of the simpleAPI element in the configuration
file vsconfig.xml.
This creates several Hive tables that contain the protected data, including protected
binary data, used in subsequent queries using the Hive Developer Template.
NOTE: In some environments, particularly when access to the Hive tables is secured using
Sentry or Ranger, impersonation is disabled by setting the property doAs to false. In this
case, because the jobs will be run as the hive system user, the regular Hive UDFs provided
with the Hive Developer Template cannot determine the user running the query. The Hive
Developer Template addresses this situation by providing two additional UDF
implementations, ProtectDataGeneric and AccessDataGeneric. For more information,
see "Using the Generic Hive UDFs When Impersonation is Disabled" (page 5-40).
Depending on your Hadoop distribution, you can view and change the doAs setting for the
HiveServer2 service using Ambari or the Hive configuration XML file.
NOTE: The specific location of this setting depends on your Hadoop distribution. For
example, under HDP 2.6, it is in the Advanced hive-interactive-site section on the
Advanced sub-tab, with the setting name Run as end user instead of Hive user.
3. If you changed this value, click Save at the bottom of the page, and then restart the
HiveServer2 service.
1. Locate the file hive-site.xml. In some distributions, the file is in the following
location:
/etc/hive/conf/hive-site.xml
NOTE: Before running Hive queries locally from a node within your Hadoop cluster, you
must complete the setup steps (except the final one) described in “Setting Up to Run the
Hive Developer Template” (page 5-24). The final setup step, which concerns the
HiveServer2 setting doAs, is not relevant when running Hive queries locally from a node
within your Hadoop cluster. This is because Hive queries that run locally are run as the
interactive user and the Hive UDFs will automatically look for the configuration files in the directory /user/<interactive_user>/voltage/config.
1. If necessary, navigate to the bin directory and type hive to access the Hive prompt.
When this script completes, the first 10 rows of the sample data are shown on the
console.
1. If necessary, navigate to the bin directory and type hive to access the Hive prompt.
When this script completes, the first 10 rows of the sample data are shown on the
console.
1. If necessary, navigate to the bin directory and type hive to access the Hive prompt.
3. Run a simple HiveQL query interactively to access protected data fields in the table:
hive> SELECT id, name,
> accessdata(email, 'alpha'),
> accessdata(birth_date, 'date'),
> accessdata(cc, 'cc'),
> accessdata(ssn, 'ssn')
> FROM voltage_sample
> WHERE id <= 10;
When this script completes, the first 10 rows of the sample data are shown on the
console.
NOTE: If you run this command on a Cloudera CDH 5.2 or 5.3 distribution, the
following messages are generated by the Hive JOIN query:
These messages do not affect the sample query, and can be ignored.
NOTE: Before running Hive queries using the Beeline shell, you must complete the setup
steps described in “Setting Up to Run the Hive Developer Template” (page 5-24). This
includes the final setup step, which concerns the HiveServer2 setting doAs that controls
impersonation. This is required because HiveServer2 runs as the hive system user.
Start the Beeline shell and run queries by performing the following steps:
1. Switch the current login session to the user that you want to run the Beeline shell as (<user>), picking up that user’s home directory and environment variables:
su - <user>
3. Start the Beeline shell (shown on two lines for improved readability; a filled-in example follows the parameter descriptions below):
beeline -n <user> -p <password>
-u "jdbc:hive2://<host>:<port>/<database>;principal=<principal>"
Where:
• <user> and <password> are the username and password values, respectively, to connect to the specified database through HiveServer2.
• <host> is the hostname of the node in your cluster that is running the HiveServer2 service.
• <port> is the relevant port number on the HiveServer2 host. For example, 10000.
• <database> is the name of the relevant Hive database. For example default.
• <principal> is the Hive service principal name, including the @<REALM> part.
NOTE: This part of the command to start the Beeline shell is only needed if the Hadoop cluster or Hive service is Kerberized. Otherwise, it can be omitted.
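For example, with illustrative values for the user, host, and Kerberos principal (shown on two lines for readability):
beeline -n alice -p <password>
-u "jdbc:hive2://hs2node.example.com:10000/default;principal=hive/[email protected]"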
4. Create the temporary or permanent Voltage SecureData UDFs by running one of the UDF creation scripts:
!run create-hive-udf.hql
or
!run create-hive-perm-udf.hql
5. Run a simple query using the temporary or permanent accessdata UDF you just
created. For example:
SELECT id, accessdata(name, 'name'),
accessdata(cc, 'cc'),
accessdata(ssn, 'ssn')
FROM voltage_sample
WHERE id <= 10;
6. Run a join query using the temporary or permanent accessdata UDF you just created.
For example:
SELECT s.id, accessdata(s.name, 'name'),
accessdata(s.ssn, 'ssn'),
accessdata(s.cc, 'cc'),
cs.creditscore
FROM voltage_sample s
JOIN voltage_sample_creditscore cs
ON (s.ssn = cs.ssn)
WHERE s.id <= 10;
• Creating Permanent Hive UDFs Using the Hive Command Line (page 5-31)
NOTE: Before running Hive queries from a remote computer outside of the Hadoop cluster, you
must complete the setup steps described in “Setting Up to Run the Hive Developer
Template” (page 5-24). This includes the final setup step, which concerns the HiveServer2
setting doAs that controls impersonation. This is required because HiveServer2 runs as the
hive system user.
NOTE: The Cloudera CDH 5.1 distribution does not support permanent UDFs, which means
that remote queries are not supported.
After performing these procedures, the JAR file hive-client-test.jar and the Java
Properties file vshive.properties, along with script files for running a sample Hive query,
can be copied to any computer on which the Java Runtime Environment (JRE) is installed.
• The sub-directory src includes sample code to run a Hive query with a UDF call via
remote JDBC connection from a client outside of the Hadoop cluster.
• The sub-directory bin/lib includes the file README.txt that contains a list of the
Hadoop JAR files that you add to this directory, which are needed to run the scripts.
• The file bin/hiveclient is a script that you can run on a Linux machine to
perform a remote Hive query via JDBC
• The file bin/hiveclient.bat is a script that you can run on a Windows machine
to perform a remote Hive query via JDBC
Copy this entire directory from the installation computer to your build computer.
You must generate a Hive client JAR file using the files in the directory <install_dir>/
clientsamples/jdbc. To generate this file:
1. Ensure that both mvn and javac are in your path, and that JAVA_HOME is set to the
installation location of version 8 of the JDK.
Element: repo.id
This element provides the ID used to identify the remote repository in the parent POM file.
Element: repo.url
This element provides the full URL of the remote repository from which the
relevant dependency JAR files will be pulled. Standard repository URLs for
Hortonworks, Cloudera, MapR, and EMR, respectively, are as follows (where
appropriate, shown on two lines to improve readability):
• http://repo.hortonworks.com/content/groups/public
• https://repository.cloudera.com/artifactory/
cloudera-repos
• https://repository.mapr.com/nexus/content/groups/
mapr-public/
• https://<s3-endpoint>/<region-ID>-emr-artifacts/
<emr-release-label>/repos/maven/
For example:
https://s3.us-west-1.amazonaws.com/
us-west-1-emr-artifacts/emr-5.30.0/repos/maven
Element: log4j.version
This element provides the version of the Apache Log4j library that you want
to use.
Element: slf4j.api.version
This element provides the version of the slf4j API library that you want to
use.
Element: slf4j.log4j.version
This element provides the version of the slf4j log4j-12 library that you want
to use.
Element: http.client.version
This element provides the version of the Apache HttpClient library that you want to use.
Element: http.core.version
This element provides the version of the Apache HttpCore library that you want to use.
Element: commons.logging.version
This element provides the version of the Apache Commons Logging library
that you want to use.
Element: hadoop.common.version
This element provides the version of the Apache Hadoop Common library
that you want to use.
Element: hive.exec.version
This element provides the version of the Apache Hive Exec library that you want to use.
Element: hive.jdbc.version
This element provides the version of the Hive JDBC library that you want to
use.
Element: hive.service.version
This element provides the version of the Hive Service library that you want to
use.
You must, at minimum, update this configuration file to point to your Hive server and to run
as a specific username. You can also configure the following additional settings in this
configuration file:
JDBC Driver
jdbc.driver = org.apache.hive.jdbc.HiveDriver
JDBC URL
If the Hive server is listening on a port other than 10000, replace that value as well.
You can also change the database name to the actual name, rather than using the
value default.
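The shipped properties file contains a corresponding URL entry. A sketch of what it might look like, following the pattern of the jdbc.driver entry above (the property name and placeholder host shown here are illustrative), is:
jdbc.url = jdbc:hive2://<your-hiveserver2-host>:10000/default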
Username
Add the username under which you will run the query.
Password
If needed, add the password for the username under which you will run the query.
Query
This default value shows an example of running a query to access protected fields,
using the Voltage SecureData 'accessdata' UDF. You can customize this value to run
a different test query on the server.
Display Fields
This default value shows an example of how to display fields that were accessed
using the default query. You can customize this value if you want to display different
fields from a query.
• commons-logging-<version>.jar
• hadoop-common-<version>.jar
• hive-exec-<version>.jar
• hive-jdbc-<version>.jar
• hive-service-<version>.jar
• httpclient-<version>.jar
• httpcore-<version>.jar
• log4j-<version>.jar
• slf4j-api-<version>.jar
• slf4j-log4j12-<version>.jar
See the documentation for your specific Hadoop distribution for the location of
these files.
• If you are running the sample query from Linux, navigate to the directory bin
and run the script hiveclient. This script sets up the classpath and runs the
Java program.
> cd bin
> chmod +x hiveclient
> ./hiveclient
• If you are running the sample query from Windows, navigate to the folder
<install_dir>\clientsamples\jdbc\bin and double-click the file
hiveclient.bat to set up the classpath and run the Java program.
The first 10 lines of the query display, with the output from the first row similar to
the following:
Row: 0
s.id: [1]
s.name: [Fabien Baillairgé]
email_decrypted: [[email protected]]
bd_decrypted: [3/2/2007]
cc_decrypted: [5225629041834450]
ssn_decrypted: [675-03-4941]
cs.creditscore: [621]
The most secure way to access unprotected data from a remote computer is by running
queries against the Voltage SecureData UDFs. However, some Business Intelligence tools
do not allow the direct query of any Hive UDFs. In this case, you can create Hive views to
access the unprotected data.
CAUTION: The entire set of unprotected data in the cluster is available to any user who
has access to the view, which is a potential security risk. Be aware that the level of
security of the cluster is reduced if you create and run queries using Hive views.
To create the sample view in Hive and run a query from that view:
FROM voltage_sample s
JOIN voltage_sample_creditscore cs
ON (s.ssn = cs.ssn);
• Permanent Hive UDFs on your cluster. See "Creating Permanent Hive UDFs Using the
Hive Command Line" (page 5-31) for instructions.
• A working connection between the computer running Microsoft Excel and your Hadoop
cluster.
• The correct version of the Hive ODBC driver installed on the computer you are using to
run Microsoft Excel.
NOTE: You must use the 32-bit version of the driver if you are using a 32-bit version
of Excel, even if your computer is 64-bit.
You can obtain the correct ODBC driver from the provider of your Hadoop distribution:
• Cloudera
• Hortonworks
• MapR
This launches Microsoft Excel, with the Select Data Source dialog box open.
2. Click the Machine Data Source tab on the dialog box, and then choose the data source
name that corresponds to the driver that you downloaded for your distribution.
For example, if you downloaded the ODBC driver for Hortonworks, the data source
name is Sample Hortonworks Hive DSN.
3. Click OK.
This opens an ODBC driver connection dialog. Note that the exact title of the dialog box
varies, depending on which ODBC driver you are using.
4. In the Host field, enter the hostname for the server in your cluster that is running the
HiveServer2 service.
5. In the User Name field, type the user name under which you want to run the query.
6. In the Password field, type the password value associated with the specified user name.
If your cluster does not have a password configured for the user, do one of the following:
• Click inside the Mechanism field, and then scroll up to display User Name.
Choosing User Name in this field disables the Password field. It also clears the
User Name field, which means that you must re-enter the user name.
If the connection is valid, you see a message indicating success. If you do not see the
success message, you cannot proceed until you have established a valid connection.
8. Click OK. A message displays at the bottom of the Excel file indicating that it is waiting
for the query to be executed. This might take up to several minutes.
When the query completes, the Excel file is populated with plaintext values for email,
birthdate, credit card number, and social security number (ssn) for the first 10 rows of
the sample data stored in the Hadoop cluster, and accessed by the Hive UDF.
3. Click the Definition tab and edit the string in the Command text field.
4. Click OK.
A dialog indicates that the query is no longer identical to the one in the original .odc
file.
Running a permanent Hive Developer Template UDF query using Beeline when the
HiveServer2 setting doAs is set to false will generate the following error in the log file /var/
log/hive/hiveserver2.log (or an alternative location of HiveServer2 log files):
Caused by: com.voltage.securedata.config.ConfigException: Failed to load config/
auth properties from HDFS
at com.voltage.securedata.hadoop.config.HDFSConfigLoader.load(HDFSConfigLoader.java:189)
at com.voltage.securedata.hadoop.config.HDFSConfigLoader.load(HDFSConfigLoader.java:103)
at com.voltage.securedata.hadoop.hive.BaseHiveUDF.initConfig(BaseHiveUDF.java:56)
at com.voltage.securedata.hadoop.hive.BaseHiveUDF.getCrypto(BaseHiveUDF.java:82)
at com.voltage.securedata.hadoop.hive.BaseHiveUDF.evaluate(BaseHiveUDF.java:207)
... 33 more
Caused by: java.io.FileNotFoundException: File does not exist: /user/hive/voltage/config/
vsconfig.xml
This happens because the query is run as the hive system user, and the expected XML
configuration files are not found in the following directory in HDFS:
/user/hive/voltage/config
One way to solve this problem would be to copy the configuration files to the configuration
directory for the hive system user in HDFS: /user/hive/voltage/config. However, if
multiple users are allowed to create and run UDFs in the context of Beeline/HiveServer2, there
are security implications to these users having access to these shared configuration files,
particularly the XML configuration file vsauth.xml and the authentication credentials it
contains.
The Hive Developer Template solves this security issue by providing two additional Hive UDF implementations:
• com.voltage.securedata.hadoop.hive.ProtectDataGeneric
• com.voltage.securedata.hadoop.hive.AccessDataGeneric
These UDF classes inherit from a class hierarchy that ultimately extends the class GenericUDF
(by contrast, the regular Hive UDF classes, for the ProtectData and AccessData UDFs,
inherit from the class UDF).
The code implemented in the generic versions of the Hive UDFs is more complex, but does
provide access to the user running the query through the Hive SessionState API. This
ability, combined with LDAP Plus SharedSecret authentication, allows the generic UDFs to
make cryptographic key requests to the Key Server as the relevant user even when the
HiveServer2 setting doAs is set to false and the job itself runs as the hive system user.
1. Choose one of the schemes that allow you to put your Hive Developer Template
configuration files in a custom location (of your choosing). For more information, see
"Specifying the Location of the XML Configuration Files" (page 3-47).
2. In your chosen custom location, in the XML configuration file vsauth.xml, configure
authentication/authorization to use LDAP Plus Shared Secret, as described in "Other
Approaches to Providing Configuration Settings" (page 3-52).
3. Set the file permissions on the configuration files in your custom location in HDFS to
make them readable only by the hive system user (when the HiveServer2 setting doAs
is set to false, the hive system user is the user that must be able to read the
configuration files). This step is important so that access to the sensitive information,
such as the authentication credentials in the XML configuration file vsauth.xml, is
appropriately limited.
CAUTION: Because this approach requires that the hive system user be granted
read permission to these configuration files, any job running as that system user will
be able to read the sensitive information contained in them. In environments where
untrusted users can create and run new UDFs that end up running as the hive
system user, this approach may not be sufficiently secure.
4. Copy the following required two (or three) JAR files to a common (not user-specific) location in HDFS, for reference when creating the generic UDFs (for example: /apps/hive/voltage/hiveudf/):
• voltage-hadoop.jar
• vibesimplejava.jar
• vsconfig.jar (only if you are using the JAR-based alternative configuration file location approach)
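The shipped script create-hive-perm-udf.hql contains the statements that create the generic UDFs as permanent functions. A sketch of what one of those statements might look like, assuming the example HDFS location above (the exact statements in the shipped script may differ), is:
CREATE FUNCTION accessdatageneric AS
'com.voltage.securedata.hadoop.hive.AccessDataGeneric'
USING JAR 'hdfs:///apps/hive/voltage/hiveudf/voltage-hadoop.jar',
JAR 'hdfs:///apps/hive/voltage/hiveudf/vibesimplejava.jar';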
Make sure to update the paths to the JAR files in HDFS to the location where you
copied them, if different than shown above.
6. Call the newly created generic UDFs in the context of Hive queries, just as you would for
the regular protectdata and accessdata UDFs. For example:
SELECT id, name, accessdatageneric(cc, 'cc')
FROM voltage_sample
WHERE id <= 10;
The generic UDFs have the same parameters as the regular UDFs:
protectdatageneric(<value/column>, <cryptId>)
accessdatageneric(<value/column>, <cryptId>)
If you do not need both variants of the UDFs to run side-by-side, you could easily create the
generic UDFs with the regular names, protectdata and accessdata, and just use them in
almost all circumstances. While the generic UDFs are only strictly required in circumstances
where the HiveServer2 setting doAs must be set to false (such as when Sentry or Ranger are
being used), they can still be used in situations in which the doAs setting is set to true or is
irrelevant (the Hive command line being an example of the latter). In such cases, the generic
UDF implementation will authenticate the correct user (the current Hadoop user as reported by
the UserGroupInformation API) when performing LDAP Plus Shared Secret authentication.
Nevertheless, because their code is simpler and easier to understand, the regular UDFs remain
as the primary UDF sample implementations in the Hive Developer Template.
CAUTION: If you are using Apache Impala to run the Hive UDFs provided in the Developer
Templates, note that the generic UDFs described in this section do not work with Impala.
Impala does not support Hive UDFs that extend from class GenericUDF.
Running Hive Developer Template UDF queries in the context of LLAP faces the same
challenges as when running Hive Developer Template UDF queries using Beeline when the
HiveServer2 setting doAs is set to false (that is, with impersonation disabled). Because such
queries are run as the hive system user, using the regular Hive Developer Template UDFs
makes the choice of Voltage SecureData authentication method difficult. Shared Secret, which
does not do per-user authorization, is workable, but limited in that regard, providing only
identity pattern-matching authorization. The other available authentication methods (Username and Password, LDAP + Shared Secret, and Kerberos) will only work if you are willing to also give the hive user the appropriate privileges in LDAP or Kerberos, as applicable. However,
this does not effectively limit queries to only certain users because all users are running their
queries as the hive user.
The solution for running the Hive Developer Template in the context of LLAP is the same as for
the Beeline/HiveServer2 scenario described above: using the alternative, more sophisticated
UDFs that extend a base class that allows the UDFs to make cryptographic key requests to the
Key Server as the relevant user even when the LLAP daemons are running as the hive system
user. Creating and running the generic UDFs protectdatageneric and
accessdatageneric, as described in "Using the Generic Hive UDFs When Impersonation is
Disabled" (page 5-40), works in the context of LLAP without any additional changes required,
including with any of the available authentication methods.
Also, due to security concerns, Hive UDFs run in the context of LLAP are required to be created as permanent UDFs. If you attempt to run temporary UDFs in the context of LLAP, then depending on the LLAP mode (only, all, auto, map, or none), they will either fail or fall back to run on the external Hive queues.
To successfully execute the generic Hive UDFs in the context of LLAP, follow these steps:
2. Launch a Beeline session using the appropriate Beeline JDBC URL. For example (shown
on two lines to improve readability):
beeline -u "jdbc:hive2://<hive-server-interactive-host>:<port>/
;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-interactive"
NOTE: This URL is different from the one you use to launch Beeline for HiveServer2, including the host and port.
3. Create generic Hive Developer Template UDFs as permanent UDFs. For an example, see
the creation of the generic Hive UDFs protectdatageneric and
accessdatageneric in the script create-hive-perm-udf.hql.
4. Verify that the Hive LLAP execution mode is set as desired (generally either only or
all in order to force LLAP execution as much as possible).
5. Restart the Hive service. This is required to avoid errors associated with the generic
Hive UDFs you created in Step 3 above. Depending on the version of HDP you are
using, such errors can be associated with not finding the UDF(s) or not being allowed to
run the UDF(s).
6. Re-launch Beeline as described in Step 2 above. This is required because of the Hive
service restart.
7. Run a Hive query using the generic Hive UDFs created in Step 3 above.
NOTE: For information about additional steps, both optional and required, that you must
also perform if you are using Kerberos authentication in the context of LLAP, see "Kerberos
Authentication When Beeline/HiveServer2 Impersonation is Disabled" (page 3-84).
When Impala runs the Hive UDFs, it will compile the JAR files referenced by the UDFs into a single JAR file within the sub-directory /var/lib/impala/udfs and then use that combined JAR file to run the UDFs.
Running the MapReduce protect job involves both common infrastructure steps and a specific
MapReduce step:
1. Make sure a home directory exists in HDFS for the user that will be running the Impala
queries. This is usually the user impala, but if Apache Sentry is being used for
authorization, it could be a different user. For more information, see "Creating a Home
Directory in HDFS" (page 3-75).
2. Run the script copy-sample-data-to-hdfs to copy the files required for the
MapReduce protect job to HDFS using:
./copy-sample-data-to-hdfs
For more information, see "Loading Hadoop Developer Template Files into HDFS" (page
3-76).
For more information, see "Running the MapReduce Template" (page 4-6).
After the MapReduce protect job has created the output CSV file mr-protected-data.csv
in the local directory sampledata, you can run the first of the Impala-specific scripts:
copy-impala-data-to-hdfs. As the user impala (or potentially another user if Sentry is being used), run this Impala-specific script:
./copy-impala-data-to-hdfs
This script:
• If necessary, creates the HDFS directory voltage/config (within the user’s home
directory), and then copies the XML configuration files vsconfig.xml and
vsauth.xml from the local file system to that directory.
• If necessary, creates the HDFS directory voltage/hiveudf (within the user’s home
directory), and then copies the JAR files voltage-hadoop.jar and
vibesimplejava.jar from the local file system to that directory.
NOTE: If you are using the JAR-based alternative configuration file location
approach, as described in "Config-Locator Properties File Packaged as a JAR File"
(page 3-48), you can uncomment a line in this script to copy the JAR file
vsconfig.jar to the voltage/hiveudf directory as well.
• create-impala-table.sql
Before using this SQL script, you must edit it and replace all non-comment instances of
<username> with the name of the user you are using to run the Impala scripts. Normally, this is
the user impala, but it might be another user if you are using Sentry for authorization.
As the user whose name you edited into the table creation script, in the Impala shell, create the
Impala tables as follows:
[<daemon-node-name>:21000] > source create-impala-table.sql;
NOTE: When you run the impala-shell command on a computer that is not running the Impala daemon, the Impala shell starts without a connection to a daemon. To connect to a computer running the Impala daemon, use the Impala shell command connect (for example, connect <daemon-node-name>;); once connected, the prompt becomes:
[<daemon-node-name>:21000] >
This SQL script creates and loads tables that are similar to those created by the Hive HQL
script create-hive-table.hql, but because Hive and Impala share the Hive metastore
database, in order to avoid naming conflicts and problems when dropping the tables, it
appends _impala to all of the names of the tables it creates.
NOTE: When you create the Impala tables using the CSV files mr-protected-data.csv,
creditscore.csv, and encoded_binary.csv in the HDFS directory voltage/
impala-sample-data, those files are moved in the course of table creation to the Hive
data warehouse in HDFS. This means that if you want to re-create these tables for any
reason, you must run the script copy-impala-data-to-hdfs again in order for the SQL
script create-impala-table.sql to find its source CSV files where it expects to find
them.
• create-impala-perm-udf.sql
NOTE: Temporary Impala UDFs do not persist across an Impala service restart; permanent Impala UDFs do.
Before using one or the other of these SQL scripts, you must edit it and replace all non-
comment instances of <username> with the (same) name of the user account under which
you are running the Impala SQL scripts, either user impala or another user if you are using
Sentry for authorization.
As the user whose name you edited into the temporary or permanent UDF creation script(s), in the Impala shell, create the Impala UDFs as follows:
[<daemon-node-name>:21000] > source create-impala-temp-udf.sql;
or
[<daemon-node-name>:21000] > source create-impala-perm-udf.sql;
These scripts create the temporary or permanent UDFs using different syntax and both using
the same set of UDF names:
• protectdataimpala
• accessdataimpala
• protectbinarydataimpala
• accessbinarydataimpala
As with the Impala table names, this is done to avoid name conflicts with the Hive UDFs, which could otherwise cause issues when creating permanent UDFs and when dropping any of the UDFs (due to ownership differences).
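For reference, the Impala scripts use Impala's Java UDF syntax, in which the JAR location and the implementing class are specified explicitly. A sketch of the kind of statement involved, assuming the JAR location used earlier in this chapter (the exact signatures in the shipped scripts may differ), is:
CREATE FUNCTION accessdataimpala(STRING, STRING) RETURNS STRING
LOCATION '/user/impala/voltage/hiveudf/voltage-hadoop.jar'
SYMBOL='com.voltage.securedata.hadoop.hive.AccessData';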
NOTE: These SQL scripts use the same names for the temporary and permanent Impala
UDFs: protectdataimpala, accessdataimpala, protectbinarydataimpala, and
accessbinarydataimpala. To avoid name conflicts, create either temporary Impala UDFs
or permanent Impala UDFs, but not both.
• run-impala-join-query.sql
• run-impala-binary-query.sql
To run the JOIN query, as the user impala (or potentially another user if Sentry is being used),
in the Impala shell, run the SQL script run-impala-join-query.sql to use the temporary
or permanent Impala UDF for ciphertext access (depending on which type you created) named
accessdataimpala in a JOIN query on tables created by the script create-impala-
table.sql:
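The invocation follows the same pattern as the other Impala script invocations in this section:
[<daemon-node-name>:21000] > source run-impala-join-query.sql;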
When this script completes, the first 10 rows of the sample data are shown on the console.
NOTE: This Impala script, as shipped, does not assume that the Simple API and the REST
API support FPE2 extended characters, as suggested by the commented out call to the UDF
accessdataimpala with the cryptId extended:
To access the name field using the cryptId extended, remove the comment designators
(--) from the beginning of that line (and to avoid retrieving the ciphertext version as well,
remove the s.name, from the line above).
To run the binary data query, as the user impala (or potentially another user if Sentry is being
used), in the Impala shell, run the SQL script run-impala-binary-query.sql to use the
temporary or permanent Impala UDF for ciphertext access (depending on which type you
created) named accessbinarydataimpala in a simple query on the binary data tables
created by the script create-impala-table.sql:
[<daemon-node-name>:21000] > source run-impala-binary-query.sql;
When this script completes, the Base64-encoded binary data of the two .PNG images, and their
associated data, are shown on the console.
• drop-impala-udf.sql
As the user impala (or potentially another user if Sentry is being used), in the Impala shell, run
the SQL script drop-impala-udf.sql to drop both temporary and permanent Impala UDFs
with the names accessdataimpala, protectdataimpala, accessbinarydataimpala,
protectbinarydataimpala from the Hive metastore database, if any:
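The invocation follows the same pattern as the other Impala script invocations in this section:
[<daemon-node-name>:21000] > source drop-impala-udf.sql;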
6 Sqoop Integration
The Hadoop Developer Templates demonstrate how to integrate Voltage SecureData data
protection technology in the context of Sqoop. This demonstration includes the use of the
Simple API (version 4.0 and greater) and the REST API.
The Sqoop integration is more complex than the MapReduce and Hive integrations because Sqoop
does not provide an explicit mechanism through which its import process can be extended to
support a transformation phase. Nevertheless, for Sqoop 1.x at least, this integration was
accomplished by devising a way to wrap the object-relational mapping (ORM) class generated
using the codegen command, combined with the Java packages in the common infrastructure.
This chapter describes this integration and provides instructions on how to run the
Sqoop template using the provided sample data. For more information about the common
infrastructure used by the Hadoop Developer Templates, see Chapter 3, “Common
Infrastructure”.
• Integration Architecture of the Sqoop Template (page 6-2) - This section explains the
Java classes that are specific to the Sqoop template as well as information about how
batch processing is achieved for the different SecureData APIs that this template can
use.
• Configuration Settings for the Sqoop Template (page 6-4) - This section reviews the
configuration settings that are relevant to the Sqoop template and provides an example
of how and why you would want to change those settings as you adapt the template to
your own use.
• Running the Sqoop Template (page 6-6) - This section provides instructions for
running the Sqoop template using the provided sample data.
The following Java package and its associated Java source code provide the classes that
implement the Sqoop integration:
com.voltage.securedata.hadoop.sqoop
Unlike MapReduce and Hive, Sqoop does not provide an explicit mechanism through which to
integrate custom code. The base Sqoop import job runs from the command-line, using
command-line arguments, and imports data from an external relational database table into
HDFS directly, without a well-defined or documented way to transform or otherwise process
the data as it is being imported. Sqoop is an efficient Hadoop-based extract-load (EL) tool but
not a full-fledged extract-transform-load (ETL) tool. There is no transform phase that is
designed to be customized.
NOTE: Some Sqoop import options may not work at all with the Sqoop integration, while others
work only with the non-batched version of the integration. For more information, see “Advanced
Sqoop Import Options” (page 6-10).
However, Sqoop does provide a way to generate code for the object-relational mapping (ORM)
class that is used to perform the import, with this class being provided as an explicit command-
line input to the Sqoop import command. This opens a mechanism by which custom data
processing code can be integrated into the import data flow, effectively providing a custom
transform phase.
NOTE: The Sqoop integration architecture described in this section works in the context of
Sqoop 1.x. Sqoop 2 uses an entirely different server-side architecture, which does not
support code generation (using the codegen command). Therefore, the approach outlined
here does not apply to Sqoop 2.
Also note that some newer Hadoop distributions have changed such that they no longer use
the deprecated Cloudera package for their Sqoop-generated ORM classes. The Maven build
script (pom.xml) for the Hadoop Developer Templates dynamically adapts to the use of
either the older, deprecated Cloudera Sqoop package or the newer Apache Sqoop package.
For more information, see “Support for Newer Apache Sqoop 1.x Versions” (page 2-16).
The Developer Template for Sqoop accomplishes this by wrapping the generated ORM class in
the class SqoopRecordWrapper. This class uses the Java Reflection API to read data from,
and write data to, the wrapped ORM class. Using this approach, custom code is integrated into
this wrapper to protect the data after it is read from the JDBC ResultSet, and before it is
returned as an output string for Sqoop to write to HDFS.
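The following is a minimal Scala sketch of this wrapping-plus-reflection idea. The shipped classes are written in Java, and the class name, field handling, and setAccessible call here are simplifying assumptions, not the template's actual implementation:

class OrmFieldAccessSketch(wrapped: AnyRef) {
  // Read every String-typed field of the wrapped ORM object by name, so the
  // values can be handed to the data protection API.
  def columnValues(): Map[String, String] =
    wrapped.getClass.getDeclaredFields.collect {
      case f if f.getType == classOf[String] =>
        f.setAccessible(true)
        f.getName -> f.get(wrapped).asInstanceOf[String]
    }.toMap

  // Write a (possibly protected) value back into the wrapped ORM object.
  def setColumnValue(name: String, value: String): Unit = {
    val field = wrapped.getClass.getDeclaredField(name)
    field.setAccessible(true)
    field.set(wrapped, value)
  }
}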
The abstract base class SqoopRecordWrapper demonstrates the first phase of the batching
functionality for this integration, which is especially important to minimize network overhead
when using the REST API. For more information about batching in the Sqoop Developer
Template, see "Batch Processing for Sqoop" (page 6-3).
At runtime, the overall process, as implemented in the indicated shell scripts and assuming that
you have loaded data into your MySQL database table, involves the following general steps:
1. Run the sqoop codegen command to generate the ORM class for your table.
2. Run the sqoop import command to use the generated ORM class and Developer
Template libraries to protect data as it is being imported into HDFS.
• readFields(ResultSet __dbResults)
This method reads the data into the ORM class from the JDBC ResultSet, in batches.
• toString()
This method protects the batch of data by calling the configured Voltage SecureData
data protection API (either the Simple API or the REST API), and returns the processed
batch results as a string to Sqoop for writing to HDFS.
Whether a particular Sqoop import option works using this batching approach depends on
whether the option in question calls both the readFields and toString methods, allowing
both of their overridden SecureData counterparts to perform their roles in the cryptographic
processing. Some Sqoop import options, such as options to import into HCatalog, only call the
readFields method, making the batching approach unworkable. However, if your
cryptographic needs can be satisfied with the Simple API (without the REST API), a different,
non-batched approach is possible. For more information about how to determine whether the
Sqoop integration can work with a particular Sqoop import option, see “Advanced Sqoop Import
Options” (page 6-10).
The default Sqoop import data flow processes its data record-by-record. However, the Developer
Template for Sqoop has been written so that the data to be protected or accessed is batched
together for efficient processing, regardless of which type of API is used. When the Simple API
is being used, the batched list of plaintext or ciphertext is looped over in the Sqoop template
code itself, and when the REST API is used, the batched list of plaintext or ciphertext is
processed in a single REST list operation.
The batching of plaintext and ciphertext for efficient data protection processing by the REST
API within the Sqoop template code is not as straightforward as the batching mechanism for
the MapReduce template. This is because, unlike MapReduce and Hive, Sqoop does not
provide an explicit mechanism through which to integrate custom code. The base Sqoop
import job runs from the command-line, using command-line arguments, and imports data from
an external relational database table into HDFS directly, without a well-defined or documented
way to transform or otherwise process the data as it is being imported, whether batched or not.
The integration technique of wrapping the generated object-relational mapping (ORM) class in
the class SqoopRecordWrapper can also be used for batching plaintext or ciphertext together
for efficient data protection processing. In the default Sqoop import data flow, data is read from
the JDBC ResultSet into the fields of the ORM class via the method readFields, and then
the values from these fields are returned to Sqoop as a string via the method toString. This is
all done record-by-record, which is inefficient when using the REST API.
As with the MapReduce Developer Template, this problem is solved in the class
SqoopRecordWrapper by altering the logic of its overridden methods readFields and
toString. The method readFields batches up the records it reads from the JDBC
ResultSet. Then, when the method toString is called, the data protection processing is
performed on the entire batch and the data protection results are compiled into the result
string returned to Sqoop for writing to HDFS.
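The following is a minimal Scala sketch of this two-method batching contract. The shipped SqoopRecordWrapper and its subclasses are Java classes that extend the generated ORM type and call the real SecureData APIs; the class name, the way the ResultSet is advanced, and the placeholder protectBatch call below are assumptions for illustration only:

import java.sql.ResultSet
import scala.collection.mutable.ArrayBuffer

class BatchingRecordSketch(batchSize: Int = 2000) {
  private val batch = ArrayBuffer[Seq[String]]()

  // Pull up to batchSize rows from the ResultSet into one logical "record",
  // so that a whole batch is carried between readFields and toString.
  def readFields(results: ResultSet): Unit = {
    batch.clear()
    val columnCount = results.getMetaData.getColumnCount
    var rows = 0
    var more = true
    while (more) {
      batch += (1 to columnCount).map(i => results.getString(i))
      rows += 1
      more = rows < batchSize && results.next()
    }
  }

  // Protect the whole buffered batch in one call (a single REST list operation,
  // or a loop over the Simple API), then emit the delimited output lines.
  override def toString: String = {
    val processed = protectBatch(batch.toSeq)
    batch.clear()
    processed.map(_.mkString(",")).mkString("\n")
  }

  // Placeholder for the call into the configured SecureData data protection API.
  private def protectBatch(rows: Seq[Seq[String]]): Seq[Seq[String]] = rows
}

Consuming multiple rows per readFields call is one way to produce the batch-counting behavior described in the NOTE that follows; the shipped implementation may differ in detail.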
NOTE: The Sqoop integration batches the import operation for efficient REST API
processing. However, as a side effect of this batching, the Sqoop import command reports
the number of batches processed as the number of records retrieved.
For example, the Sqoop import job might display the following result message:
INFO mapreduce.ImportJobBase: Retrieved 8 records.
This message actually indicates that eight batches of records (not just eight records) were
processed/imported. Each batch contains up to 2000 records, which is the default batch size
configured in the Sqoop Developer Template code. Some batches might contain fewer
records if a partial batch was processed.
There are three classes of configuration settings used by the Sqoop template:
<fieldMappings>
.
.
.
<fields component="sqoop">
<field name = "email" cryptId = "alpha"/>
<field name = "birth_date" cryptId = "date"/>
<field name = "cc" cryptId = "cc"/>
<field name = "ssn" cryptId = "ssn"/>
</fields>
</fieldMappings>
Before you begin to modify the Sqoop Developer Template XML configuration files for your
own purposes, such as using your own Voltage SecureData Server or different data formats,
Micro Focus Data Security recommends that you first run the Sqoop Developer Template
samples as provided, giving you assurance that your Hadoop cluster is configured correctly and
functioning as expected.
You will need to update many of the values in the XML configuration files vsauth.xml and
vsconfig.xml in order to protect your own data using your own Voltage SecureData Server.
In order to add protection and access of the phone column when running the Sqoop template,
you would need to edit the XML configuration file vsconfig.xml to add new configuration
settings to the Sqoop fields element, as follows (addition highlighted):
<fieldMappings>
.
.
.
<fields component="sqoop">
<field name = "email" cryptId = "alpha"/>
<field name = "birth_date" cryptId = "date"/>
<field name = "cc" cryptId = "cc"/>
<field name = "ssn" cryptId = "ssn"/>
<field name = "phone" cryptId = "alpha"/>
</fields>
</fieldMappings>
• The name attribute of the new field element for the phone number is set to the
column name phone. This is the column name under which the phone numbers in the
sample data file plaintext.csv are imported into the MySQL database table during
the first phase of running the Sqoop Developer Template.
• The cryptId attribute of the new field element for the phone number is set to
alpha, which is the name of a cryptId that uses the built-in format Alphanumeric and
the Simple API (the default API), providing the best performance. This format
produces ciphertext phone numbers with the plaintext digits replaced with either digits
or letters (uppercase and lowercase). Non-alphanumeric characters such as +, (, and )
are preserved as-is.
NOTE: Remember that if you edit the versions of the configuration files vsauth.xml and/or
vsconfig.xml on the local file system, you must copy the updated versions to HDFS,
which is where the Hadoop Developer Templates will access them when they are run. For
instructions, see "Loading Updated Configuration Files into HDFS" (page 3-77).
The MySQL JDBC driver is already included in some Hadoop distributions (such as HDP). The
sample scripts assume that the sample data exists in a MySQL database. You must load the
sample data into MySQL, generate a JAR file, and then use Sqoop to import the data while also
protecting some of the fields.
NOTE: The Sqoop integration in the Hadoop Developer Templates is primarily meant for the
main Sqoop import use-case of loading data from a relational database table into HDFS.
Sqoop supports several advanced import options that provide different output options, not
all of which will work with the Sqoop integration. For more information, see “Advanced Sqoop
Import Options” (page 6-10).
NOTE: Subsequent commands in this section are issued at the mysql> prompt.
NOTE: Specifying CHARACTER SET UTF8 is now required due to the extended
characters in the name column.
5. Exit MySQL:
mysql> exit;
• DATABASE_NAME=<database name>
3. By default, the database fields imported into HDFS by Sqoop, some of which are
protected in the process, are delimited by the comma character (,). If you want to use a
different delimiter character in the HDFS import file(s), you must change the --table line
in the script codegen so that it also specifies an alternative delimiter, as follows:
--table $TABLE_NAME --fields-terminated-by "<delimiter>" \
4. Save the file, then run the script from the bin directory:
./codegen
5. At the prompt, enter the password for your MySQL database user <username> (the value
specified for the DATABASE_USERNAME variable).
If the script runs successfully, the generated ORM class is compiled and packaged into the JAR
file com.voltage.sqoop.DataRecord.jar.
If the script does not run successfully (if the variables are not updated correctly, for example),
you see messages such as the following:
./codegen: line 12: syntax error near unexpected token 'newline'
NOTE: The ORM source files (.java and .class) are created in a package directory
structure within the parent directory from which you are running the codegen script. This
means that the user account running the command must have permissions to create a new
sub-directory within the current working directory. If your user account does not have the
required permissions, the codegen script might fail with a permissions-related error.
If you see such an error message, grant the required permissions for the current working
directory and try again.
• DATABASE_NAME=<database name>
3. Save the file, then run the script from the bin directory:
./run-sqoop-import
4. At the prompt, enter the password for your MySQL database user <username> (the value
specified for the DATABASE_USERNAME variable).
If the script runs successfully, the data from the voltage_sample table in MySQL is imported
into the following directory in HDFS (relative to your home directory):
voltage/protected-sqoop-import
The fields specified in the configuration file vsconfig.xml are protected during the import.
If the script does not run successfully (if the variables are not updated correctly, for example),
you see messages such as the following:
./run-sqoop-import: line 11: syntax error near unexpected token `newline'
As described above, the batched version of the Sqoop integration works by overriding two
methods in the generated ORM class: readFields and toString. Together, the overridden
versions of these methods work in tandem to perform cryptographic processing:
• readFields reads the data into the ORM class from the JDBC ResultSet, in batches.
• toString protects the batch of data and returns the processed batch results as a
string to Sqoop for writing to HDFS.
Sqoop provides a number of import options that vary with respect to whether their processing
calls both of these methods, allowing the batched version of the Sqoop integration to work as
expected. For example:
• --fields-terminated-by
In the context of the codegen/import flow in the Sqoop integration, this import option
is ignored if it is specified in the import command itself. However, as explained in step 3
of “Generate an ORM JAR File” (page 6-7), this import option can be specified in the
codegen step of the flow to allow the use of a delimiter other than the default, the comma
character (,). This is required because the delimiter character is defined as a constant in
the generated ORM class and is used during the import operation, regardless of any
value specified for this option in the import command.
Because they call both of the overridden readFields and toString methods of the
generated ORM class, both of these import options, used to import directly from a
relational database table into a table in Hive, work using the batched version of the
Sqoop integration.
Because they only call the overridden readFields method of the generated ORM
class, these import options, used to import directly from a relational database table into
a table in HCatalog, do not work using the batched version of the Sqoop integration. If
you attempt to use the batched version of the Sqoop integration with either of these
HCatalog import options, you will see that the number of output records is less than the
number of input records (because the toString method is not called to process and
clear the batches) and that the specified output fields have not been protected (also
performed by the uncalled toString method).
NOTE: The job logs will contain evidence of the method readFields being called
(messages of the form Reading fields from result set; web service
batch size:), but without corresponding evidence of the method toString being
called (no messages of the form Writing data to HDFS; batch size:).
As with the HCatalog import options described above, this import option, used to import
directly from a relational database table into Apache Parquet Files, does not work with
the batched version of the Sqoop integration for the same reason: because only the
overridden readFields method of the generated ORM class is called. Job log evidence
of this fact is the same: Reading messages without corresponding Writing messages.
Likewise, the workaround is the same, again assuming exclusive use of the Simple
API for cryptographic processing: use the non-batched version of the Sqoop integration,
as described in “Non-Batched Version of the Sqoop Integration” (page 6-12).
Other advanced Sqoop import options may work with the Sqoop integration, depending on
whether they:
• Call both of the (overridden) readFields and toString methods of the generated
ORM class, allowing the batched version of the Sqoop integration to be used.
• Only call the (overridden) readFields method of the generated ORM class, allowing
the alternative non-batched version of the Sqoop integration to be used (with the
Simple API only due to the network overhead associated with non-batched use of the
REST API).
If a particular Sqoop import option does not conform to either of these scenarios, it cannot be
used with the Sqoop integration. For more information, see “Determine Whether the Sqoop
Integration Supports a Sqoop Import Option” (page 6-13).
The Sqoop integration provides the following alternative ORM wrapper class:
NonBatchedSqoopImportProtector
Unlike the main SqoopRecordWrapper base class, which performs the cryptographic
processing in stages using a combination of both the readFields and toString methods for
optimized batch processing, the ORM wrapper class NonBatchedSqoopImportProtector
performs all cryptographic processing within the readFields method.
The main disadvantage to this alternative approach is that all cryptographic operations are
performed on individual records, one at a time, with no batching. This is problematic in the case
of remote cryptographic processing using the REST API, which will not be able to handle large
data sets efficiently. However, if you are only performing local cryptographic processing using
the Simple API, this alternative approach may allow you to perform the SecureData integration
successfully, even when using an advanced Sqoop import option that was failing when using
the batched (SqoopRecordWrapper) version of the Sqoop integration.
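The following is a minimal Scala sketch of the non-batched idea. The shipped NonBatchedSqoopImportProtector is a Java class; the class name, field handling, and placeholder protectValue call below are assumptions for illustration only:

import java.sql.ResultSet

class NonBatchedRecordSketch {
  private var current: Seq[String] = Seq.empty

  // Protect the single current row immediately, field by field, as it is read.
  def readFields(results: ResultSet): Unit = {
    val columnCount = results.getMetaData.getColumnCount
    current = (1 to columnCount).map(i => protectValue(results.getString(i)))
  }

  // Nothing is left for toString to do beyond emitting the delimited row.
  override def toString: String = current.mkString(",")

  // Placeholder for a per-value Simple API protect call.
  private def protectValue(value: String): String = value
}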
To use the non-batched version of the Sqoop integration, create a version of the run-sqoop-
import script that replaces the SqoopImportProtector class (which extends the
SqoopRecordWrapper base class) with the NonBatchedSqoopImportProtector class,
and which adds the specific advanced Sqoop import option that was otherwise failing (using
the option --hcatalog-database as an example here):
sqoop import \
-libjars com.voltage.sqoop.DataRecord.jar,../common-crypto/simpleapi/vibesimple...
--username $DATABASE_USERNAME \
-P \
--connect jdbc:mysql://$DATABASE_HOST/$DATABASE_NAME \
--table $TABLE_NAME \
--jar-file voltage-hadoop.jar \
--class-name com.voltage.securedata.hadoop.sqoop.NonBatchedSqoopImportProtector \
--target-dir voltage/protected-sqoop-import \
--hcatalog-database
Then run the modified script and check the results. Also, check the job logs for the following log
messages, indicating that cryptographic processing was performed in the method
readFields:
As mentioned above, the <crypto-api-type> in this message needs to specify the
Simple API. Using remote cryptographic processing, such as with the REST API, with the
non-batched version of the Sqoop integration results in an individual network call for each
protect operation, causing severe performance issues and potential job timeouts or other
failures when importing large data sets.
• The advanced Sqoop import option in question must recognize and use the
--class-name parameter to the sqoop import command, which is used to specify
the generated ORM class whose methods are called during the import.
• In the generated ORM class specified by the --class-name parameter to the sqoop
import command, the advanced Sqoop import option in question must either A) call
both of the methods readFields and toString (batched version of the Sqoop
integration), or B) call only the method readFields (non-batched version of the
Sqoop integration).
The batched version of the Sqoop integration relies on both of the methods
readFields and toString being called, the overridden versions of which work in
tandem to provide batched cryptographic processing. This version is appropriate for
use with either the Simple API or the REST API.
The non-batched version of the Sqoop integration (see "Non-Batched Version of the
Sqoop Integration" on page 6-12) relies only on the method readFields being called,
the overridden version of which provides non-batched cryptographic processing. This
version is appropriate for use only with the Simple API due to the high network
overhead associated with non-batched use of the REST API.
The steps to determine compatibility of a particular advanced Sqoop import option with one or
the other of the versions of the Sqoop integration are as follows:
1. Try the batched version of the Sqoop integration on a small sample dataset (a few
thousand records, as provided in the sample input data) with the advanced Sqoop
import option of interest, such as by creating a variant of the run-sqoop-import
script with the relevant option added. After the import job completes, examine the
resulting output to make sure that all of the input records were processed and that the
specified fields in the records were protected as expected. In particular, check for the
following types of errors:
• The output is not formatted properly, such as if the specified delimiter character
was not used. This may indicate a failure to apply the specified Sqoop import
option in conjunction with the generated ORM class.
• The specified fields are not protected in the output (they are still plaintext). This
indicates a failure to call both of the methods readFields and toString in the
generated ORM class correctly.
• The number of records in the output is less than the number of records in the
input. This indicates a failure to call the method toString in the generated
ORM class to process and clear the batches.
If none of these errors are found, the advanced Sqoop import option of interest appears
to work correctly with the batched version of the Sqoop integration. Even more
confidence can be gained by looking in the job logs for the messages described in
step 3 below.
If any of these errors are found, proceed to step 2 to determine whether the advanced
Sqoop import option of interest is compatible with the use of a generated ORM class.
2. If step 1 above fails (one or more of the types of errors described is found), the next
step is to try the option again, still specifying a generated ORM class but one without
any SecureData integration.
IMPORTANT: This sqoop import command specifies and uses a generated ORM
class but does not include the Hadoop Developer Templates JAR file nor the Simple
API JAR file, completely bypassing the SecureData integration.
After the import job completes, check the resulting output again, to make sure that all of
the input records were imported directly, as is, to the output location and that the
specified advanced Sqoop import option of interest was applied. If any errors are found,
the advanced Sqoop import option of interest may just not work with any generated
ORM class specification, integrated with SecureData or not.
If no errors are found, the advanced Sqoop import option would appear to work
correctly with a generated ORM class. Proceed to step 3 to troubleshoot the
cryptographic processing associated with the Sqoop integration.
NOTE: Keep in mind that a failure to apply the specified advanced Sqoop import
option in this simpler sqoop import command may also indicate that the option
should be provided earlier, during the codegen step, as is the case for the
--fields-terminated-by option described above.
3. If step 1 failed, indicating that the batched version of the Sqoop integration does not
work for the advanced Sqoop import option of interest, but step 2 succeeded, indicating
that the advanced Sqoop import option of interest works, in general, with a generated
ORM class, the next step is troubleshooting the SecureData aspect of the integration.
Specifically, you must determine which of the overridden methods readFields and
toString are called, if any.
To determine this, run the sqoop import command again (as in step 1 above), and
check the logs for the import job (for example, in the Hadoop JobHistory user interface).
Specifically, look for the following messages logged by the SecureData
SqoopRecordWrapper class, indicating that the corresponding method was invoked:
messages of the form Reading fields from result set; web service batch size:
(logged by the method readFields) and messages of the form Writing data to HDFS;
batch size: (logged by the method toString).
If either of these log messages is missing, then that indicates that Sqoop did not call the
corresponding overridden method in the generated ORM class when performing the
import operation with the specified advanced Sqoop import option, preventing the
SecureData cryptographic integration from working properly.
If neither of the required methods was called, the Sqoop integration does not support
the advanced Sqoop import option in question.
If the readFields method was called but the toString method was not called, and if
your scenario can work with the Simple API only (that is, no functionality specific to the
REST API, such as SST, is required), the advanced Sqoop import option of interest
should be able to use the non-batched version of the Sqoop integration, as described in
“Non-Batched Version of the Sqoop Integration” (page 6-12).
7 Spark Integration
The Spark Developer Template demonstrates how to integrate Voltage SecureData data
protection technology in the context of Apache Spark, using the Scala programming language.
The Spark Developer Template also demonstrates the use of PySpark, the Python API to
Apache Spark. This demonstration includes the use of the Simple API (version 4.0 and greater)
and the REST API to perform cryptographic operations.
The Spark Developer Template demonstrates the use of several different older and newer
Spark and PySpark APIs and their corresponding data structures, including the following
variants:
• Spark SQL - Using Spark UDFs to perform cryptographic operations by making a Spark
and PySpark SQL query on a Spark DataFrame data set.
The Spark Developer Template uses the Spark driver/processor model. The data to be
protected or accessed is originally in the form of one or more columns in an input CSV file, and
the protected or accessed result ends up in an output CSV file.
At a high level, these Spark Developer Template drivers perform the following steps:
• Perform the Spark creation phase by reading the input CSV file into the appropriate
Spark data structure (RDD, Dataset, and/or DataFrame objects, and/or a temporary
SQL table or view).
• Call the processor, either directly or by using a UDF, to protect or access the relevant
columns in the Spark object for that variant.
The Spark Developer Template drivers and processors, written in Scala, work in conjunction
with the Java packages in the common infrastructure. This chapter provides a description of
these Spark components as well as instructions on how to run the Spark Developer
Template using the provided sample data.
For more information about the common infrastructure used by all of the Developer Templates,
including the Spark Developer Template, see Chapter 3, “Common Infrastructure”.
NOTE: The Spark Developer Template sample job uses one of the sample data files already
supplied with the existing Hadoop Developer Templates: plaintext.csv (located in the
directory <install_dir>/sampledata). The Spark job script
run-spark-prepare-job copies this file to a Spark-job-specific location in HDFS. For
more information about this sample data file, see "Sample Data for the Spark Developer
Template" (page 7-14).
• Integration Architecture of the Spark Developer Template (page 7-3) - This section
explains the Scala classes that implement the driver and processor components of the
Spark Developer Template. It also describes the three Python modules that serve as
driver components for three PySpark variants of the Spark Developer Template.
• Configuration Settings for the Spark Developer Template (page 7-12) - This section
reviews the configuration settings that are relevant to the Spark Developer Template,
including the component-specific approach demonstrated by the XML configuration file
vsspark-rdd.xml.
• Sample Data for the Spark Developer Template (page 7-14) - This section provides a
description of the sample data provided for the Spark Developer Template.
• Running the Spark Developer Template (page 7-14) - This section provides instructions
for running the Scala-based variants of the Spark Developer Template, as provided, as
well as how to make the source code changes necessary to use the simple processor
instead of the default batch processor. It also includes instructions for running the three
PySpark variants of the Spark Developer Template.
• Limitations of the Spark Developer Template (page 7-26) - This section provides
information about the type of improvements you will need to make in order to create a
production-grade Spark solution that integrates calls to the Voltage SecureData APIs.
The Spark Developer Template uses the Spark driver/processor model to protect incoming
plaintext or to access incoming ciphertext from an input CSV file, using Format-
Preserving Encryption (FPE) or Secure Stateless Tokenization™ (SST). The driver and
processor classes are written in the Scala programming language. Python drivers are also
included for the three PySpark variants of the Spark Developer Template.
The Spark Developer Template uses the same two XML configuration files that are used by the
other Hadoop Developer Templates: vsauth.xml and vsconfig.xml. Depending on the
variant, it may also use another Spark-specific XML configuration file named vsspark-
rdd.xml to contain the Spark-specific field configuration settings that associate a CSV column
number with a cryptId specified in the XML configuration file vsconfig.xml. These settings
could also be specified in that latter file (vsconfig.xml) in a fields element with its
component attribute set to spark.
NOTE: The Spark Developer Template demonstrates a feature of the common configuration
classes that allows the separation of configuration values that are specific to a particular
Hadoop technology into a separate configuration file for that technology. The other Hadoop
Developer Templates may take advantage of this possible isolation of configuration value
processing in a future release.
The Spark Developer Template makes use of the common infrastructure provided with Voltage
SecureData for Hadoop to retrieve global file-based configuration information and to provide
data translation, a cryptographic abstraction layer, and a REST client.
The Spark Developer Template provides three Scala packages for the five variants of the Spark
Developer Template:
• RDD Variant:
For information about this variant, described together with the Dataset variant, see
"RDD and Dataset Variants" (page 7-4).
• Dataset Variant:
For information about this variant, described together with the RDD variant, see "RDD
and Dataset Variants" (page 7-4).
• DataFrame, Spark SQL, and HiveUDF Variants:
For information about these variants, see "UDF-Based Variants" (page 7-8).
The Spark Developer Template also provides a PySpark (Python) package for interfacing with
the Scala UDF variants DataFrame and Spark SQL (provided in the package
com.voltage.securedata.spark.udf) and the Java Hive UDF (provided in the package
com.voltage.securedata.hadoop.hive).
For information about these variants, see "UDF-Based Variants" (page 7-8).
The following two sub-sections describe the Spark Developer Template variants based on
their shared functionality:
The RDD variant defines the Spark driver object SDSparkDriver. This object defines the
function main, which serves as the Spark job entry point. This object:
2. Reads the data in the input CSV file into a Spark RDD.
3. Calls to the Spark processor’s function processRDD (defined for both the batch
processing class SDSparkBatchProcessor and the simplified single-crypto-operation
processing class SDSparkProcessor). This function returns another, transformed
RDD with the cryptographic results.
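The following is a minimal Scala sketch of this RDD-based flow. It is not the shipped SDSparkDriver: the output directory and the placeholder protectRow function are assumptions standing in for the template's configuration loading and processor call:

import org.apache.spark.sql.SparkSession

object RddFlowSketch extends Serializable {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rdd-flow-sketch").getOrCreate()

    // Read the sample CSV file from HDFS into an RDD of lines.
    val inputRdd = spark.sparkContext.textFile("voltage/spark-plaintext-sample-data/plaintext.csv")

    // The processor returns a new, transformed RDD with the configured columns
    // protected or accessed; here the placeholder simply echoes each row.
    val resultRdd = inputRdd.map(line => protectRow(line.split(",", -1)).mkString(","))

    resultRdd.saveAsTextFile("voltage/sketch-rdd-output") // hypothetical output directory
    spark.stop()
  }

  // Placeholder for the per-row (or per-batch) cryptographic processing.
  private def protectRow(columns: Array[String]): Array[String] = columns
}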
The Dataset variant defines the Spark case class Person and the Spark driver object
SDSparkDatasetDriver. This driver object defines the function main, which serves as the
Spark job entry point. This object:
2. Defines an explicit schema (that matches the schema defined for the case class
Person) so that the data in the input CSV file can be read into a Spark DataFrame
object and then converted to a Spark Dataset object.
3. Calls to the Spark processor’s function processDS (defined for both the batch
processing class SDSparkDatasetBatchProcessor and the simplified single-crypto-
operation processing class SDSparkDatasetProcessor). This function returns
another, transformed Dataset with the cryptographic results.
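The following is a minimal Scala sketch of this Dataset-based setup. The Person fields shown are illustrative only and do not match the exact schema shipped with the template:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}

case class Person(name: String, email: String, ssn: String)

object DatasetFlowSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dataset-flow-sketch").getOrCreate()
    import spark.implicits._

    // An explicit schema that matches the case class, so the CSV can be read as
    // a DataFrame and then converted to a Dataset[Person].
    val schema = StructType(Seq(
      StructField("name", StringType),
      StructField("email", StringType),
      StructField("ssn", StringType)))

    val people = spark.read.schema(schema)
      .csv("voltage/spark-plaintext-sample-data/plaintext.csv")
      .as[Person]

    // The shipped driver would now hand the Dataset to the processor's processDS
    // function and write the transformed result; here we only display a few rows.
    people.show(10)
    spark.stop()
  }
}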
• The Batch Processors - These processors batch multiple cryptographic operations into
single direct calls to the shared Java Crypto interface methods
protectFormattedDataList and accessFormattedDataList. The real
advantage of these processors exists when the REST API is used for protecting or
accessing sensitive data, improving performance by avoiding network transactions for
individual cryptographic operations.
The Spark batch processor classes for these two variants are SDSparkBatchProcessor and
SDSparkDatasetBatchProcessor.
These Spark processor classes and their inner class SparkCrypto perform batched
cryptographic processing for their Spark jobs, potentially distributing the job across the
nodes of your Hadoop cluster. These classes gather the data to be protected (as
opposed to the data to be passed through, as is) into batches of a configurable size
(2000 by default), protecting or accessing all of the strings in each of the configured
columns as a single cryptographic operation before moving on to the next batch of
strings. Batch processing is especially important for efficiency when using the
Web Services API (the REST API), but is inherently more complicated. Successful
cryptographic processing of the input RDD or Dataset results in a different, transformed
RDD or Dataset being returned to the Spark driver. (A minimal sketch of this batching
pattern appears after these processor descriptions.)
NOTE: The Spark Developer Template source code, as shipped, calls these Spark
processors. To call the simple Spark processors, SDSparkProcessor or
SDSparkDatasetProcessor, you will need to modify the corresponding driver
source code, commenting out the code that calls the batch Spark processor and
uncommenting the code that calls the simple Spark processor.
The Spark simple processor classes for these two variants are SDSparkProcessor and
SDSparkDatasetProcessor.
These Spark processor classes and their inner class SparkCrypto perform single-
crypto-op processing for the Spark job, potentially distributing the job across the nodes
of your Hadoop cluster. These classes iterate through the input RDD or Dataset,
cryptographically processing the relevant strings one at a time and adding the results
to the output RDD or Dataset. Input strings not configured for cryptographic
processing are echoed, as is, to the output RDD or Dataset. It is important to note that
no batch processing is performed by these Spark processors: each individual string
to be protected or accessed is processed by itself, even when using the
Web Services API (the REST API). This can be very inefficient and is not appropriate
for production environments, but the code is much simpler and easier to follow.
NOTE: To use these processors, minor changes to the corresponding driver source
code are required to adjust which processor is called in each case: comment out the
code that calls the batch Spark processor and uncomment the code that calls the
simple Spark processor in the corresponding driver.
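The following Scala sketch illustrates the general partition-level batching pattern described above, using the default batch size of 2000. The shipped batch processor classes rely on the template's configuration and crypto classes, so their internals differ from this simplified form; protectBatch is a placeholder for a list-based protect call such as a single REST list operation:

import org.apache.spark.rdd.RDD

object BatchingPatternSketch extends Serializable {
  val BatchSize = 2000

  // Placeholder for a list-based protect or access call.
  private def protectBatch(values: Seq[String]): Seq[String] = values

  // Protect one column (by index) of an RDD of split CSV rows, issuing one
  // cryptographic call per batch of up to 2000 values rather than one per value.
  def protectColumn(rows: RDD[Array[String]], index: Int): RDD[Array[String]] =
    rows.mapPartitions { partition =>
      partition.grouped(BatchSize).flatMap { batch =>
        val protectedValues = protectBatch(batch.map(_(index)))
        batch.zip(protectedValues).map { case (row, value) =>
          row.updated(index, value)
        }
      }
    }
}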
The index-based approach to protecting and accessing columns in these variants of the Spark
Developer Template is based on the creation of an index-based crypto map using the XML
configuration file vsspark-rdd.xml, which associates a column index with a cryptId, the latter
of which provides the Voltage SecureData protection format, protection API, authentication
information, and so on.
There is an important difference between the RDD and Dataset variants of the Spark
Developer Template, relevant to both the batch and the simple processors and their use of an
index-based approach to specifying which columns to protect and access: the Dataset
variant must take extra steps to use the current index and reflection to retrieve column values
as strings to pass to the performCrypto method of the relevant SparkCrypto object, and to
convert a processed row back to a case class Person record in the resulting Dataset object:
• Convert the processed row back to a case class Person record in the resulting Dataset object:
convertedRows(i) = convertToPerson(savedRows(i))
NOTE: These code examples are from the Dataset variant batch processor in the file
SDSparkDatasetBatchProcessor.scala.
UDF-Based Variants
This section discusses the driver and processor functionality for the DataFrame, Spark SQL,
and HiveUDF variants of the Spark Developer Template, which share enough functionality to
warrant common description.
The Spark Developer Template also provides PySpark versions of all three UDF variants. The
DataFrame and Spark SQL versions of these variants work by having a PySpark driver
module call the shared Scala processor class through a UDF. The HiveUDF version of these
variants works by having a PySpark driver module call the Java Hive classes directly.
• The Scala version of the DataFrame variant calls the
protectColumn and accessColumn functions of the SDSparkUDFProcessor Scala
class by first calling UDF functions of the same name (protectColumn and
accessColumn) in the SDSparkDataFrameDriver Scala object. It then creates the
output DataFrame object by calling the protect or access UDF using the withColumn
function of the DataFrame object for each column to be protected or accessed,
respectively. (A minimal sketch of this UDF pattern appears after these variant descriptions.)
The PySpark version of the DataFrame variant has Python functions that are also
named protectColumn and accessColumn. These Python functions use PySpark to
call down into the Scala code in the SDSparkDataFrameDriver and
SDSparkUDFProcessor classes, as follows:
Level 1:
These Python functions use the PySpark library functions _to_seq and
_to_java_column when calling the Scala functions protectColumn and
accessColumn in the SDSparkDataFrameDriver Scala object (see level 2
below). This is required to convert the input to a Java sequence as required by
level 2 processing. Upon return, the PySpark library function Column is used to
cast the results back to a form Python can interpret.
It then creates the output DataFrame object by calling these functions within
the withColumn function of the DataFrame object for each column to be
protected or accessed, respectively.
Level 2:
These Scala functions create and return UDFs that call the functions of the
same names in the SDSparkUDFProcessor Scala class to perform the
cryptographic processing (see level 3 below). They also need to use currying
to combine the cryptId and col parameters at the Python level into the
single parameter cryptId at this level.
NOTE: For the Scala version of the DataFrame variant, function main in
the SDSparkDataFrameDriver object calls these functions within the
withColumn function of the DataFrame object for each column to be
protected or accessed, respectively.
Level 3:
The Scala functions at this level use the SparkCrypto class to access
the shared CryptoFactory Java code to perform the actual
cryptographic operations.
Both of these variants use the where function of the DataFrame object to limit the
processing to the first 10 rows of data in the input DataFrame object, producing only
10 rows of data in the output DataFrame object, resulting in only 10 lines being written
to the output CSV file at the end of driver processing.
• Both versions of the Spark SQL variant create a temporary database view of the data in
the DataFrame object in preparation for the upcoming SQL query, one in the Scala
driver and one in the PySpark driver. They both register the protectColumn and
accessColumn functions of the SDSparkUDFProcessor class as UDFs of the same
name, but somewhat differently from the two different Spark drivers:
• The Scala version of the Spark SQL variant registers the UDFs in the sibling
functions registerProtectUDF and registerAccessUDF, both of which are
called by the function main.
• The PySpark version of the Spark SQL variant calls these Scala UDF registration
functions registerProtectUDF and registerAccessUDF in the
SDSparkSQLDriver object by using a JVM version of the SQL Context.
Then, using the sql function of the SQLContext object contained within the
SparkSession object, they both run a SQL query that calls the protect or access UDF
for each column to be protected or accessed, respectively. And as with all of the other
UDF variants, the query is limited to the first 10 rows of data in the temporary database
view, producing only 10 rows of data in the output DataFrame object, resulting in only
10 lines being written to the output CSV file at the end of driver processing.
NOTE: Due to how the SQL SELECT statements are used for this variant, only the
protected or accessed data appears in the output DataFrame object and output CSV
file. As shipped, these fields are the final four fields in each row of data: the email
address, the birth date, the credit card number, and the Social Security number.
• Both the Scala version and the PySpark version of the HiveUDF variant create a temporary
database view of the data in the DataFrame object in preparation for the upcoming
SQL query. Unlike the other two types of UDF variants, they both create temporary
protect and access UDFs from the Hive classes ProtectData and AccessData, also
included in the Hadoop Developer Templates.
Then, using the sql function of the SQLContext object contained within the
SparkSession object, they both run a SQL query that calls the protect or access Hive
UDF for each column to be protected or accessed, respectively. And as with all of the
other UDF variants, the query is limited to the first 10 rows of data in the temporary
database table, producing only 10 rows of data in the output DataFrame object,
resulting in only 10 lines being written to the output CSV file at the end of driver
processing.
NOTE: As with both versions of the Spark SQL variant, due to how the SQL SELECT
statements are used for this variant, only the protected or accessed data appears in
the output DataFrame object and output CSV file. As shipped, these fields are the
final four fields in each row of data: the email address, the birth date, the credit card
number, and the Social Security number.
NOTE: As with the Hive Developer Template, the UDF variants of the Spark Developer
Template are inherently limited to protecting and accessing sensitive data one value at a
time (in other words, no batch processing is possible). Especially when subject to the
network overhead of the Voltage SecureData Web Services API (the REST
API) used for tokenization, processing large sets of data using UDFs can be prohibitively
slow. That is why the UDF variants limit their processing to the first 10 rows of the 10K rows
in the input CSV file, as described above.
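The following Scala sketch illustrates the general UDF pattern shared by the DataFrame and Spark SQL variants. It is not the shipped driver or processor code: protectValue is a placeholder for the SDSparkUDFProcessor call, the in-memory rows stand in for the sample CSV data, and the cryptId name alpha follows the configuration examples in this guide:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object UdfPatternSketch extends Serializable {
  // Placeholder for the cryptographic call made by SDSparkUDFProcessor.
  private def protectValue(cryptId: String)(value: String): String = value

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("udf-pattern-sketch").getOrCreate()
    import spark.implicits._

    // A tiny in-memory stand-in for the sample data.
    val df = Seq(("Pat", "pat@example.com"), ("Lee", "lee@example.com")).toDF("name", "email")

    // DataFrame variant: currying fixes the cryptId so the resulting UDF takes
    // only the column to process.
    val protectAlpha = udf(protectValue("alpha") _)
    val protectedDf = df.withColumn("email", protectAlpha(col("email"))).limit(10)

    // Spark SQL variant: register the function and call it from a SQL query.
    spark.udf.register("protectColumn", (value: String, cryptId: String) => protectValue(cryptId)(value))
    df.createOrReplaceTempView("people")
    val sqlResult = spark.sql("SELECT protectColumn(email, 'alpha') AS email FROM people LIMIT 10")

    protectedDf.show()
    sqlResult.show()
    spark.stop()
  }
}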
For both versions of all three UDF variants, the parameters to the UDFs are as follows:
• The name of the column in the DataFrame object, as established when its schema was
defined, which is also mapped to the temporary database view or table for the
Spark SQL and HiveUDF variants, respectively. This will provide the UDF with the
plaintext or ciphertext value to be protected or accessed, respectively.
• The name of a cryptId specified in the XML configuration file vsconfig.xml, which in
turn specifies the information required to protect or access the values in the specified
column. This information include a data protection format, the Voltage SecureDataAPI,
the identity and authentication information for cryptographic key derivation, and so on.
This cryptId approach for protecting and accessing columns in these variants of the
Spark Developer Template is based on one loaded-on-demand entry for each unique
combination of cryptographic processing choices (format, identity, choice of API,
translator class, if any, and so on), but not necessarily for each column of data to be
processed (two columns can use the same cryptId and thus same crypto map entry).
NOTE: Note that for the DataFrame variants, at the level of the
SDSparkDataFrameDriver, these two parameters are effectively combined into one
parameter through the use of currying. This was necessary in order to have the cryptId
be treated as a string instead of as the name of a second database column.
The return value of each UDF call is the protected or accessed value, collected into the
resulting DataFrame object. This is either with the unprocessed column values in the case of
the DataFrame variant, or without the unprocessed column values in the case of the Spark
SQL and HiveUDF variants. All of the UDF drivers end by writing the contents of the
DataFrame object to the output CSV file.
The HiveUDF variant of the Spark Developer Template does not use the
SDSparkUDFProcessor class, instead relying directly on the protect and access UDFs
provided by the Hive Developer Template, which serve as virtual processors for this variant of
the Spark Developer Template:
com.voltage.securedata.hadoop.hive.ProtectData
com.voltage.securedata.hadoop.hive.AccessData
NOTE: The wrapper object used to serialize the Hive UDFs also uses lazy initialization,
allowing the Hive UDFs to work in the context of a Spark session.
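The following Scala sketch shows the shape of the HiveUDF-variant flow. It assumes the Hive Developer Template UDF JAR is on the Spark classpath; the temporary function name protectdata and the (value, cryptId) call shape follow the UDF parameter conventions described above but are otherwise assumptions, and the in-memory view stands in for the sample data:

import org.apache.spark.sql.SparkSession

object HiveUdfFlowSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive-udf-flow-sketch")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._

    // A tiny in-memory stand-in for the sample data, exposed as a temporary view.
    Seq(("Pat", "pat@example.com"), ("Lee", "lee@example.com"))
      .toDF("name", "email")
      .createOrReplaceTempView("people")

    // Register the Hive protect UDF from the Hive Developer Template as a
    // temporary function, then call it from a Spark SQL query limited to 10 rows.
    spark.sql(
      "CREATE TEMPORARY FUNCTION protectdata AS 'com.voltage.securedata.hadoop.hive.ProtectData'")
    spark.sql("SELECT protectdata(email, 'alpha') AS email FROM people LIMIT 10").show()

    spark.stop()
  }
}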
For the RDD and Dataset variants of the Spark Developer Template, note that the batch
processors log more information than the single-crypto-operation processors. The former
processor logs information about each batch as it is processed, while the latter processor just
logs information about which type of cryptographic operation (protect or access) is being
performed for the entire job (log entries for each individual cryptographic operation would
produce too much log output for most purposes).
For all versions of the UDF variants of the Spark Developer Template, both within the class
SDSparkUDFProcessor and within the Hive UDFs, logging is initialized but no logging calls are
included by default, due to the volume of log output that would be generated for single-value
protection and access operations. You can add temporary logging calls as needed for your own
debugging purposes.
The Spark Developer Template uses a total of three configuration files. Two of these
configuration files are the same as the ones used by the other Hadoop Developer Templates.
The third configuration file is specific to the RDD and Dataset variants of the Spark Developer
Template and contains Spark-specific configuration settings about the fields to be protected or
accessed. These settings are similar to the component-specific settings for the MapReduce
Developer Template in the configuration file vsconfig.xml. These variants of the Spark
Developer Template take advantage of common configuration processing functionality that
allows their Spark-specific settings to be isolated into a separate configuration file for index-
based column protection settings.
• vsconfig.xml - The Spark Developer Template uses the version of this XML
configuration file used by the other Hadoop Developer Templates, located in the
following directory:
<install_dir>/config
NOTE: This configuration file contains field configuration settings for the MapReduce
and Sqoop templates, but the parallel settings for the RDD and Dataset variants of
the Spark Developer Template are defined in the XML configuration file vsspark-
rdd.xml, described below.
For more information about these settings, see "Configuration Settings" (page 3-5),
"vsconfig.xml" (page 3-33), "Common Configuration" (page 3-57) and "Hadoop
Configuration" (page 3-60).
• vsauth.xml - The Spark Developer Template uses the version of this configuration
file used by the other Hadoop Developer Templates, located in the following directory:
<install_dir>/config
For more information about these settings, see "Configuration Settings" (page 3-5),
"vsauth.xml" (page 3-38), "Common Configuration" (page 3-57), and the comments in
the XML configuration file itself.
• vsspark-rdd.xml - The settings in this XML configuration file provide the index-
based field settings that control which fields (columns) in the Spark job’s input CSV file
(and RDD or Dataset) are subject to cryptographic processing. For example:
<fields>
<field index = "7" cryptId = "alpha"/>
<field index = "8" cryptId = "date"/>
<field index = "9" cryptId = "cc"/>
<field index = "10" cryptId = "ssn"/>
</fields>
The RDD and Dataset variants of the Spark Developer Template use this type of XML
configuration file. These are the same type of index-based field configuration settings
that are used by the MapReduce Developer Template in the XML configuration file
vsconfig.xml.
Before you begin to modify the Spark Developer Template XML configuration files for your own
purposes, such as using your own Voltage SecureData Server or different data formats, Micro
Focus Data Security recommends that you first run the Spark Developer Template samples as
provided, giving you assurance that your Hadoop cluster is configured correctly and
functioning as expected.
You will need to update many of the values in the XML configuration files vsauth.xml,
vsconfig.xml, and vsspark-rdd.xml in order to protect your own data using your own
Voltage SecureData Server.
CAUTION: This sample approach may not be appropriate in your production Spark
integrations.
The same alternative approaches to configuration that are suggested for the MapReduce
Developer Template can be considered for your production Spark integrations. For more
information, see "Other Approaches to Providing Configuration Settings" (page 3-52).
The Spark Developer Template sample jobs use the same sample data as the MapReduce
Developer Template, located in the following file:
<install-dir>/sampledata/plaintext.csv
This file contains 10,000 rows of dummy sample data consisting of names, addresses, email
addresses, birth dates, credit card numbers, and Social Security numbers. The XML
configuration file vsspark-rdd.xml, by default, specifies the protection of the final four fields
of each line using cryptIds named alpha, date, cc, and ssn, respectively.
The script run-spark-prepare-job copies the sample input CSV file from the local file
system to a Spark sample data directory in HDFS (voltage/spark-plaintext-sample-
data) within the user’s home directory, which is where the Spark protect scripts, such as run-
spark-protect-rdd-job, by default, expect to find it.
Run-Time Prerequisites
To run the Spark sample job and to write the results to HDFS, the following services need to be
configured and running on the Hadoop cluster:
• YARN or Mesos (if you are using the YARN resource manager or Mesos cluster)
Also, make sure that the dependency versions specified in the file build.sbt are correct for
the version of Spark you are using.
If you make a change to the value of most of the following variables in one script, you will need
to make a corresponding change in the relevant upstream or downstream script. The variables
are:
• plaintextDir - In the prepare job, the HDFS directory (relative to the user’s home
directory) into which the original plaintext CSV file is copied.
• inputDir - In the protect and access jobs, the HDFS directory (relative to the user’s
home directory) from which the source original plaintext CSV file and protected CSV file,
respectively, are processed.
• outputDir - In the protect and access jobs, the HDFS directory (relative to the user’s
home directory) into which the resulting protected CSV file and accessed CSV file,
respectively, are placed. The output files (part-*) from the different Spark partitions
are also placed in this directory.
• plaintextFilename - In the prepare and protect jobs, the name of the original
plaintext CSV file that is the HDFS copy destination of the former job and that provides
the input to the latter job.
• protectedFilename - In the protect and access jobs, the name of the protected CSV
file in HDFS that receives the output of the former job and that provides the input to the
latter job, respectively. Also the name of the protected CSV file that will be saved locally
in the following local directory:
<install-dir>/spark/sampledata
• accessedFilename - In the access job, the name of the accessed CSV file in HDFS
that receives the output of that job. Also the name of the accessed CSV file that will be
saved locally in the following local directory:
<install-dir>/spark/sampledata
1. Build the target Spark JAR file (see “Building the Spark Developer Template” on page 2-
17).
2. Optionally edit the input and output path and filename variables in the prepare, protect,
or access scripts. For more information, see "Changing the Input and Output Locations
and Filenames on HDFS" (page 7-15).
b. Run the Spark prepare sample job by running the following script:
./run-spark-prepare-job
While, in general, you can run the protect job for any of the variants of the Spark Developer
Template, including the PySpark versions, independently of any other variant, you do need to run
a particular protect job before you run an access job because the output of the former (protect)
job serves as the input to the latter (access) job. However, as shipped, these pairings of protect
and access jobs may not be intuitive. The pairings are as follows:
Table 7-1 Protect Script Prerequisites for Each Access Script
Before you run this access script: You must run this protect script:
run-spark-access-rdd-job run-spark-protect-rdd-job
run-spark-access-dataset-job run-spark-protect-dataset-job
run-spark-access-dataframe-job run-spark-protect-rdd-job
run-spark-access-sql-job run-spark-protect-rdd-job
run-spark-access-hive-job run-spark-protect-rdd-job
run-pyspark-access-dataframe-job run-spark-protect-rdd-job
run-pyspark-access-sql-job run-spark-protect-rdd-job
run-pyspark-access-hive-job run-spark-protect-rdd-job
The steps to run one or more of the job scripts for the variants of the Spark Developer
Template, including the PySpark versions, are as follows:
1. Change the file permissions of the XML configuration file vsauth.xml so that only the
current user can read it (this XML configuration file contains sensitive authentication
credentials that must be kept private):
chmod 0600 <install_dir>/config/vsauth.xml
2. Change the current directory to the Spark scripts directory:
cd <install_dir>/spark/bin
3. Run one or more of the variants of the Spark protect sample jobs, including the PySpark
protect sample jobs, as follows:
./run-spark-protect-rdd-job
./run-spark-protect-dataset-job
./run-spark-protect-dataframe-job
./run-spark-protect-sql-job
./run-spark-protect-hive-job
./run-pyspark-protect-dataframe-job
./run-pyspark-protect-sql-job
./run-pyspark-protect-hive-job
NOTE: The UDF variants (DataFrame, Spark SQL, and HiveUDF), as shipped, produce
only a small set of protection results, and the Spark SQL and HiveUDF results include
only the data from the protected columns (unprocessed columns do not appear in the
output). These characteristics apply to the PySpark versions as well.
4. Run one or more of the variants of the Spark access sample jobs, as follows, paying
attention to each access job’s protect job prerequisite as shown in Table 7-1 on page 7-
17:
./run-spark-access-rdd-job
./run-spark-access-dataset-job
./run-spark-access-dataframe-job
./run-spark-access-sql-job
./run-spark-access-hive-job
./run-pyspark-access-dataframe-job
./run-pyspark-access-sql-job
./run-pyspark-access-hive-job
NOTE: The UDF variants (DataFrame, Spark SQL, and HiveUDF), as shipped, produce
only a small set of access results, and the Spark SQL and HiveUDF results include
only the data from the accessed columns (unprocessed columns do not appear in the
output). These characteristics apply to the PySpark versions as well.
For example, to run the Spark job using YARN with a deploy mode of cluster when using HDP,
the following additional arguments must be added to the spark-submit command:
spark-submit --master yarn --deploy-mode cluster --jars ${...
Depending on your Hadoop distribution, you may need to use some combination of the
--master and --deploy-mode arguments to run the Spark sample jobs in a non-default
mode.
For more information for your distribution, use the --help argument to the spark-submit
command.
NOTE: All of these scripts expect the current directory in the local file system to be set to
<install_dir>/spark/bin when they are run.
run-spark-prepare-job
This script:
• Copies the original plaintext CSV file from the local file system to the expected location
in HDFS.
• Copies the three configuration files used by the Spark Developer Template from the
local file system to their default location in HDFS.
• Copies all of the necessary Hadoop and common infrastructure JAR files to the directory
<install_dir>/spark/lib on the local file system.
Invocation:
./run-spark-prepare-job
This script must be run from the directory <install_dir>/spark/bin and it has no
parameters.
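For example, after the script completes, you can verify the locally staged JAR files (an illustrative command):
ls <install_dir>/spark/lib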
run-spark-protect-rdd-job
run-spark-protect-dataset-job
run-spark-protect-dataframe-job
run-spark-protect-sql-job
run-spark-protect-hive-job
These scripts run the Spark protect sample jobs for the Scala versions of each of the five
variants of the Spark Developer Template. They protect the sample data in the shared sample
data file plaintext.csv and consolidate the output from the various Spark partitions running
the protect job into a single output file in the following directories on both the local file system
and on HDFS:
Local File System: <install_dir>/spark/sampledata
HDFS: <users_home_dir>/voltage/spark-protected-sample-data/<variant>
NOTE: This directory will contain each partition’s results in files with names
beginning with part-, as well as the consolidated results in a file named as
described below.
As shipped, each of the variants produces an output file of protected sample data using a
unique filename for its variant in the directories described above:
• RDD: protectedRDD.csv
• Dataset: protectedDataset.csv
• DataFrame: protectedDataFrame.csv
• HiveUDF: protectedHiveData.csv
Invocations:
./run-spark-protect-rdd-job
./run-spark-protect-dataset-job
./run-spark-protect-dataframe-job
./run-spark-protect-sql-job
./run-spark-protect-hive-job
These scripts must be run from the directory <install_dir>/spark/bin and they have no
parameters.
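For example, after the RDD protect job completes, you can compare a few plaintext records with the corresponding protected records on the local file system (illustrative commands, using the default paths and filenames described above):
head -3 <install_dir>/sampledata/plaintext.csv
head -3 <install_dir>/spark/sampledata/protectedRDD.csv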
Note that each Spark protect script includes code that detects the Hadoop distribution in use
and then attempts to ensure that the correct version of Spark (Spark2) is being invoked for that
distribution. This can be an issue when multiple versions of Spark are installed at the same time.
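If you need to confirm which Spark installation a given launcher resolves to on your cluster, the standard version flag can help (illustrative commands; spark2-submit exists only on distributions that install Spark2 under a separate command name):
which spark-submit
spark-submit --version
spark2-submit --version    # only on distributions with a separate Spark2 launcher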
run-spark-access-rdd-job
run-spark-access-dataset-job
run-spark-access-dataframe-job
run-spark-access-sql-job
run-spark-access-hive-job
These scripts run the Spark access sample jobs for the Scala versions of each of the five
variants of the Spark Developer Template. They access the sample data created by a particular
prerequisite Spark protect sample job, as described in Table 7-1 on page 7-17, and consolidate
the output from the various Spark partitions running the access job into a single output file in
the following directories on both the local file system and on HDFS:
Local File System: <install_dir>/spark/sampledata
HDFS: <users_home_dir>/voltage/spark-accessed-sample-data/<variant>
NOTE: This directory will contain each partition’s results in files with names
beginning with part-, as well as the consolidated results in a file named as
described below.
As shipped, each of the variants produces an output file of accessed sample data using a
unique filename for its variant in the directories described above:
• RDD: accessedRDD.csv
• Dataset: accessedDataset.csv
• DataFrame: accessedDataFrame.csv
• HiveUDF: accessedHiveData.csv
Invocations (after running the appropriate protect sample job for each access sample job, as
described in Table 7-1 on page 7-17):
./run-spark-access-rdd-job
./run-spark-access-dataset-job
./run-spark-access-dataframe-job
./run-spark-access-sql-job
./run-spark-access-hive-job
These scripts must be run from the directory <install_dir>/spark/bin and they have no
parameters.
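As a quick sanity check (illustrative; if the round trip succeeded, the accessed values should match the original plaintext, although field ordering or quoting may differ):
diff <install_dir>/sampledata/plaintext.csv <install_dir>/spark/sampledata/accessedRDD.csv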
Note that each Spark access script includes code that detects the Hadoop distribution in use
and then attempts to ensure that the correct version of Spark (Spark2) is being invoked for that
distribution. This can be an issue when multiple versions of Spark are installed at the same time.
run-pyspark-protect-dataframe-job
run-pyspark-protect-sql-job
run-pyspark-protect-hive-job
These scripts run the Spark protect sample jobs for the Python versions of each of the three
UDF variants of the Spark Developer Template. They protect the sample data in the shared
sample data file plaintext.csv and consolidate the output from the various Spark partitions
running the protect job into a single output file in the following directories on both the local file
system and on HDFS:
Local File System: <install_dir>/spark/sampledata
HDFS: <users_home_dir>/voltage/pyspark-protected-sample-data/<variant>
NOTE: This directory will contain each partition’s results in files with names
beginning with part-, as well as the consolidated results in a file named as
described below.
As shipped, each of the variants produces an output file of protected sample data using a
unique filename for its variant in the directories described above:
• DataFrame: protectedPySparkDataFrame.csv
• HiveUDF: protectedPySparkHiveData.csv
Invocations:
./run-pyspark-protect-dataframe-job
./run-pyspark-protect-sql-job
./run-pyspark-protect-hive-job
These scripts must be run from the directory <install_dir>/spark/bin and they have no
parameters.
Note that each Spark protect script includes code that detects the Hadoop distribution in use
and then attempts to ensure that the correct version of Spark (Spark2) is being invoked for that
distribution. This can be an issue when multiple versions of Spark are installed at the same time.
run-pyspark-access-dataframe-job
run-pyspark-access-sql-job
run-pyspark-access-hive-job
These scripts run the Spark access sample jobs for the Python versions of each of the three
UDF variants of the Spark Developer Template. They access the sample data created by a
particular prerequisite Spark protect sample job, as described in Table 7-1 on page 7-17, and
consolidate the output from the various Spark partitions running the access job into a single
output file in the following directories on both the local file system and on HDFS:
Local File System: <install_dir>/spark/sampledata
HDFS: <users_home_dir>/voltage/pyspark-accessed-sample-data/<variant>
NOTE: This directory will contain each partition’s results in files with names
beginning with part-, as well as the consolidated results in a file named as
described below.
As shipped, each of the variants produces an output file of accessed sample data using a
unique filename for its variant in the directories described above:
• DataFrame: accessedPySparkDataFrame.csv
• HiveUDF: accessedPySparkHiveData.csv
Invocations (after running the appropriate protect sample job for each access sample job, as
described in Table 7-1 on page 7-17):
./run-pyspark-access-dataframe-job
./run-pyspark-access-sql-job
./run-pyspark-access-hive-job
These scripts must be run from the directory <install_dir>/spark/bin and they have no
parameters.
Note that each Spark access script includes code that detects the Hadoop distribution in use
and then attempts to ensure that the correct version of Spark (Spark2) is being invoked for that
distribution. This can be an issue when multiple versions of Spark are installed at the same time.
update-spark-config-files-in-hdfs
This script updates the Spark-specific XML configuration file for the Spark Developer Template
(vsspark-rdd.xml) from the local file system to its default location in HDFS.
Use this script in conjunction with the generic Hadoop Developer Templates script
update-config-files-in-hdfs in the directory <install_dir>/bin to update all three
configuration files used by the Spark Developer Template. For more information about this
generic script, see "Loading Updated Configuration Files into HDFS" (page 3-77).
Invocation:
./update-spark-config-files-in-hdfs
This script must be run from the directory <install_dir>/spark/bin and it has no
parameters.
TrustStores Used by the Spark Developer Template
The Spark Developer Template uses the same approach to trustStore management as the
other Hadoop Developer Templates. This involves the use of the class
TruststoreInitializer to duplicate the trusted root certificates in the OpenSSL trustStore
in the JVM trustStore, eliminating the need to directly manage such certificates in both of these
trustStores. For more information, see "Multiple Developer Template TrustStores - Background
and Usage" (page 3-56).
Using the Spark Web Server
The Spark web UI can be accessed at a URL of the form:
http://<Spark_UI_Server>:<Port>
Where <Spark_UI_Server> and <Port> are dependent on the Hadoop distribution you
are using.
For example, in the Ambari user interface, for version 2.6.1 of the HDP (Hortonworks),
information about the Spark UI Server can be found by clicking on Spark2, then Quick Links,
then Spark UI Server.
The History Server in the Spark UI lists the Spark jobs, with the most recent jobs listed first.
Information about each job can be viewed by clicking its App ID link.
To view environment information about a particular job, click the App ID link and then the
Environment tab.
To view job logs, click the App ID link, go to the Executors tab, and choose stderr. When
running under YARN, the job logs are displayed within the YARN ResourceManager UI.
Limitations of the Spark Developer Template
The Spark Developer Template demonstrates a simple approach to integrating the Voltage
SecureData APIs into a Spark job in order to protect and access sensitive data. There are
several limitations to this approach that are worth mentioning:
• The Spark Developer Template reads its original plaintext sample data from a CSV file. If
your plaintext data is in a different format, you will need to customize the source code
for one or more of the Scala and Python Spark drivers in the following files to convert
the plaintext from your source format to the format (RDD or DataFrame) expected by
the Spark Developer Template processor(s).
• SDSparkDriver.scala
• SDSparkDatasetDriver.scala
• SDSparkDataFrameDriver.scala
• SDSparkSQLDriver.scala
• SDSparkHiveDriver.scala
• sd_pyspark_dataframe_driver.py
• sd_pyspark_sql_driver.py
• sd_pyspark_hive_driver.py
Likewise, if your scenario requires the output to be in a form other than a CSV file, the
final step in these Spark drivers will need to be changed accordingly.
8 NiFi Integration
The NiFi Developer Template demonstrates how to integrate Voltage SecureData data
protection technology in the context of NiFi. This demonstration includes the use of the Simple
API (version 4.0 and greater) and the REST API.
NOTE: Both the Voltage SecureData for Hadoop Developer Templates and NiFi use the term
template in their own way. For the former, template is used to convey the fact that the Java
packages and the associated configuration approach provided for the MapReduce, Hive,
Sqoop, Spark, and NiFi integrations are meant to provide guidance and a starting point for
your own similar, but naturally more robust integrations.
Within NiFi, a template is a save-able and reuse-able set of connected, potentially pre-
configured processors.
In this document, NiFi Developer Template refers to the Voltage SecureData NiFi integration
and NiFi template, when used at all, has the stand-alone NiFi meaning.
NiFi provides an obvious integration opportunity in the form of its individual processors. NiFi
processors provide discrete processing steps in a flow of data. The SecureDataProcessor
NiFi processor provided with the NiFi Developer Template serves as an example of a NiFi
processor that can be configured to either protect or access the data flowing through it, and
works in conjunction with the Java packages in the common infrastructure. This chapter
provides a description of the former as well as instructions on how to run the NiFi Developer
Template in two different ways using the provided sample data. For more information about
the common infrastructure used by the Developer Templates, see Chapter 3, “Common
Infrastructure”.
NOTE: The NiFi Developer Template comes with its own sample data. It has been simplified
even further for demonstration purposes, with just a single type of data, such as credit card
numbers, Social Security numbers, or email addresses, provided in each input file, one data
value per line. For more information about these sample data files, see "Sample Data for the
NiFi Developer Template" (page 8-17).
Another important aspect of understanding the NiFi Developer Template is to understand the
configuration settings it uses. Unlike the other three Datastream Developer Templates, which
read all of their configuration settings from their two XML configuration files (vsconfig.xml
and vsauth.xml), the NiFi Developer Template reads its (global) configuration settings from a
Java Properties configuration file (vsnifi.properties) with each SecureDataProcessor
getting its individual settings through the NiFi user interface.
Much of the documentation related to the global configuration settings relevant to the NiFi
Developer Template is provided in Chapter 3 in the section "Configuration Settings" (page 3-5).
This section provides information about the common infrastructure Java classes used to read
and create in-memory copies of the settings, as well as a description of the individual settings.
This chapter will review these global configuration settings in the context of the NiFi Developer
Template as well as provide information about configuring a SecureDataProcessor using
the NiFi user interface.
• Quick Start Using the Provided NiFi Workflow (page 8-2) - This section provides
instructions for deploying and running the NiFi Developer Template using the provided
pre-configured workflow with the provided sample data and using the public-facing
Voltage SecureData Server dataprotection hosted by Micro Focus Data Security.
NOTE: For instructions about building the NiFi Developer Template, see "Building the
Datastream Developer Templates" (page 2-19).
• Integration Architecture of the NiFi Developer Template (page 8-8) - This section
explains the Java classes that are specific to the NiFi Developer Template including the
classes that implement the SecureDataProcessor and the classes for retrieving the
global configuration settings from the configuration file vsnifi.properties.
• Configuration Settings for the NiFi Developer Template (page 8-10) - This section
reviews the global configuration settings that are relevant to the NiFi Developer
Template and explains the properties set for individual instances of a
SecureDataProcessor using the NiFi user interface.
• Sample Data for the NiFi Developer Template (page 8-17) - This section provides a
description of the simplified sample data provided for the NiFi Developer Template.
• Limitations and Simplifications of the NiFi Developer Template (page 8-20) - This
section provides information about the type of improvements you will need to make in
order to create a production-grade NiFi processor that integrates calls to the Voltage
SecureData APIs.
Quick Start Using the Provided NiFi Workflow
After you have installed the Simple API (see the Voltage SecureData Simple API Installation
Guide) and installed and built the NiFi Developer Template (see "Installing the Developer
Templates" on page 2-7 and "Building the Datastream Developer Templates" on page 2-19),
follow these steps to deploy and run the provided NiFi workflow and exercise the
SecureDataProcessor using the provided sample data and the public-facing Voltage
SecureData Server dataprotection hosted by Micro Focus Data Security:
2. Deploy the Configuration File for the Voltage SecureData NiFi Processor
After making any necessary changes to the Java Properties configuration file
vsnifi.properties, such as to the Simple API install path setting
simpleapi.install.path, copy it to the following directory on your NiFi server’s file
system:
<nifi-install-location>/conf
Start (or restart) your NiFi server, using the script in the following directory:
<nifi-install-location>/bin
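For example (run from the directory containing your edited vsnifi.properties; nifi.sh is the standard Apache NiFi control script and is assumed to be present in the bin directory):
cp vsnifi.properties <nifi-install-location>/conf/
<nifi-install-location>/bin/nifi.sh restart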
When you are working on the NiFi server, the NiFi user interface is usually launched in a
compatible browser from the following URL:
http://<host>:8080/nifi
From the NiFi user interface, begin by clicking the Upload Template button in the
Operate Palette:
The Upload Template dialog box opens. Click the magnifying glass Browse button,
browse to and choose the example template SecureDataExample.xml from the
directory <install-dir>/stream/nifi/template, and then upload it.
From the NiFi user interface, drag the Template tool icon from the Components
Toolbar into the NiFi workspace.
The Add Template dialog box opens. Confirm that the SecureDataExample NiFi
template is displayed in the Choose Template dropdown list and then click Add.
NOTE: If you have imported more than one NiFi template, you may need to choose
the SecureDataExample NiFi template in the Choose Template dropdown list
before clicking Add.
For each of the following processors, right-click in the processor and choose Configure
to open the Configure Processor dialog box. In the Properties tab of this dialog box,
enter an appropriate directory path of your choosing as the value of the indicated
property, and then click Apply:
NOTE: You will need to create three empty directories on your NiFi server’s local file
system to serve as the input, output, and error directories.
If you are using the public-facing Voltage SecureData Server dataprotection and the default
processor settings with the provided sample data in the file creditcard.txt, you will only
need to add a value for the SharedSecret property (use the value voltage123). For more
information about this step, see "Configuring the Properties of the NiFi
SecureDataProcessor" (page 8-12).
Press ctrl-A to select all of the processors in the SecureDataExample workflow and
then click the Start Component button in the Operate Palette.
The red square Stopped icon in front of each processor’s name changes to a green
triangle Started icon.
You are now ready to exercise the SecureDataExample workflow, as explained in the
following section.
If you are using the public-facing Voltage SecureData Server dataprotection, hosted by
Micro Focus Data Security, and default settings for the SecureDataProcessor, with the
addition of the SharedSecret property value voltage123, start by putting a copy of the
sample data file creditcard.txt into the input directory. If the protection operation
succeeds, you will soon find an output file named creditcard.txt in the output directory.
This output file will contain the ciphertext values corresponding to the plaintext values in the
input file.
Continue exercising the workflow by experimenting with protecting the other sample data files.
Remember that you will need to reconfigure the SecureDataProcessor appropriately (the
processor must be stopped to reconfigure it, and then restarted):
Finally, you can exercise the data access aspect of the workflow by using the successful output
files you have generated as input files for access operations. You will need to re-configure the
SecureDataProcessor to perform access operations and continue to exercise care that you
have configured the other properties of the SecureDataProcessor to match the input file
you intend to process.
By default, these settings are configured to use the public-facing Voltage SecureData Server
dataprotection, hosted by Micro Focus Data Security for demonstration purposes. If you are
using this Voltage SecureData Server during your initial experimentation with the NiFi
Developer Template, no changes are necessary (except, perhaps, to the install location of the
Simple API). However, you will need to copy this configuration file to the directory
<nifi-install-location>/conf on your NiFi server’s file system.
NOTE: After you begin using your own Voltage SecureData Server, you will need to change
the configuration settings in this file and copy it (again) to the directory
<nifi-install-location>/conf on your NiFi server’s file system.
As you adapt the NiFi Developer Template code for your own purposes, you are, of course, free
to change how this configuration information is provided to your SecureData processors for use
when initializing the Voltage SecureData APIs you intend to use.
For more information about the parameters that the SecureDataProcessor expects to find
in the configuration file vsnifi.properties (which do not extend the common
configuration properties used by all of the Developer Templates), see "Common Configuration"
(page 3-57).
Integration Architecture of the NiFi Developer Template
NiFi provides an obvious integration mechanism in the form of its extensible processor
architecture. The NiFi Developer Template pursues this approach by providing a SecureData
processor (named SecureDataProcessor) that can protect incoming plaintext or access
incoming ciphertext using Format-Preserving Encryption (FPE) or Secure Stateless
Tokenization™ (SST). The processor can be configured with the standard FPE and SST
parameters: format, identity, and authentication credentials in one of two forms. It can also be
configured to perform the protect or access operation using one of two different SecureData
data protection APIs: the Simple API (version 4.0 and greater) or the REST API (the latter API
can be used for SST processing while the former cannot).
The SecureData processor makes use of the common infrastructure provided with Voltage
SecureData for Hadoop for retrieving global file-based configuration information, providing
data translation and a cryptographic abstraction layer, as well as the REST client. The
SecureDataProcessor provides the following Java packages:
See "Processor Classes for the NiFi Developer Template" (page 8-8).
See "Configuration Classes for the NiFi Developer Template" (page 8-9).
The NiFi integration uses a NiFi processor to perform data protection processing on one or
more incoming plaintexts or ciphertexts. As shipped, it assumes that its input stream contains
one item of data to be protected or accessed “per line”, with no other data on the line, in a
format corresponding to the configured FPE or SST format. In other words, no parsing within a
line is required, and each plaintext or ciphertext is separated from the next one by the
appropriate line termination character(s), either a carriage-return or a carriage-return/line-feed
pair.
Failures are handled by throwing a runtime exception. This exception is caught in the
method onTrigger of the SecureDataProcessor class, which logs the error and
routes the entire flow file to the FAILURE relationship.
The SecureDataProcessor uses the classes in this package to read global configuration
settings from the Java Properties configuration file vsnifi.properties. As shipped, the
SecureDataProcessor does not require any global configuration settings beyond those that
are read and established in-memory by the package com.voltage.securedata.config, as
described in "Common Configuration" (page 3-57).
NOTE: The remaining configuration settings for the NiFi Developer Template are specific to
a given instance of the SecureDataProcessor: the SecureData API to use, the operation
type, and the FPE/SST parameters. This configuration information is integrated with the
processor itself as part of the NiFi processor definition and configured using the NiFi user
interface. For more information about this configuration information, see "Configuring the
Properties of the NiFi SecureDataProcessor" (page 8-12).
The NiFi Developer Template provides a class in this package that modestly extends the
parallel class in the common configuration package com.voltage.securedata.config.
The NiFi Developer Template uses this NiFi-specific class to read and parse the global
configuration values that it requires.
This class wraps the call to the NiFiConfigPopulator class in code that reads the
configuration file from the local file system. This approach allows for the addition of
other types of configuration loaders that load configuration data from different input
sources; reading from the local file system is just one example approach demonstrated
in the NiFi Developer Template.
The NiFi Developer Template configuration implementation provides an example of the types
of configuration information that your custom SecureData NiFi processor will need access to in
order to be able to use one or both of the two Voltage SecureData APIs demonstrated here in
NiFi workflows. It is also an example of one possible approach to making this configuration
information available to your custom NiFi processor.
With respect to error processing, the DataStreamCallback class treats all of the input in a
given input stream (a single file in the provided template workflow) collectively: if any protect or
access operations in a given stream fail, the entire stream fails, resulting in the following actions:
• The input stream is written unchanged to the failure output stream: plaintext remains
plaintext and ciphertext remains ciphertext.
• A custom error attribute on the flow file is set to the text of the error message, allowing
a downstream processor to take action based on the nature of the error.
Finer-grained error processing may be appropriate for your environment. For example, you may
want to re-write the intentionally simplified NiFi Developer Template code such that only
individual cryptographic operations that fail get routed to the failure output stream, regardless
of how the input data is grouped in the input files.
Configuration Settings for the NiFi Developer Template
There are two classes of configuration settings used by the NiFi Developer Template:
simpleapi.policy.url = https://voltage-pp-0000.dataprotection...
simpleapi.install.path = /opt/voltage/simpleapi
rest.hostname = voltage-pp-0000.dataprotection.voltage.com
#product.name =
#product.version =
return.protected.value.on.access.auth.failure = false
Note that the NiFi Developer Template does not require any custom global
configuration settings beyond what can be processed using the common configuration
infrastructure provided for the Voltage SecureData for Hadoop Developer Templates.
The one distinct difference is that the Java Properties configuration file
vsnifi.properties for NiFi is read from the local file system, not from HDFS as for
the Hadoop Developer Templates. For more information about these settings, see
"Common Configuration" (page 3-57) and "Common Configuration" (page 3-57).
Anytime you make changes to this configuration file, you must always copy it to the
following directory in the NiFi server’s file system and start or restart your NiFi server:
<nifi-install-location>/conf
NOTE: You must update most of the values in the Java Properties configuration file
vsnifi.properties in order to protect data using your own Voltage SecureData
Server. However, if possible, before doing so, Micro Focus Data Security recommends
that you first run the provided sample data through the ready-to-use NiFi workflow, to
confirm that your NiFi workflow is configured correctly and functioning as expected. For
more information, see "Adding the SecureDataProcessor
to a Blank Workflow" (page 8-18).
• The remainder of the configuration settings that the NiFi Developer Template requires
to perform protect and access operations (the equivalent of the other types of settings
in the Hadoop XML configuration files vsconfig.xml and vsauth.xml) are handled
differently in the NiFi Developer Template. Because NiFi provides an extensible
mechanism for defining custom properties for a processor, including an extensible user
interface for setting the custom properties, the NiFi Developer Template makes use of
that mechanism.
These built-in processor properties are discussed in the following section, Configuring
the Properties of the NiFi SecureDataProcessor.
NOTE: By design, the Simple API cannot perform SST operations due to its local
processing.
Regardless of which API the SecureDataProcessor is configured to use, you must always
provide the standard set of cryptographic parameters:
• Format - The name of the centrally configured format to which the plaintext to be
protected and the ciphertext to be accessed will conform.
The NiFi Developer Template allows the optional use of NiFi expression processing for
the Format parameter. For more information, see "Format" (page 8-13).
• Identity - The common name portion of the identity that, for FPE, will be used in the
derivation of the cryptographic key that will be used to protect the incoming plaintext or
access the incoming ciphertext. For SST, the identity must match the identity configured
for the specified SST format.
This set of configuration values can be set in the Properties tab of the Configure Processor
dialog box. To open this dialog box, in the NiFi user interface, right-click in an idle
SecureDataProcessor and choose Configure.
The remainder of this section provides information about each of these configuration values, in
the order they appear in the Properties tab of the Configure Processor dialog box.
Operation
Use the Operation property to specify whether this SecureDataProcessor will perform a
protect or access operation. Click in the Value column for this property, choose either
PROTECT or ACCESS in the Value column drop-down box, and then click Ok.
Format
Use the Format property to specify the format of the data to be processed by this
SecureDataProcessor. Enter the name of a data protection format, either the name of one of
the built-in formats such as cc or auto, or the name of a centrally configured format. Click in
the Value column for this property, type the name of the format, and then click Ok.
The NiFi Developer Template allows the optional use of NiFi expression processing for the
Format parameter. For example, if you wanted to specify the data protection format at the
beginning of each input filename (format name as the portion of the filename before the first
underscore character), you could specify the format as:
${filename:substringBefore('_')}
If you use this expression as the value of the Format parameter, then you would need to
rename the sample data files appropriately, to begin with their format name and an underscore.
For example, rename creditcard.txt to cc_creditcard.txt.
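For example (illustrative commands, run in the directory containing the NiFi sample data files):
mv creditcard.txt cc_creditcard.txt
mv ssn.txt ssn_ssn.txt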
NOTE: For relevant information about the API used to enable expression language
support being deprecated, see API Used to Enable Expression Language Support
has been Deprecated at the end of this list of modifications.
• Modifying the method getFormatInfo to accept the FlowFile object as one of its
parameters, enabling flow file attributes such as filename to be accessible when
attempting to evaluate expressions.
As of Apache NiFi version 1.7.0 (June 2018), the following NiFi core API has been marked
as deprecated:
public PropertyDescriptor.Builder
expressionLanguageSupported(boolean supported)
The current version of the SecureDataProcessor code uses this API to turn on
expression language support for this property (Format).
This deprecated API has been replaced by the following API in newer versions of NiFi:
public PropertyDescriptor.Builder
expressionLanguageSupported(ExpressionLanguageScope
expressionLanguageScope)
In most cases, the use of the deprecated API should not cause any issues in your NiFi
environments, other than possibly displaying deprecation warnings when building the
SecureDataProcessor code. However, it is possible that future releases of NiFi may
completely remove the deprecated API. The possible behaviors, depending on NiFi
versions, are as follows:
• Versions 1.7.0 and higher (all known versions at this time): Deprecation
warning. For example:
The method expressionLanguageSupported(boolean) from the
type PropertyDescriptor.Builder is deprecated.
If you get build errors on possible future versions of NiFi where the currently used,
deprecated API has been completely removed, or if you want to correct any deprecation
warnings as of NiFi 1.7.0, you can edit the Java source file SecureDataProcessor.java
and replace the following line of code:
expressionLanguageSupported(true) // this flag enables expres...
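A likely form of the replacement call, assuming the Format property should evaluate expressions against flow file attributes such as filename (ExpressionLanguageScope.FLOWFILE_ATTRIBUTES is the standard NiFi scope for that), is:
expressionLanguageSupported(ExpressionLanguageScope.FLOWFILE_ATTRIBUTES) // scope-based form introduced in NiFi 1.7.0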
If you do, you will also need to import that new ExpressionLanguageScope class, at the
top of that Java source file, as follows:
import org.apache.nifi.expression.ExpressionLanguageScope;
After making these code changes, rebuild the processor using Maven. You should not get
any build warnings or errors related to this API. For more information about building the
NiFi Developer Template, see "Building the Datastream Developer Templates" (page 2-19).
Identity
Use the Identity property to specify the common name portion of the identity to be used for
this cryptographic operation. Click in the Value column for this property, type the identity, and
then click Ok.
NOTE: When using the Voltage SecureData Server dataprotection, maintained by Micro
Focus Data Security for testing purposes, you can use the demonstration identity
[email protected]. When using your own Voltage SecureData Server, make sure that you use
an identity that matches a configured authorization rule.
Auth Method
Use the Auth Method property to specify the authentication method to be used when
connecting to the Voltage SecureData Server, such as to derive the cryptographic key needed
to complete the operation. Click in the Value column for this property, choose either
SharedSecret or UserPassword in the Value column drop-down box, and then click Ok.
SharedSecret
When you have chosen SharedSecret as the value of the Auth Method property, use the
SharedSecret property to specify the shared secret credential to be used when connecting to
the Voltage SecureData Server. Click in the Value column for this property, type the shared
secret, and then click Ok.
NOTE: In order to help keep the shared secret private, this property is configured such that
the text Sensitive value set will be displayed in the user interface instead of the shared secret
itself.
Username
When you have chosen UserPassword as the value of the Auth Method property, use the
Username property to specify the username credential to be used when connecting to the
Voltage SecureData Server. Click in the Value column for this property, type the username, and
then click Ok.
There is no default value for this property or the associated Password property.
Password
When you have chosen UserPassword as the value of the Auth Method property, use the
Password property to specify the password credential to be used when connecting to the Key
Server. Click in the Value column for this property, type the password, and then click Ok.
NOTE: In order to help keep the password private, this property is configured such that the
text Sensitive value set will be displayed in the user interface instead of the password itself.
There is no default value for this property or the associated Username property.
API Type
Use the API Type property to specify which of the two available Voltage SecureData APIs will
be used to perform the cryptographic operation. Click in the Value column for this property,
choose either simpleapi or rest in the Value column drop-down box, and then click Ok.
• If any of the required processor properties are not set, NiFi will report an error such as
the following:
'Format' is invalid because Format is required
NOTE: NiFi performs this type of custom validation only if more basic required
property validation is successful. In other words, if no value is set for a required
property such as Format or Identity, an error message related to that issue is
displayed. After such basic validation succeeds, custom validation, such as that used
for the authentication credentials, is performed and reported.
Of course, you can always modify the definition of the SecureDataProcessor such
that custom properties are allowed.
You must correct all configuration errors, as described above, before you can start the
SecureDataProcessor. After all configuration errors have been corrected, and the required
relationships have been configured (as described in the following section, SecureDataProcessor
Relationship Configuration), the yellow warning icon for the SecureDataProcessor will
change to a red square. This indicates that the processor is currently stopped and is ready to
be started.
If either relationship is not connected or auto-terminated, NiFi displays a warning icon with an
error message such as the following, and does not allow the processor to be started:
'Relationship failure' is invalid because Relationship 'failure' is
not connected to any component and is not auto-terminated
Sample Data for the NiFi Developer Template
• creditcard.txt - Protect the credit card data in this file using the built-in FPE format
cc.
You can also use the SST format cc-sst-6-4 for this input data, if you also change the
API Type property to rest, indicating the use of the REST API for the SST operations.
• ssn.txt - Protect the social security number data in this file using the built-in FPE
format ssn.
• email.txt - Protect the email address data in this file using the pre-configured FPE
format Alphanumeric.
• date.txt - Protect the date data in this file using the FPE format DATE-ISO-8601,
pre-configured on the dataprotection Voltage SecureData Server hosted by Micro
Focus Data Security.
• name.txt - Protect the name data in this file using the FPE format
AlphaExtendedTest1, pre-configured on the dataprotection Voltage SecureData
Server hosted by Micro Focus Data Security. Note that this file contains characters
beyond the normal ASCII range, which will be protected using a Variable-Length String
(VLS) format configured to support extended character sets using FPE2. Also note that
you must use the REST API (which itself requires version 6.0 or greater of the Voltage
SecureData Server) or version 5.0 or greater of the Simple API to use this type of format.
If you are using your own Voltage SecureData Server to process the sample data in either of the
files date.txt, or name.txt, or to tokenize the data in the file creditcard.txt, you will
need to create the corresponding format(s) in your Voltage SecureData Server, as described in
Appendix A, “Voltage SecureData Server Configuration”.
Adding the SecureDataProcessor to a Blank Workflow
After you have built and deployed the SecureDataProcessor (see "Building the Datastream
Developer Templates" on page 2-19 and steps 1, 2, and 3 in "Exercising the
SecureDataExample Workflow" on page 8-6), you can use it in a workflow. To do so, follow
these steps:
TIP: Search for “SecureData” in the Add Processor dialog box to avoid scrolling to
near the bottom of the list of processors.
NOTE: If you have not done so already, you will also need to provide configuration
settings for your Voltage SecureData Server in the configuration file
vsnifi.properties, as described in "Common Configuration" (page 3-57).
3. Add and configure a processor, such as TailFile or GetFile, to read input data from
a specific input file or directory.
4. Connect the SUCCESS relationship from your chosen input processor to the
SecureDataProcessor you added and configured above.
5. Add and configure a processor, such as PutFile, to write output data to a specific
directory.
NOTE: Remember that defined downstream relationships for all processors must be
auto-terminated or connected to another processor.
The following screenshot shows the SecureDataProcessor configured to receive input from
an upstream GetFile processor and pass successful results to a downstream PutFile
processor.
Start the processors in your new NiFi workflow and then exercise it in the same way as
described for the pre-configured workflow in "Exercising the SecureDataExample Workflow"
(page 8-6).
Limitations and Simplifications of the NiFi Developer Template
Note the following known limitations and simplifications in the NiFi Developer Template
(intentional so as to keep the SecureDataProcessor code focused on its core functionality
of performing cryptographic operations):
9 StreamSets Integration
The StreamSets Developer Template demonstrates how to integrate Voltage SecureData data
protection technology in the context of StreamSets. This demonstration includes the use of the
Simple API (version 4.0 and greater) and the REST API.
StreamSets provides an obvious integration opportunity in the form of its individual processors.
StreamSets processors provide discrete processing steps in a flow of data. The SDProcessor
StreamSets processor provided with the StreamSets Developer Template serves as an example
of a StreamSets processor that can be configured to either protect or access the data flowing
through it, and works in conjunction with the Java packages in the common infrastructure. This
chapter provides a description of the former as well as instructions on how to run the
StreamSets Developer Template in two different ways using the provided sample data. For
more information about the common infrastructure used by the Developer Templates, see
Chapter 3, “Common Infrastructure”.
NOTE: The StreamSets Developer Template comes with its own sample data. It has been
simplified even further for demonstration purposes, with just a single type of data, such as
credit card numbers, Social Security numbers, or email addresses, provided in each input file,
one data value per line. For more information about these sample data files, see "Sample
Data for the StreamSets Developer Template" (page 9-11).
Much of the documentation related to the global configuration settings relevant to the
StreamSets Developer Template is provided in Chapter 3 in the sections "Common
Configuration" (page 3-57) and "Configuration Settings" (page 3-5). These sections provide
information about the common infrastructure Java classes used to read and create in-memory
copies of the settings, as well as a description of the individual settings. This chapter will review
these global configuration settings in the context of the StreamSets Developer Template as
well as provide information about configuring a SDProcessor using the StreamSets user
interface.
• Quick Start Using the Provided StreamSets Pipelines (page 9-2) - This section provides
instructions for building, deploying, and running the StreamSets Developer Template
using the provided pre-configured pipeline(s) with the provided sample data and using
the public-facing Voltage SecureData Server dataprotection hosted by Micro Focus
Data Security.
• Configuration Settings for the StreamSets Developer Template (page 9-10) - This
section reviews the global configuration settings that are relevant to the StreamSets
Developer Template and explains the properties set for individual instances of an
SDProcessor using the StreamSets user interface.
• Sample Data for the StreamSets Developer Template (page 9-11) - This section
provides a description of the simplified sample data provided for the StreamSets
Developer Template.
• Adding the Voltage SDProcessor to a Blank Pipeline (page 9-12) - This section
provides instructions for adding the SDProcessor to a blank pipeline and connecting it
to an appropriate upstream origin or processor and downstream processor or
destination.
• Limitations of the StreamSets Developer Template (page 9-21) - This section provides
information about the type of improvements you will need to make in order to create a
production-grade StreamSets processor that integrates calls to the Voltage SecureData
APIs.
Quick Start Using the Provided StreamSets Pipelines
After you have installed the Simple API (see the Voltage SecureData Simple API Installation
Guide) and installed and built the StreamSets Developer Template (see "Installing the
Developer Templates" on page 2-7 and "Building the Datastream Developer Templates" on
page 2-19), follow these steps to deploy and run the provided StreamSets pipeline(s) and
exercise the SDProcessor using the provided sample data and the public-facing Voltage
SecureData Server dataprotection hosted by Micro Focus Data Security:
Then, follow these steps to deploy and prepare to run one or the other of the provided
StreamSets pipelines using the provided sample data and the public-facing Voltage SecureData
Server dataprotection hosted by Micro Focus Data Security:
Source directory:
<install_dir>/stream/streamsets_processor/target
This step deploys the Voltage SecureData StreamSets processor you built by following
the instructions in "Building the Datastream Developer Templates" (page 2-19).
Copy the entire Voltage SecureData StreamSets processor’s default configuration file
directory (containing the XML configuration files vsconfig.xml and vsauth.xml)
from the following source location to the following target location on your StreamSets
host file system:
Care should be taken to protect the XML configuration files, and especially the
authentication credentials in vsauth.xml, in this target location. The group and
user access control to this directory should be restricted to the sdc user only (under
which the StreamSets service is executed).
Create the following expected input and output directories used by the SDProtect and
SDAccess pipelines on the StreamSets host:
/tmp/voltage/datain
/tmp/voltage/dataout
Make sure that the user that runs the StreamSets service, sdc, has read/write
permission for both of these directories.
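A minimal sketch of creating the directories and granting ownership to the sdc user (the group name sdc is an assumption; adjust it for your installation):
sudo mkdir -p /tmp/voltage/datain /tmp/voltage/dataout
sudo chown -R sdc:sdc /tmp/voltage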
When you are working on the StreamSets host, the StreamSets user interface is usually
launched in a compatible browser from the following URL:
http://<host>:18630
Pipelines: <install_dir>/stream/streamsets_processor/sample/pipelines
Follow these steps to exercise the SDProtect pipeline, provided with the StreamSets Developer
Template:
1. Copy the plaintext sample file plaintext.csv from the following source directory to
the target directory created in Step 4 in "Quick Start Using the Provided StreamSets
Pipelines" (page 9-2) above:
Source directory:
<install_dir>/stream/streamsets_processor/sample/data
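For example (an illustrative command, assuming the default input directory created earlier):
cp <install_dir>/stream/streamsets_processor/sample/data/plaintext.csv /tmp/voltage/datain/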
2. In the StreamSets user interface, in the Create New Pipeline dropdown menu, choose
Import Pipeline. The Import Pipeline dialog box appears:
3. Give the pipeline to be imported the title SDProtect and a description and then browse
for and choose the saved pipeline SDProtect.json in the following directory:
<install_dir>/stream/streamsets_processor/sample/pipelines
The imported pipeline will be shown with InputDir as the origin, connected to Voltage
SDProcessor (configured for a protect operation) as the processor, in turn connected to
OutputDir as the destination.
4. Optionally, validate that the SDProtect sample pipeline is ready to run by clicking the
Preview button in the StreamSets user interface:
In the Preview Configuration dialog box, accept the defaults and click Run Preview.
The first ten records from the file plaintext.csv in the datain directory will be
processed and displayed (but not written to the output directory dataout). When
ready, dismiss the preview.
For more information, see "Sample Data for the StreamSets Developer Template" (page
9-11).
5. Run the SDProtect sample pipeline by clicking the Start button in the StreamSets user
interface:
The SDProtect pipeline will begin running and process the records in the file
plaintext.csv in the datain directory.
6. Examine the results of the SDProtect sample pipeline by looking for the output file in
the output directory, dataout, created in Step 4 in "Quick Start Using the Provided
StreamSets Pipelines" (page 9-2) above.
While the pipeline is running, the output will be in a file named _tmp_ciphertext_0.
After the pipeline is stopped, this file will be renamed to ciphertext_<UniqueID>,
where <UniqueID> is a long unique identifier. The pipeline will probably take a few
minutes to process the 10,000 records in the input file.
The output file will contain the same columns as the input file plaintext.csv, with the
first two columns of each record (the credit card number and the Social Security number)
protected, that is, with the plaintext values replaced by computed ciphertext values, as
specified by the configuration of the processor component (Voltage SDProcessor) of
the SDProtect sample pipeline.
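For example, you can inspect the protected output once the pipeline has processed the file (illustrative commands; the exact output filename varies as described above):
ls /tmp/voltage/dataout
head -3 /tmp/voltage/dataout/ciphertext_*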
Exercise the SDAccess pipeline in the same way, taking the following extra steps and making
the following relevant changes:
• Import the saved pipeline from the file SDAccess.json (instead of SDProtect.json)
and give it the title SDAccess (instead of SDProtect).
By default, these settings are configured to use the public-facing Voltage SecureData Server
dataprotection, hosted by Micro Focus Data Security for demonstration purposes. If you are
using this Voltage SecureData Server during your initial experimentation with the StreamSets
Developer Template, no changes are necessary (except, perhaps, to the install location of the
Simple API). However, you will need to copy these XML configuration files to the directory
expected by the sample pipelines on the StreamSets host (/etc/sdc/vsconfig).
NOTE: After you begin using your own Voltage SecureData Server, you will need to change
the configuration settings in these XML configuration files and copy them (again) to the
directory /etc/sdc/vsconfig or whatever directory your StreamSets processor is
configured to use.
As you adapt the StreamSets Developer Template code for your own purposes, you are, of
course, free to change how this configuration information is provided to your StreamSets
processor(s) when initializing the Voltage SecureData APIs you intend to use.
For more information about the parameters that the Voltage SecureData processor for
StreamSets expects to find in the XML configuration files vsconfig.xml and vsauth.xml
(which do not extend the common configuration properties used by all of the Developer
Templates), see "Common Configuration" (page 3-57).
Integration Architecture of the StreamSets Developer Template
StreamSets provides an obvious integration mechanism in the form of its extensible processor
architecture. The StreamSets Developer Template uses this approach by providing a set of
classes that implement a Voltage SecureData processor for use with StreamSets. When
deployed, the Voltage SecureData processor becomes available for use in the StreamSets user
interface.
When the StreamSets Developer Template is built, a Maven plug-in is used to package the
required JAR files (and potentially other types of resources) into a .tar.gz archive file for
deployment. The archive file sdprocessor-1.0-SNAPSHOT.tar.gz will typically contain the
following JAR files:
httpclient-4.5.10.jar
httpcore-4.4.12.jar
sdprocessor-1.0-SNAPSHOT.jar
simpleapi-1.0.jar
vscryptofactory-1.0.jar
vsrestclient-1.0.jar
vs-stream-common-1.0.jar
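If you want to confirm that a built archive contains the JAR files listed above before deploying it, a standard tar listing will show its contents:
tar -tzf sdprocessor-1.0-SNAPSHOT.tar.gz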
In the StreamSets user interface, the SDProcessor processor includes a SecureData Settings
tab with settings for the operation type (Protect or Access), the directory in which the standard
Developer Templates XML configuration files, vsconfig.xml and vsauth.xml, are located,
and a list of one or more pairs of mappings from fields to be processed to the cryptId containing
the cryptographic parameters that control that processing. The cryptographic parameters
include the data protection format, which can specify either Format-Preserving Encryption
(FPE) or Secure Stateless Tokenization™ (SST).
NOTE: When the cryptId specifies an SST format, it must also specify the REST API (the
Simple API does not support SST operations and, as of version 5.0 of the Hadoop Developer
Templates, the SOAP API is not supported).
The Voltage SecureData processor classes provided with the StreamSets Developer Template
make use of common infrastructure provided with the Developer Templates for retrieving
global file-based configuration information, providing data translation and a cryptographic
abstraction layer, as well as a REST client. The StreamSets Developer Template provides the
following Java package:
The Voltage SecureData processor classes in this package implement the user interface for the
processor as well as the processing that integrates the Voltage SecureData APIs using the
cryptographic abstraction shared by the Developer Templates and the configuration settings in
the XML configuration files vsconfig.xml and vsauth.xml.
The Voltage SecureData processor classes for StreamSets use the classes in this package to
read global configuration settings from the configuration files vsauth.xml and
vsconfig.xml.
As shipped, the Voltage SecureData processor classes for StreamSets do not require any
configuration settings other than the common global configuration settings that are read and
established in-memory by the package com.voltage.securedata.config, as described in
"Common Configuration" (page 3-57). However, whereas the Voltage SecureData for Hadoop
Developer Templates read these configuration files from HDFS, the DataStream Developer
Templates (including the StreamSets Developer Template) read these configuration files from
a local directory.
For more information about this shared Java package, see "Shared Code for the DataStream
Developer Templates" (page 3-63).
/var/log/sdc/sdc.log
The logs contain general informational and debugging messages which will be helpful when
troubleshooting. Consult the log files if you encounter any errors when running the StreamSets
Developer Template.
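For example, to scan the end of the log for recent errors from the command line (using the log path shown above):
tail -n 500 /var/log/sdc/sdc.log | grep -iE "error|exception"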
The StreamSets Developer Template uses the standard XML configuration files defined by, and
used throughout, the Developer Templates: vsconfig.xml and vsauth.xml.
There are two types of configuration settings used by the StreamSets Developer Template:
NOTE: Unlike some of the other Developer Templates, the StreamSets Developer Template
does not use the fieldMappings element in the XML configuration file vsconfig.xml.
Instead, fields in incoming records are mapped directly to the cryptIds defined in this same
configuration file, using the SecureData Settings tab in the StreamSets user interface for
the Voltage SecureData processor.
Also, unlike some of the other Developer Templates, the StreamSets Developer Template never
uses a template-specific configuration file, such as vsstreamsets.xml. A custom
product and version for the clientId element can only be provided in the XML
configuration file vsconfig.xml (when not provided, the default client identifier product is
set to SecureData-Streamset-Processor and no default version is set).
The StreamSets Developer Template does not require any custom configuration settings
beyond what can be processed using the common configuration infrastructure provided for
common use throughout the Developer Templates. For more information about these settings,
see "Common Configuration" (page 3-57), "Configuration Settings" (page 3-5), "XML
Configuration Files" (page 3-32), and the comments in the configuration files themselves.
Before you begin to modify the StreamSets Developer Template’s XML configuration files for
your own purposes, such as using your own Voltage SecureData Server or different data
formats, Micro Focus Data Security recommends that you first run the StreamSets Developer
Template sample pipeline(s) as provided, giving you assurance that your StreamSets
installation is configured correctly and functioning as expected.
You will need to update many of the values in the XML configuration files vsauth.xml and
vsconfig.xml in order to protect your own data using your own Voltage SecureData Server.
Operation Type
The Operation Type setting allows you to choose between protect and access operations for
this instance of the Voltage SecureData processor for StreamSets. By default, the processor’s
operation type is set to PROTECT. You can use the dropdown control to change it to ACCESS
when appropriate.
Config Location
The Config Location setting allows you to specify the directory in which the Voltage
SecureData processor for StreamSets will look for the XML configuration files vsauth.xml and
vsconfig.xml. The default value for this setting is /etc/sdc/vsconfig, but you can
change it as required for your environment.
The StreamSets Developer Template includes two files of related sample data
(plaintext.csv and ciphertext.csv) that you can use with the included sample pipelines
(saved in the files SDProtect.json and SDAccess.json). Each of these sample data files
includes 10,000 records, each with seven comma-separated fields.
The first and second fields in each record represent credit card and Social Security numbers,
respectively, and it is precisely these two fields that are different between the two sample data
files. The file ciphertext.csv is the result of cryptographically processing the first and
second fields in each record in the file plaintext.csv using cryptIds named cc and ssn,
respectively, and according to the cryptographic parameters supplied in the XML configuration
files vsconfig.xml and vsauth.xml, included with the StreamSets Developer Template
(which uses public-facing Voltage SecureData Server dataprotection hosted by Micro
Focus Data Security).
This means that the output file produced by running the sample pipeline SDProtect with the
sample data in plaintext.csv will contain the same data as the provided sample data file
ciphertext.csv. Likewise, the output file produced by running the sample pipeline
SDAccess with the sample data in ciphertext.csv will contain the same data as the
provided sample data file plaintext.csv. For instructions about how to run these sample
pipelines, see "Quick Start Using the Provided StreamSets Pipelines" (page 9-2).
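As a quick sanity check before running the pipelines, you can confirm the shape of the sample data from the command line (run from the directory containing the sample data files):
wc -l plaintext.csv ciphertext.csv              # each file should report 10,000 lines (one record per line)
head -1 plaintext.csv | awk -F',' '{print NF}'  # should print 7 (comma-separated fields)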
This section describes the steps required to create a StreamSets pipeline from scratch that
includes the Voltage SDProcessor. It is divided into a number of logical sub-sections.
1. On the Pipelines page of the StreamSets user interface, click the Create New Pipeline
button:
2. In the New Pipeline dialog box, provide an appropriate Title and Description, and then
click Save.
3. In the StreamSets user interface, in the Configuration -> Error Records tab, in the
Error Records dropdown menu, choose Discard (Library: Basic):
2. On the command line, create the input directory /tmp/voltage/mydatain and set its
ownership to sdc for user and group:
mkdir /tmp/voltage/mydatain
chown sdc:sdc /tmp/voltage/mydatain
3. In the StreamSets user interface, confirm that the origin Directory 1 is selected, and
then in the Configuration -> Files tab, provide the full path to the input directory as the
Files Directory and *.csv as the File Name Pattern:
4. In the Configuration -> Data Format tab, in the Data Format dropdown menu, choose
Delimited:
5. In the Configuration -> Post Processing tab, optionally make a File Post Processing
choice other than None.
NOTE: The default, None, leaves input files in place, but records them as already
processed.
1. Using the Select Processor to connect dropdown menu, choose Voltage SDProcessor:
2. On the command line, copy an appropriate input file to the input directory created
above (/tmp/voltage/mydatain). For example, from the <install_dir>/
stream/streamsets_processor directory:
cp ./sample/data/plaintext.csv /tmp/voltage/mydatain
3. In the StreamSets user interface, confirm that the processor Voltage SDProcessor 1 is
selected, and then in the Configuration -> SecureData Settings tab, leave the
Operation Type and Config Location fields set to their default values (PROTECT and
/etc/sdc/vsconfig, respectively) and then set two fields to be protected by typing
the following values into the indicated fields:
Fields in records are specified using a zero-based field position preceded by the slash
character (/) and cryptIds are specified using their names. In this case, the first field in
each record (/0) is a credit card number, processed using the cryptId named cc, and the
second field in each record (/1) is a Social Security number, processed using the cryptId
named ssn. To display the second row for entering /1 and ssn, click the plus sign (+) to
the right of the current last row.
1. Using the Select Destination to connect dropdown menu, choose Local FS:
2. On the command line, create the output directory /tmp/voltage/mydataout and set
its ownership to sdc for user and group:
mkdir /tmp/voltage/mydataout
chown sdc:sdc /tmp/voltage/mydataout
3. In the StreamSets user interface, confirm that the destination Local FS 1 is selected, and
then in the Configuration -> Output Files tab, provide the full path to the output
directory as the Directory Template:
4. In the Configuration -> Data Format tab, in the Data Format dropdown menu, choose
Delimited:
In the Preview Configuration dialog box, accept the defaults and click Run Preview.
2. The first ten records from the file plaintext.csv in the mydatain directory will be
displayed as they will be output by the Directory 1 stage:
4. Click on the stage Local FS 1 to display the first ten records as input to the Local FS 1
stage:
2. Your new pipeline will begin running and display a variety of statistics about its
processing:
It will probably take several minutes to process the 10,000 records in the sample input
file.
Using the default settings for output file naming, your output file with protected credit
card and Social Security numbers will be named _tmp_sdc-<unique_id> while your
pipeline is running and then be renamed to sdc-<unique_id> after your pipeline is
stopped.
An issue related to StreamSets file security remains unresolved. Specifically, when the
SDProcessor processor either A) attempts to load the Simple API JNI library, which contains
the underlying cryptographic code written in the C programming language, or B) attempts to
load the XML configuration files from the directory /etc/sdc/vsconfig, an
AccessControlException exception can be thrown. For example:
It is important for your StreamSets administrator to set permissions such that the SDProcessor
processor can access these resources.
If necessary, you can work around this issue by disabling the StreamSets security manager by
setting the property SDC_SECURITY_MANAGER_ENABLED to false in one of the following two
files:
File: /opt/streamsets-datacollector/libexec/_sdc
Setting: SDC_SECURITY_MANAGER_ENABLED=false
File: /opt/streamsets-datacollector/libexec/sdc-env.sh
After making this change, restart your StreamSets service (for example, on CentOS: service
sdc restart) and confirm that you see the following warning in the StreamSets log file:
NOTE: If you see the following Security Manager message in this log file, the Security
Manager is not disabled:
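As a quick check, the following commands can confirm where the property is defined and whether Security Manager messages still appear in the log (paths as shown above):
grep -n "SDC_SECURITY_MANAGER_ENABLED" /opt/streamsets-datacollector/libexec/_sdc /opt/streamsets-datacollector/libexec/sdc-env.sh
grep -i "security manager" /var/log/sdc/sdc.log | tail -5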
10 Kafka Connect Integration
The Kafka Connect Developer Template demonstrates how to integrate Voltage SecureData
data protection technology in the context of the Apache Kafka Connect API. This is achieved by
providing a Voltage SecureData implementation of the Kafka Connect Transformation
interface. This custom transformation class works in conjunction with the Java packages in the
common infrastructure to call the underlying Voltage SecureData APIs, as follows:
• Using a custom implementation of the Transformation interface with the built-in source
connector class FileStreamSource, protect the specified data as it is being written from an
external data source to a Kafka topic.
• Using a custom implementation of the Transformation interface with the built-in sink
connector class FileStreamSink, access the specified data as it is being read from a
Kafka topic to an external data source.
Note that due to the design of Kafka Connect, processing in the Kafka Connect Developer
Template occurs one record at a time. Further, because the HoistField built-in transformation
is used to retrieve each record, the entire line in the sample data file is treated as a single
value. So, as delivered and for the sake of simplicity, each line in the sample data files consists of
a single value to be protected. For more information about these sample data files, see "Sample
Data for the Kafka Connect Developer Template" (page 10-11).
For more information about the common infrastructure used by all of the Developer Templates,
including the Kafka Connect Developer Template, see Chapter 3, “Common Infrastructure”.
• The Kafka Connect Developer Template includes several Java Properties configuration
files, located in the bin directory, that are defined for use by Kafka Connect scripts, and
specifically by the script for running Kafka Connect in stand-alone mode (connect-
standalone). These configuration files define characteristics such as the Connector to
use and the transformations it will perform, and the converters to be used to serialize
the data to a standard format before it is written to, or read from, the specified Kafka
topic.
• The custom Kafka Connect transformation classes for protecting and accessing data,
provided with the Kafka Connect Developer Template, rely on the global settings from
the two standard configuration files vsauth.xml and vsconfig.xml for information
required to interact with the underlying Voltage SecureData APIs. The documentation
related to the global configuration settings in these files that are relevant to the Kafka
Connect Developer Template is provided in several places in this document:
• The section "Shared Code for the DataStream Developer Templates" (page 3-
63) provides information about the Java classes used by the DataStream
Developer Templates, including the Kafka Connect Developer Template, to wrap
the common infrastructure Java classes described above.
This chapter will review these global configuration settings in the context of the Kafka Connect
Developer Template.
• Integration Architecture of the Kafka Connect Developer Template (page 10-2) - This
section explains the Java classes that are specific to the Kafka Connect Developer
Template including the classes that implement the Transformation interface.
• Configuration Settings for the Kafka Connect Developer Template (page 10-5) - This
section reviews the global configuration settings that are relevant to the Kafka Connect
Developer Template.
• Sample Data for the Kafka Connect Developer Template (page 10-11) - This section
provides a description of the simplified sample data provided for the Kafka Connect
Developer Template.
• Running the Kafka Connect Developer Template (page 10-12) - This section provides
instructions for running the Kafka Connect Developer Template as provided.
• Limitations of the Kafka Connect Developer Template (page 10-17) - This section
provides information about the type of improvements you will need to make in order to
create production-grade Kafka Connect transformations that integrate calls to the
Simple API and/or the REST API.
Kafka Connect provides an obvious integration mechanism in the form of its extensible
transformation architecture. The Kafka Connect Developer Template uses this approach by
providing Voltage SecureData protect and access transformation classes, implementing the
Kafka Connect interface Transformation. These transformation classes can then be included
in a list of transforms performed by a Kafka Connect source connector or sink connector. As
specified in the Kafka Connect protect and access Java Properties files, the custom protect and
access transformation classes protect plaintext being written into a Kafka topic using a
FileStreamSource connector and access ciphertext being read from a Kafka topic using a
FileStreamSink connector, respectively. They use Format-Preserving Encryption (FPE) or
Secure Stateless Tokenization™ (SST). These classes can be configured with the standard FPE
and SST parameters: format, identity, and authentication credentials using the standard XML
configuration files vsconfig.xml and vsauth.xml. They can also be configured to perform
the protect or access operations using one of two different SecureData data protection APIs:
the Simple API or the REST API (as of version 5.0 of the Hadoop Developer Templates, the
SOAP API is not supported). Note that only the REST API can be used for SST processing.
The Voltage SecureData protect and access transformation classes provided with the Kafka
Connect Developer Template make use of common infrastructure provided with the Developer
Templates for retrieving global file-based configuration information, providing data translation
and a cryptographic abstraction layer, as well as a REST client. The Kafka Connect Developer
Template provides the following Java package:
The Voltage SecureData protect and access classes in this package look up the fields specified
for the Kafka Connect transform to which they have been assigned as its type (a minimal
illustrative sketch of such a transformation follows the class descriptions below). These fields (a
single field in the provided sample workflow) are then looked up in the field mappings for the
kafka-connect component in the XML configuration file vsconfig.xml. These mappings
provide the name of a corresponding cryptId, which in turn provides the various cryptographic
settings (format, identity, authentication credentials, and so on) required to perform the protect
or access operation using the configured underlying Voltage SecureData API.
• BaseField - This abstract class implements the aspects of the Kafka Connect
Transformation interface that can be shared between the ProtectField and
AccessField classes described below. This includes the determination about whether
the data can be processed with or without a schema as well as the initialization of the
CryptoFactory object.
• ProtectField - This abstract class implements the aspects of the Kafka Connect
Transformation interface specific to the protect transformation, including the two
inner concrete classes, Key and Value, and their methods called by code in the base
class BaseField.
• AccessField - This abstract class implements the aspects of the Kafka Connect
Transformation interface specific to the access transformation, including the two
inner concrete classes, Key and Value, and their methods called by code in the base
class BaseField.
The Voltage SecureData protect and access transformation classes use the classes in this
package to read global configuration settings from the configuration files vsauth.xml and
vsconfig.xml.
As shipped, the Voltage SecureData transformation classes for Kafka Connect do not require
any configuration settings other than the common global configuration settings that are read
and established in-memory by the package com.voltage.securedata.config, as
described in "Common Configuration" (page 3-57). However, whereas the Voltage SecureData
for Hadoop Developer Templates read these configuration files from HDFS, the DataStream
Developer Templates (including the Kafka Connect Developer Template) read these
configuration files from a local directory.
For more information about this shared Java package, see "Shared Code for the DataStream
Developer Templates" (page 3-63).
Transformations are specified within Java properties files, the format of which is defined by
Kafka Connect, that are provided to and read by Kafka Connect scripts, such as the Kafka
Connect script for stand-alone execution: connect-standalone.sh. This script expects two
or more Java Properties files as command line parameters, the first one specifying the key/
value pairs required by the Kafka Connect worker(s). The one or more subsequent Java
Properties files specified on the command line each define a connector to be started. The
connector key-value pairs include a name for this connector instance, the associated Java class,
values expected by that Java class, and general Kafka Connect parameters such as the
maximum number of tasks and the relevant topic(s).
For more information about the Kafka Connect configuration information specified using Java
Properties files, see "Configuration Settings for the Kafka Connect Developer Template" (page
10-5).
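For reference, the underlying invocation that the template's wrapper script ultimately performs has this general form (the file names are those described in this chapter; the path to the Kafka bin directory depends on your installation):
<kafka_bin_dir>/connect-standalone.sh connect-standalone-worker.properties \
    connect-file-source-protect.properties connect-file-sink-access.properties connect-file-sink.properties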
NOTE: When using distributed mode, log messages are written using a REST server that
Kafka Connect launches at <rest-server-address>:<port>/logs.
The logs contain general informational and debugging messages which will be helpful when
troubleshooting. Consult the log files if you encounter any errors when running the Kafka
Connect Developer Template.
The Kafka Connect Developer Template uses Java Properties configuration files defined by
Kafka Connect as well as the standard XML configuration files defined by, and used throughout,
the Developer Templates. This section provides information about both of these types of
configuration files.
connect-standalone-worker.properties
The Java Properties file connect-standalone-worker.properties contains key/value
pairs required by the Kafka Connect worker that is started with Kafka Connect in stand-alone
mode. This Java Properties file is the first command line parameter to the Kafka Connect
Developer Template script run-kafka-connect-protect-transform and passed through
as the first command line parameter to the Kafka Connect script connect-standalone.sh.
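A minimal stand-alone worker file typically contains entries like the following. These are standard Kafka Connect settings shown for orientation; the values shipped with the template may differ:
bootstrap.servers=hostname1.domain.com:1234,hostname2.domain.com:1234
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=false
value.converter.schemas.enable=false
offset.storage.file.filename=/tmp/connect.offsets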
connect-file-source-protect.properties
The Java Properties file connect-file-source-protect.properties contains key/value
pairs required by the Kafka source connector that the Kafka Connect Developer Template uses
to protect data as it is written from a file in the local file system to a Kafka topic. It uses the
connector class FileStreamSource with a single task and an input file in the sample
directory named ssn.txt. It writes to the Kafka topic specified as the parameter to the script
create-kafka-topic. It specifies a sequence of three transforms to the data as it is written:
transforms=MakeMap,InsertSource,Protect
1. MakeMap - Using the arbitrary name MakeMap, this transform is implemented by the
built-in transformation HoistField, which will place each line from the specified input
file as a JSON value (HoistField$Value) using the field name ssn
(transforms.MakeMap.field=ssn). For example:
{"ssn" : "675-03-4941"}
NOTE: This transform is not required by the Kafka Connect Developer Template
sample. The first and third transforms, MakeMap and Protect, can operate correctly
with or without it being included.
3. Protect - Using the arbitrary name Protect, this transform is implemented by the
Kafka Connect Developer Template custom transformation ProtectField, which will
perform the specified protect operation(s) on the corresponding value(s). As delivered,
the Protect transform performs a single protect operation on the value in the JSON
field ssn, as shown above.
transforms.Protect.type=com.voltag...nnect.ProtectField$Value
transforms.Protect.fields=ssn
As implied by the plural form fields, the custom Protect transform is designed to
protect more than one field value per record when appropriate. Multiple fields can be
specified by providing a list of comma-separated field names (without whitespace).
NOTE: To keep the Kafka Connect Developer Template as simple as possible, the
FileStreamSource connector is used. This connector can only read a single file
at a time and each row in that file is turned into a string. The first specified
transformation, named MakeMap, uses the built-in transformation class
HoistField. This transformation makes a map, associating that string (as the
value) with the field name ssn. A more complex connector might be able to read
associated values from multiple files, or extract multiple values from a single row
in one file, in order to create a JSON record that contains multiple fields to be
protected by the time it reaches the Protect transform.
At the end of the Protect transform, the JSON record, prior to being written to the
Kafka topic, will look much the same, with the Social Security number protected using
the built-in FPE format ssn (as specified by the field-to-cryptId-to-format mapping in
the XML configuration file vsconfig.xml):
{"ssn" : "783-91-4941", "data_source" : "ssn.txt"}
In the provided sample workflow, as specified in "Steps to Run the Kafka Connect Developer
Template" (page 10-13), this Java Properties file is provided as the second command line
parameter to the Kafka Connect Developer Template script
run-kafka-connect-protect-transform and passed through as the second command
line parameter to the Kafka Connect script connect-standalone.sh.
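Putting the settings described above together, the source connector file has roughly the following shape. The connector name is illustrative, and the type entries for the Protect and InsertSource transforms (whose class names are shown only in abbreviated form in this chapter) are omitted:
name=voltage-file-source-protect
connector.class=FileStreamSource
tasks.max=1
file=../sampledata/ssn.txt
topic=ssn-protect-connect
transforms=MakeMap,InsertSource,Protect
transforms.MakeMap.type=org.apache.kafka.connect.transforms.HoistField$Value
transforms.MakeMap.field=ssn
transforms.Protect.fields=ssn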
connect-file-sink-access.properties
The Java Properties file connect-file-sink-access.properties contains key/value
pairs required by the Kafka sink connector that the Kafka Connect Developer Template uses to
access data as it is read from a Kafka topic to a file in the local file system. It uses the connector
class FileStreamSink with a single task and an output file named ssn-access-sink.txt.
It reads from the Kafka topic specified as the parameter to the script create-kafka-topic. It
specifies a single transform to the data as it is read:
• Access - Using the arbitrary name Access, this transform is implemented by the Kafka
Connect Developer Template custom transformation AccessField, which will perform
the specified access operation(s) on the corresponding value(s). As delivered, the
Access transform performs a single access operation on the value in the JSON field
ssn:
transforms=Access
transforms.Access.type=com.voltag...onnect.AccessField$Value
transforms.Access.fields=ssn
As with the custom Protect transform and as implied by the plural form fields, the
custom Access transform is designed to access more than one field value per record
when appropriate. Multiple fields can be specified by providing a list of comma-
separated field names (without whitespace).
NOTE: As explained in the note for the Protect transformation above, this sample
workflow, as provided, accesses just a single field as each record is read from the
specified Kafka topic.
At the end of the Access transform, the JSON record, prior to being written to the
specified file in the local file system, will look much the same, with the Social Security
number accessed using the built-in FPE format ssn (as specified by the field-to-cryptId-
to-format mapping in the XML configuration file vsconfig.xml). For example:
{"ssn" : "675-03-4941", "data_source" : "ssn.txt"}
In the provided sample workflow, as specified in "Steps to Run the Kafka Connect Developer
Template" (page 10-13), this Java Properties file is provided as the third command line
parameter to the Kafka Connect Developer Template script
run-kafka-connect-protect-transform and passed through as the third command line
parameter to the Kafka Connect script connect-standalone.sh.
connect-file-sink.properties
The Java Properties file connect-file-sink.properties contains key/value pairs
required by the Kafka sink connector that the Kafka Connect Developer Template uses to write
the ciphertext, as is, from a Kafka topic to a file in the local file system. It uses the connector
class FileStreamSink with a single task and an output file named
ssn-protect-sink.txt. It reads from the same Kafka topic specified as the parameter to
the script create-kafka-topic. It does not specify any transforms to the data as it is read,
resulting in the ciphertext JSON record being written to the specified file in the local file system.
For example:
{"ssn" : "783-91-4941", "data_source" : "ssn.txt"}
There are three types of configuration settings used by the Kafka Connect Developer
Template:
<fieldMappings>
<fields component="kafka-connect">
<field name = "email" cryptId = "alpha"/>
<field name = "birth_date" cryptId = "date"/>
<field name = "cc" cryptId = "cc"/>
<field name = "ssn" cryptId = "ssn"/>
</fields>
.
.
.
</fieldMappings>
<fields>
<field name = "email" cryptId = "alpha"/>
<field name = "birth_date" cryptId = "date"/>
<field name = "cc" cryptId = "cc"/>
<field name = "ssn" cryptId = "ssn"/>
</fields>
The Kafka Connect Developer Template does not require any custom configuration settings
beyond what can be processed using the common configuration infrastructure provided for
common use throughout the Developer Templates. For more information about these settings,
see "Common Configuration" (page 3-57), "Configuration Settings" (page 3-5), "XML
Configuration Files" (page 3-32), and the comments in the configuration files themselves.
Before you begin to modify the Kafka Connect Developer Template XML configuration files for
your own purposes, such as using your own Voltage SecureData Server or different data
formats, Micro Focus Data Security recommends that you first run the Kafka Connect
Developer Template sample workflow as provided, giving you assurance that your Kafka
installation is configured correctly and functioning as expected.
You will need to update many of the values in the XML configuration files vsauth.xml and
vsconfig.xml in order to protect your own data using your own Voltage SecureData Server.
The Kafka Connect protect and access connector Java Properties files each contain a
commented-out key/value pair that can be used to locate the XML configuration files in a
directory other than the default, by specifying an alternative relative directory path:
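A hypothetical example of such an entry, using the configPath key referenced in the note below (the relative path value shown here is purely illustrative):
#configPath=../config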
If you choose to uncomment and use this alternate configuration file location key/value pair,
remember that the relative directory path you specify will be relative to the current directory on the
local file system when you run Kafka Connect in stand-alone mode. Typically, this is the
following directory:
<install-dir>/stream/kafka_connect/bin
NOTE: In the event that you choose to run Kafka Connect in distributed mode, you must put
the XML configuration files in the same location on every computer on which your Kafka
Connect tasks will run. This is true regardless of whether you use the default configuration
file location or use the configPath key/value pairs to specify an alternate location.
• creditcard.txt - Protect the credit card data in this file using the SST format cc-sst-
6-4 for this input data. This works because in the XML configuration file
vsconfig.xml, the default API (defaultApi="simpleapi") is overridden for this
format to use the REST API, which is required for the SST operations:
<cryptId name="cc" format="cc-sst-6-4" api="rest" />
You could change the format setting above to cc, specifying the built-in FPE format for
credit card numbers and then remove the api attribute and value to revert to the use of
the Simple API for this cryptId.
• ssn.txt - Protect the social security number data in this file using the built-in FPE
format ssn.
• email.txt - Protect the email address data in this file using the FPE format
AlphaNumeric, pre-configured on the dataprotection Voltage SecureData Server
hosted by Micro Focus Data Security.
• date.txt - Protect the date data in this file using the FPE format DATE-ISO-8601,
pre-configured on the dataprotection Voltage SecureData Server hosted by Micro
Focus Data Security.
If you are using your own Voltage SecureData Server to process the sample data in the files
email.txt or date.txt or to tokenize the data in the file creditcard.txt, you will need to
create the corresponding format(s) in your Voltage SecureData Server, as described in
Appendix A, “Voltage SecureData Server Configuration”.
The Kafka Connect Developer Template includes scripts in the directory <install_dir>/
stream/kafka_connect/bin that you use to run the template’s sample workflow. These
scripts provide commands to create a Kafka topic and run Kafka Connect in stand-alone mode
to protect Social Security numbers as they are written to a Kafka topic and to access them as
they are read from that Kafka topic.
Run-Time Prerequisites
To run the Kafka Connect Developer Template, you will need the following services configured
and running:
• Kafka
• Zookeeper
kafkaBrokerList
This property specifies a list of Kafka Brokers in <host>:<port> format, separated by
commas. For example:
hostname1.domain.com:1234,hostname2.domain.com:1234
The administrator who configured your Kafka installation will be able to provide you with this
host and port information for your Kafka Broker hosts. In some cases, the port number may be
found in the listeners property in the Kafka properties file server.properties.
kafkaServerPropsFile
This property specifies the absolute file path to the Kafka Java Properties file
server.properties.
The administrator who configured your Kafka installation will be able to provide you with the
path to this Java Properties file. Note that in some cases this file may be outside the main Kafka
installation path. Some examples of possible locations for this file include:
• /etc/kafka/<version>/0/server.properties
• /opt/cloudera/parcels/KAFKA/etc/kafka/conf.dist/server.properties
If you are not able to find this file on your cluster, the following command may help you locate it:
locate "*kafka*server.properties"
kafkaBinDir
This property specifies the absolute path to the Kafka bin directory, which contains Kafka
scripts such as kafka-topics.sh and connect-standalone.sh. If the Kafka bin directory
is already included in the system PATH variable, do not provide a value for this property.
Otherwise, specify the full path to the directory kafka/bin within your Kafka installation
location, ending with a slash character (/). Example locations:
• /usr/hdp/<version>/kafka/bin/
• /opt/cloudera/parcels/KAFKA/lib/kafka/bin/
NOTE: When you specify a value for this property, it must end with the slash character (/).
If you are not able to find this path for your cluster, the following command may help you locate it:
locate "kafka-topics.sh"
1. Change directory (cd) to the following directory where the Kafka Connect JAR file was
built:
<install_dir>/stream/kafka_connect/target
2. Copy the Kafka Connect JAR file, vs-kafka-connect-1.0.jar, to the Kafka lib
directory for the Kafka distribution you are using:
• HDP: /usr/hdp/<hdp-version>/kafka/lib
• CDH: /opt/cloudera/parcels/kafka/lib
NOTE: The Kafka Connect Developer Template is only supported for HDP and CDH.
It is not supported for MapR and EMR.
3. Change directory (cd) to the following directory to run the scripts in the subsequent
steps:
<install_dir>/stream/kafka_connect/bin
4. Create a Kafka topic to be used by the Kafka Connect Developer Template scripts by
running the following script, choosing a topic name and specifying it as a parameter. For
example:
./create-kafka-topic ssn-protect-connect
The script create-kafka-topic will edit the relevant Kafka Connect Java Properties
files to contain this topic name so that when Kafka Connect is run in stand-alone mode,
the source and sink connectors will know what Kafka topic to write to and read from,
respectively.
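For example, after the command above the connector Java Properties files will contain the chosen topic name (see also "Script Summary" later in this chapter):
topic=ssn-protect-connect       (in connect-file-source-protect.properties)
topics=ssn-protect-connect      (in connect-file-sink-access.properties and connect-file-sink.properties)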
5. Run the following Kafka Connect Developer Template script, including the four Java
Properties files as parameters (shown on multiple lines to improve readability):
./run-kafka-connect-transform
connect-standalone-worker.properties
connect-file-source-protect.properties
connect-file-sink-access.properties
connect-file-sink.properties
This script will start Kafka Connect in stand-alone mode with a Kafka Connect worker
(using the Java Properties file specified as the first parameter) and three Kafka Connect
connectors (one source connector using the Java Properties file specified as the second
parameter and two sink connectors using the Java Properties files specified as the third
and fourth parameters).
NOTE: The first parameter must be the worker Java Properties file, as shown above.
The order of the remaining three connector Java Properties files is arbitrary; only one of
them is strictly necessary, and more than three could be specified.
The sink connectors above will each produce a single output file:
NOTE: The connectors will keep running until the Connect worker is manually
stopped.
7. Delete the relevant Kafka topic by running the following script, specifying the same topic
name as its parameter:
./delete-kafka-topic ssn-protect-connect
8. Delete the Kafka Connect offsets file used when running Kafka in stand-alone mode:
rm /tmp/connect.offsets
• You can repeat steps 1 through 6, performing step 3 twice, launching fewer connectors
each time. For example, the first time, you could run the script run-kafka-connect-
transform with just the first two parameters, protecting the Social Security numbers as
they are written to the Kafka topic, but producing no output files. Then, when you run
the script run-kafka-connect-transform again, run it with just the first and third
parameters. This would retrieve the protected records from the existing Kafka topic,
access them, and write them to the output file ssn-access-sink.txt.
• Make changes such that you are protecting different sample data. For example, to
protect the credit card sample data instead, make the following changes, exactly as
shown, to these Java Properties configuration files:
File: connect-file-source-protect.properties
file=../sampledata/ssn.txt -> file=../sampledata/cc.txt
transforms.MakeMap.field=ssn -> transforms.MakeMap.field=cc
transforms.Protect.fields=ssn -> transforms.Protect.fields=cc
File: connect-file-sink-access.properties:
file=ssn-access-connect.txt -> file=cc-access-connect.txt
transforms.Access.fields=ssn -> transforms.Access.fields=cc
File: connect-file-sink.properties:
file=ssn-access-connect.txt -> file=cc-access-connect.txt
Then run the Kafka Connect Developer Template sample workflow again (steps 1
through 6), specifying a different topic name as the parameter to the script create-
kafka-topic. For example:
./create-kafka-topic cc-protect-connect
Script Summary
Examine the scripts in the bin directory to see how they call Kafka scripts in the context of the
Kafka Connect Developer Template. This section summarizes these scripts.
create-kafka-topic
This script creates a new Kafka topic with the specified name. It also edits the following three
Kafka Connect Java Properties files to specify this topic name as the value of the relevant key:
• connect-file-source-protect.properties: topic=<topic_name>
• connect-file-sink-access.properties: topics=<topic_name>
• connect-file-sink.properties: topics=<topic_name>
Invocation:
./create-kafka-topic <topic_name>
This script expects a single parameter: the name of the Kafka topic to be created.
delete-kafka-topic
This script deletes the Kafka topic with the specified name.
Invocation:
./delete-kafka-topic <topic_name>
This script expects a single parameter: the name of the Kafka topic to be deleted.
run-kafka-connect-protect-transform
After editing the Kafka Connect worker Java Properties file to include the Kafka broker list
provided in the file vsdistrib.properties (described below), this script calls the Kafka
script connect-standalone.sh to start Kafka Connect in stand-alone mode.
This script expects two or more parameters: the name of a Kafka worker Java Properties file
and the names of one or more connector Java Properties files.
vsdistrib.properties
This configuration file is used by the other scripts in the bin directory (described above) to get
the required Kafka installation settings from a single place in order to avoid redundant editing
in multiple scripts. It defines the following variables:
• kafkaBrokerList
• kafkaServerPropsFile
• kafkaBinDir
For more information, see "Editing the Distribution-Specific Run-Time Settings" (page 10-12).
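An illustrative set of values for this file, using the example locations given earlier in this chapter (your broker list and paths will differ):
kafkaBrokerList=hostname1.domain.com:1234,hostname2.domain.com:1234
kafkaServerPropsFile=/etc/kafka/<version>/0/server.properties
kafkaBinDir=/usr/hdp/<version>/kafka/bin/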
The Kafka Connect Developer Template provides a very basic sample integration, showing
custom Kafka transformation classes that protect and access simple input data. Specifically, two
of the simplifications made in this template are:
This simple input data format is required by the first built-in Kafka transform class
(HoistField) used by the protect transformation with the connector class
FileStreamSource. The HoistField class pulls either key or value data as an entire
file line to use when constructing a JSON record.
Your integration is likely to require more complex processing logic, building JSON
records with multiple related fields and, possibly, protecting or accessing more than one
of those fields. The good news is that the custom Kafka transformation classes
ProtectField and AccessField, provided with the Kafka Connect Developer
Template, are already designed to protect multiple fields in a JSON record being written
to a Kafka topic or access multiple fields in a JSON record being read from a Kafka topic,
respectively. However, to take advantage of this functionality, different transforms than
are demonstrated would need to be used to create the required JSON records. For
example, a JSON record with both Social Security and credit cards numbers, such as the
following:
{"ssn" : "675-03-4941", "cc" : "5225-6290-4183-4450"}
Furthermore, if the data coming into the transformations using the custom Kafka
transformation classes ProtectField and AccessField is not formatted as JSON
records, as shown above, the code in these custom transformation classes will need to
be modified accordingly.
• Simplified Configuration
The Kafka Connect Developer Template reads its configuration and authentication/
authorization settings from XML configuration files in a default location on the local file
system, an approach that may be simpler than your scenario calls for. Your
production integration may require alternative approaches for configuring these
settings, including putting the XML configuration files in a different location on the local
file system. For more information about configuring the location in which the Kafka
Connect protect and access transformation will look for their XML configuration files,
see "Configuration File Location" (page 10-10).
Note that in the event that you choose to run Kafka Connect in distributed mode, you
must put the XML configuration files in the same location on every computer on which
your Kafka Connect tasks will run. This is true regardless of whether you use the default
configuration file location or use the alternate configuration location key/value pairs in
the Kafka Connect protect and access Java Properties files to specify an alternate
location.
With respect to transformations, Kafka Connect inherently does not support batching.
For the Kafka Connect Developer Template, this means that data is protected or
accessed one record (message) at a time. In other words, one line from the sample input
file and/or one line to the output file(s) at a time.
The Kafka Connect Developer Template does not support Voltage SecureData IBSE/
AES protect and access operations.
The Kafka Connect Developer Template does not support Kerberos authentication or
LDAP + Shared Secret authentication/authorization.
The inherent Kafka Connect limitation behind these authentication limitations is that
there is no Kafka Connect API for retrieving the identity of the user who started the
Kafka worker and connectors.
11 Kafka-Storm Integration
The Kafka-Storm Developer Template demonstrates how to integrate Voltage SecureData data
protection technology in the context of Apache Kafka and Apache Storm. This demonstration
includes the use of the Simple API (version 4.0 and greater) and the REST API.
Storm provides an obvious integration opportunity in the form of its bolt technology. Storm
bolts provide discrete processing steps in a flow of data. The Protect Bolt and Access Bolt
provided with the Kafka-Storm Developer Template serve as examples of using Storm bolts to
protect or access, respectively, the data flowing through them. They work in conjunction with
the Java packages in the common infrastructure. This chapter provides a description of these
bolts as well as the overall Storm topology into which the Protect Bolt is integrated. (The Access
Bolt is not demonstrated in the provided Storm topology, but it is nevertheless included in the
Kafka-Storm Developer Template for completeness.)
This simple topology uses an off-the-shelf Storm component, the Kafka Spout, that reads Kafka
records from a particular Kafka topic (in this scenario, a Kafka producer reads input data
from a file, placing each line from the file into the topic as its own record). The Kafka Spout streams data to
be protected to the Protect Bolt, which protects the incoming data using the configured
Voltage SecureData credentials, format, and Voltage SecureData technology. The Protect Bolt,
in turn, streams the protected data to another off-the-shelf Storm component, the HDFS Bolt,
which streams the protected data to a file in the Hadoop Distributed File System (HDFS). The
following illustration shows the basic structure of the Storm topology used by the Kafka-Storm
Developer Template:
NOTE: Both the Kafka Spout and HDFS Bolt are off-the-shelf components provided with
Storm.
This chapter also provides instructions on how to run the Kafka-Storm Developer Template
using the provided sample data.
For more information about the common infrastructure used by all of the Developer Templates,
including the Kafka-Storm Developer Template, see Chapter 3, “Common Infrastructure”.
NOTE: The Kafka-Storm Developer Template comes with its own sample data, distinct from
the sample data provided for the three Hadoop Developer Templates. It has been simplified
even further for demonstration purposes, with just a single type of data, such as credit card
numbers, Social Security numbers, or email addresses, provided in each input file, one data
value per line. For more information about these sample data files, see "Sample Data for the
Kafka-Storm Developer Template" (page 11-10).
The documentation related to the global configuration settings relevant to the Kafka-Storm
Developer Template is provided in Chapter 3, “Common Infrastructure”, in the section
"Common Configuration" (page 3-57). This section provides information about the common
infrastructure Java classes used to read and create in-memory copies of the settings, as well as
a description of the individual settings. This chapter will review these global configuration
settings in the context of the Kafka-Storm Developer Template.
• Configuration Settings for the Kafka-Storm Developer Template (page 11-7) - This
section reviews the global configuration settings that are relevant to the Kafka-Storm
Developer Template.
• Sample Data for the Kafka-Storm Developer Template (page 11-10) - This section
provides a description of the simplified sample data provided for the Kafka-Storm
Developer Template.
• Running the Kafka-Storm Developer Template (page 11-10) - This section provides
instructions for running the Kafka-Storm Developer Template as provided.
Storm provides an obvious integration mechanism in the form of its extensible spout and bolt
architecture. The Kafka-Storm Developer Template uses this approach by providing protect
and access bolts that can be wedged between the Kafka Spout and the HDFS Bolt that are
provided with Storm. The Protect Bolt and Access Bolt protect incoming plaintext or access
incoming ciphertext, respectively, using Format-Preserving Encryption (FPE) or Secure
Stateless Tokenization™ (SST). These bolts can be configured with the standard FPE and SST
parameters: format, identity, and authentication credentials in one of two forms. They can also
be configured to perform the protect or access operations using one of two different
SecureData data protection APIs: the Simple API (version 4.0 and greater) or the REST API
(the latter API can be used for SST processing while the former cannot).
The Protect Bolt and the Access Bolt make use of the common infrastructure provided with
Voltage SecureData for Hadoop for retrieving global file-based configuration information,
providing data translation and a cryptographic abstraction layer, as well as a REST client. The
Kafka-Storm Developer Template provides the following Java packages:
See "Protect and Access Bolts, and Storm Topology" (page 11-3).
Kafka Producer
The following Java package and its associated Java source code provide a class that
implements the Kafka producer that reads input lines from a plaintext data file and streams
each line as a record into a specific Kafka topic.
The class in this package implements a simple Kafka producer that reads values from a
specified file and writes each line in that file as a record to the specified Kafka topic. The names
of the file and topic are provided as command line parameters, along with a list of Kafka
brokers.
The Kafka producer package defines the following class in a .java source file of the same
name:
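As an illustration of the approach such a producer takes, the following minimal, hypothetical sketch uses the standard Kafka client API. The class name, package, and argument order shown here are illustrative and are not the template's actual source:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Hypothetical file-to-topic producer; argument handling is illustrative only.
public class ExampleFileProducer {

    public static void main(String[] args) throws Exception {
        String file = args[0];        // input data file, one value per line
        String topic = args[1];       // target Kafka topic
        String brokers = args[2];     // comma-separated broker list, host:port,...

        Properties props = new Properties();
        props.put("bootstrap.servers", brokers);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
             BufferedReader reader = new BufferedReader(new FileReader(file))) {
            String line;
            while ((line = reader.readLine()) != null) {
                producer.send(new ProducerRecord<>(topic, line));   // each line becomes one record
            }
            producer.flush();
        }
    }
}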
The Kafka-Storm integration uses a Storm bolt to perform data protection processing on one or
more incoming plaintexts. As shipped, it uses an off-the-shelf Kafka spout to read tuples from a
Kafka topic that each contain one item of data to be protected, in a format corresponding to the
configured FPE or SST format. Each tuple is passed to the Voltage SecureData protect bolt,
which protects the data as configured. The protected tuple is then passed to an off-the-shelf
HDFS bolt, which writes the ciphertext or token to its own line in a file in HDFS.
The Storm package defines the following classes in .java source files of the same name:
• AccessBolt - This class implements the Voltage SecureData access bolt, extending
the functionality in its base class BaseBolt. It uses the specified configuration
information to access its ciphertext or token input and sends its output to the
downstream bolt, the HDFS bolt in the case of the Kafka-Storm Developer Template’s
Storm topology.
Note that the access bolt is not included in the Kafka-Storm Developer Template’s
Storm topology as shipped, but is nevertheless included for completeness.
• BaseBolt - This class implements the functionality shared by the protect and access
bolts, serving as their base class. It extends the Storm class BaseRichBolt.
• ProtectBolt - This class implements the Voltage SecureData protect bolt, extending
the functionality in its base class BaseBolt. It uses the specified configuration
information to protect its plaintext input and sends its output to the downstream bolt,
the HDFS bolt in the case of the Kafka-Storm Developer Template’s Storm topology.
• StormTopology - This class implements the Storm topology for the Kafka-Storm
Developer Template. It requires seven command line parameters (and accepts an optional eighth)
that specify the various configurable aspects of the topology. It uses these parameters
to create and set the Kafka Spout, the Voltage SecureData protect bolt, and the HDFS
bolt, and finally, to submit the topology to Storm for execution.
The Voltage SecureData protect and access Storm bolt classes use the classes in this package
to read global configuration settings from the XML configuration files vsauth.xml and
vsconfig.xml.
As shipped, the Voltage SecureData protect and access Storm bolt classes do not require any
configuration settings other than the common global configuration settings that are read and
established in-memory by the package com.voltage.securedata.config, as described in
"Common Configuration" (page 3-57). However, whereas the Voltage SecureData for Hadoop
Developer Templates read these configuration files from HDFS, the DataStream Developer
Templates (including the Kafka-Storm Developer Template) read these configuration files from
a local directory.
For more information about this shared Java package, see "Shared Code for the DataStream
Developer Templates" (page 3-63).
The life cycle of a Storm bolt proceeds as follows:
1. Construction - The bolt is constructed when the topology is built and instantiated, on
the computer from which the topology is submitted to the Storm cluster.
2. Serialization - The bolt instance is serialized and sent to the Storm cluster worker
nodes.
3. Preparation - The prepare method of the bolt instance is called on each worker node,
which initializes it to process tuples.
4. Execution - The execute method of the bolt instance is called on each worker node,
causing it to process each tuple it receives from any upstream spouts or bolts in the
topology, sending its output to the downstream bolt, if any.
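The sketch below illustrates how this life cycle typically shapes a bolt implementation, assuming Storm 1.x package names: state set in the constructor must be serializable because it travels with the bolt instance, while anything that cannot (or should not) be serialized is created in prepare on each worker node. The class and field names are illustrative and are not the template's actual BaseBolt source:

import java.util.HashMap;
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.base.BaseRichBolt;

public abstract class LifeCycleSketchBolt extends BaseRichBolt {

    // Set during construction; serialized with the bolt instance and shipped to the
    // worker nodes, so it must be a serializable type such as HashMap.
    private final HashMap<String, String> settings;

    // Cannot usefully be serialized, so it is transient and assigned in prepare().
    protected transient OutputCollector collector;

    public LifeCycleSketchBolt(HashMap<String, String> settings) {
        this.settings = settings;                   // 1. construction
    }

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;                 // 3. preparation, on each worker node
        // Per-worker initialization (for example, building a crypto helper from the
        // deserialized settings) belongs here.
    }

    // 4. execution: execute(Tuple) is implemented by concrete subclasses.
}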
In the context of this life cycle, SecureData cryptographic processing is integrated into the
Protect and Access Bolts as follows:
1. Construction - During this phase of the bolt life cycle, the Protect Bolt is constructed
with the configuration, authentication, and authorization settings needed to perform the
cryptographic processing. For information about how the Kafka-Storm Developer
Template’s Storm topology reads and uses these settings when constructing an
instance of this bolt, see the method main in the class StormTopology and the
constructor in the class BaseBolt. For information about alternative approaches to
handling the required configuration settings, see "Alternative Approaches to
Configuration" (page 11-8).
2. Serialization - During this phase of the bolt life cycle, the configuration settings
established during construction are serialized along with the bolt instance and sent to
the Storm cluster worker nodes.
3. Preparation - During this phase of the bolt life cycle, the method prepare of the class
BaseBolt initializes the static CryptoFactory instance using the deserialized
configuration settings.
4. Execution - During this phase of the bolt life cycle, the Protect Bolt performs
cryptographic processing on the input tuple. For information about how this is
implemented, see the method execute in the class BaseBolt.
Any errors encountered by the Voltage SecureData protect or access bolts are trapped in the
bolt's execute method, which then acknowledges (ACKs) the tuple and reports the error,
using the following core Storm API calls:
• collector.ack(tuple);
This call acknowledges the tuple so that it isn't retried. This is appropriate for the
Kafka-Storm Developer Template because cryptographic operation failures are very
likely the result of malformed data or incorrect configuration and any retry attempt
would likely just fail in the same way.
• collector.reportError(e);
This call reports the error to the Storm framework so that it is written to the Storm
logs and displayed in the Storm UI.
To see how this error handling is implemented, see the execute method in the BaseBolt
class.
This means that if you publish an invalid data item to the Kafka-Storm Developer Template’s
Kafka topic, such as by interactively using the Kafka console producer, the Voltage SecureData
protect bolt will report the error to the logs and to the Storm UI, and will not send any
corresponding ciphertext or token line to the output file in HDFS.
CAUTION: In general, do not throw an exception from the execute method in this type of
Storm topology. If you do, Storm will consider the bolt instance to have crashed and will re-
create it. This will result in new instances of the downstream HDFS bolt and cause multiple
zero-byte output files to be created in HDFS.
The correct approach, as described above, is to trap and report the error without throwing it
to the caller of the execute method.
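A minimal sketch of this trap-and-report pattern follows, assuming Storm 1.x package names. The class name is illustrative and the protect method is a hypothetical stand-in for the template's actual cryptographic call; see the execute method in BaseBolt for the real implementation:

import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public abstract class TrapAndReportBolt extends BaseRichBolt {

    private transient OutputCollector collector;

    // Hypothetical stand-in for the protect (or access) operation.
    protected abstract String protect(String input) throws Exception;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        try {
            // On success, emit the protected value to the downstream (HDFS) bolt.
            collector.emit(tuple, new Values(protect(tuple.getString(0))));
        } catch (Exception e) {
            // Do not rethrow: report the error so that it reaches the Storm logs and UI.
            collector.reportError(e);
        } finally {
            // ACK the tuple in all cases so that it is not retried.
            collector.ack(tuple);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("protectedValue"));
    }
}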
Configuration Settings for the Kafka-Storm Developer Template
The Kafka-Storm Developer Template uses the standard XML configuration files defined by,
and used throughout, the Developer Templates. This section provides information about these
configuration files.
The Kafka-Storm Developer Template, and the Protect and Access Bolts in particular, use the
same two XML configuration files as all of the Hadoop Developer Templates (other than the
NiFi Developer Template): vsauth.xml and vsconfig.xml. However, whereas some of the
Hadoop Developer Templates read these configuration files from HDFS, the Kafka-Storm
Developer Template reads these configuration files from the local file system.
NOTE: Starting with version 5.0, the Kafka-Storm Developer Template uses XML
configuration files instead of Java Properties configuration files. If the Java Properties
configuration files used in previous versions are present and the newer XML configuration
files are not present, the former will be used.
There are three classes of configuration settings used by the Kafka-Storm Developer Template,
described in the sections that follow.
The Kafka-Storm Developer Template does not require any custom configuration settings
beyond those handled by the common configuration infrastructure provided for use
throughout the Hadoop Developer Templates. For more information about these
settings, see "Common Configuration" (page 3-57), "Configuration Settings" (page 3-5), "XML
Configuration Files" (page 3-32), and the comments in the configuration files themselves.
Before you begin to modify the Kafka-Storm Developer Template XML configuration files for
your own purposes, such as using your own Voltage SecureData Server or different data
formats, Micro Focus Data Security recommends that you first run the Kafka-Storm Developer
Template sample as provided, to confirm that your Kafka and Storm installations are
configured correctly and functioning as expected.
You will need to update many of the values in the XML configuration files vsauth.xml and
vsconfig.xml in order to protect your own data using your own Voltage SecureData Server.
Also, because these configuration settings are read when the Storm topology is first initialized
and submitted, anytime you make changes to either of these configuration files, you must kill
and resubmit the Storm topology for any new settings to be read and used.
CAUTION: This sample approach may not be appropriate in your production Kafka/Storm
integrations. In particular, keep in mind that the Storm topology deployment mechanism
involves serializing the spout and bolt instances and sending them over the network to the
worker nodes in the Storm cluster. These worker nodes deserialize the spout and bolt
instances and use them to execute the operations in the workflow. Depending on the
security settings and isolation of your Storm cluster, it may not be secure to send the
sensitive authentication and authorization credentials over the network in this manner,
especially if this transmission is not protected by a secure cluster configuration.
You can, of course, choose alternative approaches to providing the necessary configuration
information to the spouts and bolts in your production Kafka/Storm integrations, depending on
your specific topology implementation and cluster configuration. Some examples of other
possible approaches for this configuration information are as follows:
• You could copy the relevant configuration files to a local directory on all of the Storm
worker nodes in the cluster using a configuration management tool such as Puppet or
Chef, protecting them using file permission settings. The Protect and Access bolts could
then be re-written to look for these files in a specific local directory on each worker node.
This approach has the advantage that no sensitive authentication and authorization
settings are serialized and sent over the network. Those settings are already waiting for
the Protect Bolt on the worker nodes. On the other hand, the distribution management
tool used to put the configuration files on the worker nodes in advance must itself be
secure. Further, because the Storm workers run as user storm, by default, and not as
the user building and submitting the topology, there could be issues when using file
permission settings to protect these sensitive files: if you limit read access to the
authentication and authorization file, the generic storm user on the worker nodes will
not be able to read it. Note, however, that this limitation may be mitigated by configuring
the Storm workers to run as the user deploying the topology. For information about how
to do this, see the documentation for your Storm installation, and in particular, the
configuration setting supervisor.run.worker.as.user.
See the method loadConfig in the class BaseBolt for an example of how such an
approach may be implemented.
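A hedged sketch of this local-file variant follows; the directory name and XML handling are illustrative assumptions, not the template's actual loadConfig implementation:

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public final class LocalConfigSketch {

    // Hypothetical fixed directory that a tool such as Puppet or Chef would populate on
    // every Storm worker node, protected with file permission settings.
    private static final String CONFIG_DIR = "/etc/voltage/kafka-storm";

    // Intended to be called from prepare() on each worker node, so that no sensitive
    // settings are serialized and sent over the network with the bolt instance.
    public static Document load(String fileName) throws Exception {
        File file = new File(CONFIG_DIR, fileName);   // for example, vsauth.xml or vsconfig.xml
        return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(file);
    }
}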
These approaches are two of many alternative approaches that you can investigate for use in
your scenario. Your approach may be completely different, or perhaps a hybrid of one of the
approaches described here and other possible changes, such as using XML for the
configuration file format.
Sample Data for the Kafka-Storm Developer Template
The Kafka-Storm Developer Template provides the following sample data files:
• creditcard.txt - Protect the credit card data in this file using the built-in FPE format
cc.
You can also use the SST format cc-sst-6-4 for this input data, if you also change the
API Type property to rest, indicating the use of the REST API for the SST operations.
• ssn.txt - Protect the social security number data in this file using the built-in FPE
format ssn.
• email.txt - Protect the email address data in this file using the FPE format
AlphaNumeric, pre-configured on the dataprotection Voltage SecureData Server
hosted by Micro Focus Data Security.
• date.txt - Protect the date data in this file using the FPE format DATE-ISO-8601,
pre-configured on the dataprotection Voltage SecureData Server hosted by Micro
Focus Data Security.
If you are using your own Voltage SecureData Server to process the sample data in the files
email.txt or date.txt or to tokenize the data in the file creditcard.txt, you will need to
create the corresponding format(s) in your Voltage SecureData Server, as described in
Appendix A, “Voltage SecureData Server Configuration”.
Running the Kafka-Storm Developer Template
Run-Time Prerequisites
To run the Kafka-Storm Developer Template, including writing the results of the Storm
topology to HDFS, you will need the following services configured and running on your Hadoop
cluster:
• Kafka
• Storm
• Zookeeper
Editing the Distribution-Specific Run-Time Settings
kafkaBrokerList
This property specifies a list of Kafka Brokers in <host>:<port> format, separated by
commas. For example:
hostname1.domain.com:1234,hostname2.domain.com:1234
The administrator who configured Kafka on your cluster will be able to provide you with this
host and port information for the Kafka Broker hosts in your cluster. In some cases, the port
number may be found in the listeners property in the Kafka properties file
server.properties.
kafkaServerPropsFile
This property specifies the absolute file path to the Kafka properties file server.properties.
The administrator who configured Kafka on your cluster will be able to provide you with the
path to this properties file. Note that in some cases this file may be outside the main Kafka
installation path. Some examples of possible locations for this file include:
• /etc/kafka/<version>/0/server.properties
• /opt/cloudera/parcels/KAFKA/etc/kafka/conf.dist/server.properties
If you are not able to find this file on your cluster, the following command may help you locate it:
locate "*kafka*server.properties"
kafkaBinDir
This property specifies the absolute path to the Kafka bin directory, which contains Kafka
scripts such as kafka-topics.sh. If the Kafka bin directory is already included in the system
PATH variable, do not provide a value for this property. Otherwise, specify the full path to the
directory kafka/bin within your Kafka installation location, ending with a slash character (/).
Example locations:
• /usr/hdp/<version>/kafka/bin/
• /opt/cloudera/parcels/KAFKA/lib/kafka/bin/
NOTE: When you specify a value for this property, it must end with the slash character (/).
If you are not able to find this path for your cluster, the following command may help you locate it:
locate "kafka-topics.sh"
hdfsOutDir
This property specifies the output directory in HDFS into which you want the HDFS Bolt to
create the output files containing the ciphertext results of the Storm topology. Note that the
user submitting the topology must have write permission for this directory and must be able to
grant write permission to other users.
NOTE: The Kafka-Storm Developer Template’s scripts grant write permissions on this
directory to other users, since Storm workers run as user storm by default. Therefore, you
must specify a directory for which it is acceptable, in terms of the security of your HDFS, for
other users to have write permission. Alternatively, you can update the script run-storm-
topology to no longer grant this permission, instead configuring the Storm workers to run
as the user submitting the topology. See the documentation for your Storm installation for
details on the advanced supervisor.run.worker.as.user configuration setting.
For the simplest case, the value for this property can be the user's home directory in HDFS,
which you can specify by replacing the <username> place-holder with the appropriate
username value. Alternatively, you can specify any other directory in HDFS, as long as you have
(and can grant) write permissions on that directory.
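For reference, the following is a hedged sketch of how an HDFS bolt might be pointed at such an output directory, assuming the storm-hdfs HdfsBolt API with Storm 1.x package names; the rotation and sync policies, namenode URL parameter, and class name are illustrative and may differ from the template's actual HDFS bolt setup:

import org.apache.storm.hdfs.bolt.HdfsBolt;
import org.apache.storm.hdfs.bolt.format.DefaultFileNameFormat;
import org.apache.storm.hdfs.bolt.format.DelimitedRecordFormat;
import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy;
import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy.Units;
import org.apache.storm.hdfs.bolt.sync.CountSyncPolicy;

public final class HdfsBoltSketch {
    public static HdfsBolt build(String fsUrl, String hdfsOutDir) {
        return new HdfsBolt()
                .withFsUrl(fsUrl)                                // for example, hdfs://<namenode>:8020
                .withFileNameFormat(new DefaultFileNameFormat()
                        .withPath(hdfsOutDir)                    // the configured hdfsOutDir value
                        .withPrefix("vs-storm-sample-")          // matches the documented filename pattern
                        .withExtension(".txt"))
                .withRecordFormat(new DelimitedRecordFormat())   // one protected value per output line
                .withSyncPolicy(new CountSyncPolicy(10))         // sync to HDFS every 10 tuples
                .withRotationPolicy(new FileSizeRotationPolicy(5.0f, Units.MB));
    }
}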
NOTE: Before performing these steps, confirm that your Hadoop cluster has a home
directory in HDFS for the user account under which you plan to submit the template’s Storm
topology. For more information, see "Creating a Home Directory in HDFS" (page 3-75).
1. If you performed the steps to build the target Storm topology JAR file on a computer
outside your Hadoop cluster, copy the following directories from your build computer to
a local directory on your Hadoop cluster:
5. Change directory (cd) to the following directory to run the scripts in the following steps:
<install_dir>/stream/kafka_storm/bin
6. Confirm that you have correctly edited the script variables in the script configuration file
vsdistrib.properties to specify the distribution-specific settings and output
location. For more information, see "Editing the Distribution-Specific Run-Time Settings"
(page 11-11).
7. Run the script create-kafka-topic to create a new Kafka topic with the name “ssn”:
./create-kafka-topic ssn
8. Run the script run-storm-topology to submit the template’s Storm topology with
the name ssn-topology, specifying the Kafka topic ssn as the input source for the
Kafka Spout, and the cryptId name ssn to identify the cryptographic parameters to use
for the topology’s protect operations:
./run-storm-topology ssn ssn-topology ssn
The parameters to this script are: topic name, topology name, cryptId name, and
optionally, the API type (with the Simple API as the default choice). In this example, only
three parameters are provided and the topic name happens to be the same as the
cryptId name. They can be different.
NOTE: If you are using the deprecated Java Properties configuration files with the
Kafka-Storm Developer Template, specify a data protection format name as the third
parameter instead of a cryptId name.
9. Run the script run-kafka-sample-producer to publish the sample data in the file
ssn.txt to the Kafka topic ssn, as described in "Script Summary" later in this chapter.
10. Check the configured output directory in HDFS (specified by the hdfsOutDir property
in the script configuration file vsdistrib.properties) for the output file produced
by the Storm topology submitted in step 8 (ssn-topology). This topology reads input
records from the Kafka topic ssn, protects the plaintext in those records, and writes the
resulting ciphertext to a file in the specified output directory in HDFS. The form of the
name of the output file is as follows:
vs-storm-sample-<hdfs-bolt-identifier>.txt
11. Optionally, call the provided delete scripts to clean-up the template’s Storm topology, its
Kafka topic, and the output files created by the topology, respectively:
./delete-storm-topology ssn-topology
./delete-kafka-topic ssn
./delete-hdfs-files
The steps for running the template's Storm topology using the interactive approach, in this
case protecting credit card data using the REST API, are as follows:
1. Run the script create-kafka-topic to create a new Kafka topic with the name cc:
./create-kafka-topic cc
2. Run the script run-storm-topology to submit the template’s Storm topology with
the name cc-topology, specifying the Kafka topic cc as the input source for the Kafka
Spout, the cryptId name cc-sst-6-4 to identify the cryptographic parameters to use,
and the REST API for the topology’s protect operations:
./run-storm-topology cc cc-topology cc-sst-6-4 rest
NOTE: If you are using the deprecated Java Properties configuration files with the
Kafka-Storm Developer Template, specify a data protection format name as the third
parameter instead of a cryptId name.
3. Run the script run-kafka-console-producer to start publishing console input to the
Kafka topic cc:
./run-kafka-console-producer cc
Then, as you enter lines of data at the console prompt, one item per line, they will be
published to that topic. For example, try typing in the following sample credit card
numbers, entered separately:
1111-2222-3333-4444
2222-3333-4444-5555
4. Check the configured output directory in HDFS (specified by the hdfsOutDir property
in the script configuration file vsdistrib.properties) for the output file produced
by the Storm topology submitted in step 2 (cc-topology). This topology reads input
records from the Kafka topic cc, protects the plaintext in those records, and writes the
resulting ciphertext to a file in the specified output directory in HDFS. The form of the
name of the output file is as follows:
vs-storm-sample-<hdfs-bolt-identifier>.txt
The output file in HDFS is assigned a unique name by the HDFS Bolt, with a numeric
identifier. When you run the template's Storm topology, check the output directory in HDFS
for the exact name of this file, and then tail it using a command such as the following:
hdfs dfs -tail -f <OutDir>/vs-storm-sample-<UniqueId>.txt
Where <OutDir> is the output directory in HDFS and <UniqueId> is the unique portion of
the output filename assigned by the HDFS Bolt.
Then, in a separate console window, publish new input data to the Kafka topic (either using
the script run-kafka-sample-producer and provided sample data, or interactively using
the script run-kafka-console-producer), and watch as the protected ciphertext is
written to the end of this output file.
Note that it may take several seconds for the new ciphertext lines to be written to this output
file in HDFS, depending on when the HDFS Bolt syncs the output to the file system.
Script Summary
Examine the scripts in the bin directory to see how they call the Kafka and Storm commands in
the context of the Kafka-Storm Developer Template. This section summarizes these scripts.
create-kafka-topic
This script creates a new Kafka topic with the specified name.
Invocation:
./create-kafka-topic <topic_name>
This script expects a single parameter: the name of the Kafka topic to be created.
delete-hdfs-files
This script deletes the output file(s) in HDFS generated by the template’s Storm topology: all
files in the configured HDFS output directory, as specified by the hdfsOutDir property in the
script configuration file vsdistrib.properties, that match the filename pattern
vs-storm-sample-*.txt.
Invocation:
./delete-hdfs-files
delete-kafka-topic
This script deletes the Kafka topic with the specified name.
Invocation:
./delete-kafka-topic <topic_name>
This script expects a single parameter: the name of the Kafka topic to be deleted.
delete-storm-topology
This script terminates the Storm topology with the specified name.
Invocation:
./delete-storm-topology <topology_name>
This script expects a single parameter: the name of the Storm topology to be terminated.
run-kafka-console-producer
This script calls the script kafka-console-producer.sh that is provided in the core Kafka
installation, which reads lines from the console and publishes them to the specified Kafka topic.
Invocation:
./run-kafka-console-producer <topic_name>
This script expects a single parameter: the name of the Kafka topic to which console input will
be published.
run-kafka-sample-producer
This script runs the SampleKafkaProducer class in the Kafka-Storm Developer Template’s
target JAR file, which reads lines from the specified input file and publishes them to the
specified Kafka topic.
Invocation:
./run-kafka-sample-producer <input_file> <topic_name>
This script expects two parameters: 1) the path to the input file, and 2) the Kafka topic to which
lines in that file will be published.
run-storm-topology
This script runs the StormTopology class in the Kafka-Storm Developer Template’s target
JAR file, which submits a new Storm topology with the specified name. This topology reads
records from the specified Kafka topic and protects them using the cryptographic parameters
specified by the named cryptId (and optional API type).
Invocation:
./run-storm-topology <topic_name> <topology_name> <cryptId_name>
<optional_API_type>
This script expects three or four parameters: 1) the name of the Kafka topic from which to read
records, 2) the name under which to run the topology, 3) the name of a cryptId whose
cryptographic parameters are used during protect operations, and optionally 4) the type
of Voltage SecureData API to be used (either simpleapi (the default) or rest).
vsdistrib.properties
This configuration file is used by the other scripts in the bin directory (described above) to get
the required cluster settings from a single place in order to avoid redundant editing in multiple
scripts. It defines the following variables:
• kafkaBrokerList
• kafkaServerPropsFile
• kafkaBinDir
• hdfsOutDir
For more information, see "Editing the Distribution-Specific Run-Time Settings" (page 11-11).
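For reference, a populated vsdistrib.properties might look like the following, assuming simple name=value entries; all values are illustrative and must be replaced with the settings for your own cluster:

kafkaBrokerList=hostname1.domain.com:1234,hostname2.domain.com:1234
kafkaServerPropsFile=/etc/kafka/<version>/0/server.properties
kafkaBinDir=/usr/hdp/<version>/kafka/bin/
hdfsOutDir=/user/<username>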
Simplifications of the Kafka-Storm Developer Template
The Kafka-Storm Developer Template provides a very basic sample integration, showing a
simple Storm bolt that protects simple input data. Specifically, two of the simplifications made in
this template are:
• Simplified Data Processing
The sample Protect and Access Bolts operate on a single data item in each input tuple.
Your requirements are likely to require more complex processing logic, working on
multiple fields in the input tuple. If this is the case, use the sample bolts as a starting
point and customize the code to handle your more advanced scenario.
• Simplified Configuration
The Storm topology provided with the Kafka-Storm Developer Template reads its
configuration and authentication/authorization settings from local XML configuration files,
an approach that may be simpler than your scenario calls for. Your production
integration may require alternative approaches for configuring these settings. For more
information about alternative approaches to configuration for the Kafka-Storm
Developer Template, see "Alternative Approaches to Configuration" (page 11-8).
Also keep in mind that Storm bolts do not perform batch processing, instead processing
individual tuples in their execute method. This is especially important when your scenario calls
for sending multiple input items to the Protect Bolt. In such cases, if the Protect Bolt is
configured to perform remote cryptographic processing using the REST API, the network
overhead of this API will not allow the topology to perform well. For large scale message
processing, it is much more efficient to perform local cryptographic operations using the Simple
API. If you need to perform bulk input processing using the remote REST API, you should
consider using the batch processing technologies built on top of Storm, such as Apache
Trident.
12 Troubleshooting
This chapter describes some common problems that can arise when working with the Voltage
SecureData for Hadoop Developer Templates, and provides tips for solving them. It includes
the following topics:
• Calling the Simple API From More Than One NAR File (page 12-2)
• Simple Queries using Hive UDFs Fail on Some Hadoop Distributions (page 12-3)
• Queries Using Hive UDFs Fail with Literal Values (page 12-4)
• Hive Script Changes Required When Using Hive 3.0 (page 12-4)
• Hive Queries Fail in Kerberized Clusters When kinit Is Not Performed (page 12-4)
• Binary Hive UDFs Fail Due to Data Being Too Large for the REST API (page 12-4)
• Failure to Copy JAR Files to the hive/lib Directory on All Data Nodes (page 12-4)
• Sqoop Steps (Including codegen) Fail with DB Driver Error (page 12-5)
• Sqoop codegen Command Fails with Streaming Result Set Error (page 12-6)
• Sqoop Jobs Fail with ORM Class Method Error (page 12-7)
• Simple API Operations for Dates Fail with VE_ERROR_GENERAL Error (page 12-9)
• Simple API Operations Fail with Library Load Error (page 12-9)
• Simple API Operations Fail with Network Connection Error (page 12-9)
• Developer Templates Cannot Find the Configuration Files in HDFS (page 12-10)
• Remote Queries Using Hive UDFs Fail on BigInsight 4.0 (page 12-10)
• Hadoop Job Error: “Container exited with a non-zero exit code 134” (page 12-11)
• Hadoop Job Not Failing when Invalid Auth Credentials Used for Access (page 12-11)
• Hadoop Job Fails with REST on Older Voltage SecureData Servers (page 12-11)
• Hadoop Job Fails with Specific REST API Error Code and Message (page 12-11)
• Hadoop Job Tasks Fail When Voltage SecureData Server is Overloaded (page 12-12)
Hadoop Build Issues
This section discusses issues that can arise when building the Hadoop Developer Templates.
If a build fails because the glib package is missing, install the Glib* package for your Linux
distribution. For example, on CentOS 6, run the command yum install glib*.
This section discusses issues that can arise when building the NiFi Developer Template.
Calling the Simple API From More Than One NAR File
If your scenario involves building more than one NAR file that calls the Simple API, you must
build those NAR files such that the dependent JAR files, including the Simple API JAR file
vibesimplejava.jar, are not included in those NAR files. Instead, you will take steps to get
those JAR files loaded in a way that allows the NAR files to share them without trying to load
them more than once. For more information, see the relevant build note in "Build Notes" (page
2-25).
Issues Running Hadoop Jobs
Errors that occur when running a Hadoop job can be caused by issues with your Hadoop
implementation, rather than with the Voltage SecureData software. Be sure that you can
successfully run MapReduce, Hive, and Sqoop jobs that do not require the Voltage SecureData
software.
Job errors can also be caused by connection or authentication errors with the Voltage
SecureData Server, or issues with the data protection APIs (the Simple API and the REST API).
If you encounter an error from the Simple API when running the Developer Template code,
verify that you have followed the Simple API installation and verification steps, and that you can
run the sample code provided with the Simple API as the user under which you will be running
the Developer Templates. For example, some errors might be related to file permissions. See
the Voltage SecureData Simple API Developer Guide for details about specific error codes.
• The second (uber) JAR file below depends on the first JAR file, so the first JAR file
must be specified before the second when creating temporary or permanent UDFs:
2. voltage-hadoop.jar (depends on 1)
NOTE: The JAR file voltage-hadoop.jar is built as an uber JAR file that contains
vsrestclient.jar and its JSON and HTTP Client library dependencies.
NOTE: For some Hadoop distributions, it is necessary to copy these two JAR files
(and the configuration JAR file, vsconfig.jar, if used) to the hive/lib directory
on all data nodes in your Hadoop cluster. While not required for every Hadoop
distribution, this action is otherwise harmless and is documented here as current best
practice for the Hadoop Developer Templates on all Hadoop distributions. For more
information, see "Setting Up to Run the Hive Developer Template" (page 5-24).
• Queries run from a computer outside the Hadoop cluster using a JDBC/ODBC call must
use permanent UDFs.
• Queries run from a node within the Hadoop cluster may use either temporary or
permanent UDFs (permanent UDFs are recommended).
Hive Queries Fail in Kerberized Clusters When kinit Is Not Performed
If you encounter this error, run kinit as usual on your kerberized Hadoop cluster.
Binary Hive UDFs Fail Due to Data Being Too Large for the REST API
By default, the Voltage SecureData Server limits the size of Web Service data to 25 MB, which
may not be sufficient for large data, such as images and video, being protected and accessed
using the binary Hive UDFs with the REST API.
For more information about this type of error and its possible solutions, see "Size Restrictions
When Using the REST API" (page 5-8).
Failure to Copy JAR Files to the hive/lib Directory on All Data Nodes
You should always copy the required two (or three) JAR files to the hive/lib directory on all
data nodes in your Hadoop cluster. While this is not required for all of the supported versions of
all of the supported Hadoop distributions, it should be harmless on those for which it is not
required. Given the general ease of file distribution in Hadoop clusters, the simplest approach is
to include this copy step, regardless of Hadoop distribution and version.
If you fail to perform this step for a Hadoop cluster for which this requirement exists, you will
encounter ClassLoader issues that generate error messages like the following:
Caused by: java.lang.RuntimeException: Failed to initialize Simple API
at com.voltage.securedata.crypto.LocalCrypto.initSimpleAPI(LocalCrypto.java:93)
at com.voltage.securedata.crypto.LocalCrypto.<init>(LocalCrypto.java:185)
at com.voltage.securedata.crypto.CryptoFactory.getCrypto(CryptoFactory.java:123)
at com.voltage.securedata.hadoop.hive.BaseHiveUDF.getCrypto(BaseHiveUDF.java:106)
at com.voltage.securedata.hadoop.hive.BaseHiveUDF.evaluate(BaseHiveUDF.java:207)
... 44 more
Caused by: java.lang.UnsatisfiedLinkError: Native Library /opt/voltage/simpleapi/
voltage-simple-api-java-5.0.0-Linux-x86_64-64b-r213914/lib/libvibesimplejava.so
already loaded in another classloader
at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1903)
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1822)
For more information about this distribution-independent Hive preparatory step, see "Setting
Up to Run the Hive Developer Template" (page 5-24).
This error can indicate that you are attempting to run the MapReduce job from a node that is
not allowed to submit MapReduce jobs (some nodes might not be configured as MapReduce
clients). Verify that your node is permitted to submit regular MapReduce jobs before
attempting to run the Developer Templates.
Sqoop Steps (Including codegen) Fail with DB Driver Error
This error indicates that you have not installed the MySQL JDBC driver needed to connect to
your MySQL database server. Consult your Hadoop distribution’s documentation for
information about how to install the required JDBC driver JAR file. For example, in some
distributions, you must copy the JAR file to the sqoop lib directory (such as /usr/lib/
sqoop/lib).
Sqoop codegen Command Fails with Streaming Result Set Error
If you see this error, it concerns Sqoop and older versions of the MySQL connector and is not
related to the Hadoop Developer Templates per se. The root cause is an incompatible
connector JAR file. The issue has been detected in the following Hadoop distributions (and it
may occur in other distributions going forward):
• HDP 2.3
• CDH 5.4
For more information, see the Sqoop bug at the following URL:
https://issues.apache.org/jira/browse/SQOOP-1400
The version of Sqoop that comes with the affected distributions is not compatible with the file
mysql-connector-java-5.1.17.jar. The workaround/solution is to use a newer version
of this connector JAR file, such as mysql-connector-java-5.1.31.jar (or newer).
There are two approaches to working around this problem, described below. Try the simple
work-around first, and if it does not work, try the advanced work-around that has appeared in
previous versions of this document.
Simple Work-Around
In most environments, the simple work-around for this is to explicitly specify the JDBC driver to
use, by adding the argument --driver com.mysql.jdbc.Driver to the sqoop codegen
command:
sqoop codegen \
--username $DATABASE_USERNAME \
-P \
--connect jdbc:mysql://$DATABASE_HOST/$DATABASE_NAME \
--table $TABLE_NAME \
--driver com.mysql.jdbc.Driver \
--class-name com.voltage.sqoop.DataRecord \
--bindir . \
--outdir .
exit $?
Including the --driver argument in the sqoop codegen command instructs Sqoop to use
the most recent connector JAR file for MySQL installed on the machine, which in most cases will
pick up the correct compatible connector.
Advanced Work-Around
If the simple work-around described above did not work as expected, try this more advanced
work-around, which involves downloading the latest MySQL connector from the MySQL Web
site at the following URL (access to which may require an Oracle account):
http://www.mysql.com/downloads/
After you have downloaded a new MySQL connector JAR file, the exact steps for installing it
depend on your specific Hadoop distribution. For example, for the Hortonworks HDP 2.3
distribution, perform the following steps on all Sqoop client nodes in your Hadoop cluster
before attempting the sqoop codegen command again:
3. Point the symbolic link in this directory from the mysql-connector-java.jar to this
new JAR file:
> cd /usr/share/java
> ln -s mysql-connector-java-<version-suffix>.jar \
> mysql-connector-java.jar
Where <version-suffix> is the suffix of the new MySQL connector JAR file you
downloaded. For example, the full filename may be:
mysql-connector-java-5.1.37-bin.jar
Sqoop Jobs Fail with ORM Class Method Error
If you see this error when running the Sqoop job, check the configuration file vsconfig.xml
to make sure the column names in the Sqoop section exactly match the column names in the
database table, including their case. You can also look at the source code of the generated ORM
class file DataRecord.java, to see the exact names of the generated getter methods.
For example, when generating the ORM class from a table in an Oracle database, you have to
specify table and schema names in uppercase when running the codegen command.
Therefore, the field names configured in the configuration file vsconfig.xml also need to be
specified using uppercase:
<field name = "EMAIL" cryptId = "alpha"/>
This is required to match the generated getter method in the ORM class: get_EMAIL. Although
Oracle will interactively accept lowercase names, the Sqoop integration uses the case-sensitive
Java Reflection API and will not treat email the same as EMAIL.
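A small illustration of the case sensitivity involved, assuming the generated ORM class is com.voltage.sqoop.DataRecord with a getter named get_EMAIL as in the example above:

import java.lang.reflect.Method;

public final class GetterLookupSketch {
    public static void main(String[] args) throws Exception {
        Class<?> orm = Class.forName("com.voltage.sqoop.DataRecord"); // generated ORM class

        // Succeeds: the name matches the generated getter exactly, including case.
        Method matches = orm.getMethod("get_EMAIL");

        // Throws NoSuchMethodException: reflection lookups are case-sensitive,
        // so "get_email" does not match "get_EMAIL".
        Method fails = orm.getMethod("get_email");
    }
}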
Simple API:
VE_ERROR_CANNOT_VERIFY_CERT
REST API:
sun.security.validator.ValidatorException: PKIX path building
failed: sun.security.provider.certpath.SunCertPathBuilderException:
unable to find valid certification path to requested target
Either of these errors indicates that the trusted root certificate that signed the TLS certificate
for your Voltage SecureData Server is not trusted by the data protection API in use.
If you are using an untrusted TLS certificate for your Voltage SecureData Server, you must
install the corresponding trusted root certificate into the Simple API trustStore directory on
every data node upon which the Simple API is installed. You must also run the c_rehash
command in each of those trustStore directories. For more information, see the Voltage
SecureData Simple API Installation Guide.
NOTE: The sample code in the Developer Templates uses the same trustStore directory
for the Simple API and the REST API, as explained in "Multiple Developer Template
TrustStores - Background and Usage" (page 3-56). No additional steps are required to
update the JVM truststore used by the REST API.
Simple API Operations for Dates Fail with VE_ERROR_GENERAL Error
If this error occurs, use the class SimpleAPIDateTranslator to convert date formats to and
from the date format expected by the Simple API. For more information, see "Data Translation"
(page 3-67).
For shared secret authentication, the underlying Voltage SecureData cryptographic code on
the client includes a time stamp in the hash-protected internal authentication token, allowing
for expiration after 24 hours. This means that incorrect clock settings on either the client or the
server can cause authentication failures if the Voltage SecureData Key Server detects that the
“issuance date” in the token is too old. When this occurs, the following message can be found in
the Key Server debug logs:
Simple API Operations Fail with Library Load Error
If you see this type of error, make sure that the Simple API is properly installed on all Hadoop
data nodes in your cluster and that the permission settings for the shared library file
libvibesimplejava.so are set correctly to allow the corresponding user to run the job.
This can occur when the permissions for the trustStore directory have been set manually, or
if there are restrictive umask settings on the node. You must reset the permissions for the
trustStore directory on all nodes to be world-readable.
This workaround, which is primarily needed for the HDP 2.2 distribution, assumes the following
location (which is the default location for HDP 2.2) for the core-site.xml file:
/etc/hadoop/conf/core-site.xml
Verify that your version of the file core-site.xml is in the directory /etc/hadoop/conf
before you do a build. If the file is in a different location, you can update the following line of
code in the file HDFSConfigLoader.java with the correct location:
private static final String CORE_SITE_XML =
"/etc/hadoop/conf/core-site.xml";
If you installed the Simple API in the directory /root, you must reinstall it in a different location.
To resolve this, restart HiveServer2 and re-run the Remote Hive query.
Hadoop Job Error: “Container exited with a non-zero exit code 134”
The full error will look something like this:
Exit code: 134
Stack trace: ExitCodeException exitCode=134:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
...
at java.lang.Thread.run(Thread.java:745)
Shell output: main : command provided 1
Container exited with a non-zero exit code 134
If the Hadoop job (for example, MapReduce) fails with this error, confirm that the permissions
on the Developer Templates directory specify that it is accessible by the user running the job. A
typical cause for this error is if the Developer Templates directory was copied from another
location as the root user and is not set for access by other users.
Hadoop Job Not Failing when Invalid Auth Credentials Used for Access
If a Hadoop access (decryption or de-tokenization) job runs to completion without throwing an
exception, even though the authentication/authorization credentials are not correct, check the
value of the following configuration setting in the configuration file vsconfig.xml:
<general returnProtectedValueOnAccessAuthFailure="true_or_false" />
If this value is set to true, then this behavior is expected: authentication/authorization failures
are trapped and ignored during access operations, with the protected values returned instead
as part of the successful completion. If you do not want this behavior, change this configuration
setting to false. Note that false is the default setting.
Hadoop Job Fails with REST on Older Voltage SecureData Servers
If your Voltage SecureData Server has a version earlier than 6.0, it does not support the REST
API. If you want to use the REST API, you must upgrade your Voltage SecureData Server to
version 6.0 or later.
Hadoop Job Fails with Specific REST API Error Code and Message
If the Hadoop job fails with an HTTP status, error code, and message of the following form, the
Voltage SecureData REST API Developer Guide will provide additional details about the error:
httpStatus: <status_number>; code: <error_code>; message: <message_text>
This error usually indicates that the data was not loaded into the database table as UTF8,
causing corruption in the interpreted bytes. To correct this, make sure you include the directive
to specify the character set as UTF8 so that the non-ASCII data in the name column is loaded
correctly:
LOAD DATA LOCAL INFILE '/<your_absolute_path>/plaintext.csv' INTO
TABLE voltage_sample CHARACTER SET UTF8 FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
Hadoop Job Tasks Fail When Voltage SecureData Server is Overloaded
The exact error messages in the log files on the Hadoop client and Voltage SecureData Server
vary, but the general behavior is a failure of one or more tasks to download keys from the Key
Server, even though there are no problems (such as invalid authentication credentials) with the
actual requests themselves.
If you experience this issue, note that you generally need to scale your Voltage SecureData
Server cluster to handle the load coming from your Hadoop cluster. There is no strict formula
for determining exactly how many hosts you need in your Voltage SecureData Server cluster,
but the general recommendation is about one Voltage SecureData Server (each configured
with 1000 server threads) per 1000 MapReduce containers. For example, a Hadoop cluster
that runs 4000 concurrent MapReduce containers would start with approximately four
Voltage SecureData Servers. You may have to try a few different cluster configurations,
starting with this recommendation, to find the one that works best for you.
If you attempt to run the Spark Developer Template against Spark1, the job fails with a stack
trace similar to the following:
at com.voltage.securedata.spark.rdd.SDSparkDriver$.main(SDSparkDriver.scala:60)
at com.voltage.securedata.spark.rdd.SDSparkDriver.main(SDSparkDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl
.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$
.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:730)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Spark2 is significantly different from the original version of Spark (Spark1), including internal
changes and changes to the API. The Spark Developer Template requires Spark2 to execute
successfully. You can check your version of Spark using the following command:
spark-submit --version
A Voltage SecureData Server Configuration
The formats, identities, and authentication credentials that work with the Developer Templates
are specific to the dataprotection Voltage SecureData Server hosted by Micro Focus Data
Security (voltage-pp-0000.dataprotection.voltage.com). If you don’t have access to
this hosted Voltage SecureData Server, you will need to configure your own Voltage
SecureData Server with identical settings in order to run the Developer Templates, as delivered.
This section specifies those settings. For additional assistance, contact your Voltage
SecureData administrator or see the Voltage SecureData Administrator Guide.
Formats
A format specifies the settings used during data protection operations. These settings include
restrictions on input data, the appearance of the protected data, and the type of protection,
either encryption or tokenization. The settings are bundled together as a format that can be
referenced by name. The format name is a required cryptId setting in the XML configuration file
vsconfig.xml. For details, see "Format" (page 3-20).
Use the Management Console to configure the formats you need to protect and access your
data. Navigate to the Data Protection Settings > Format Settings tab on the Management
Console to create these formats. You can use either encryption or tokenization to protect credit
card numbers and social security numbers. You can only use encryption to protect strings,
numbers, and dates.
The following formats, used by the Developer Templates, are already configured on the
dataprotection Voltage SecureData Server:
• Alphanumeric
• SSN
• cc-sst-6-4
• DATE-ISO-8601
• AlphaExtendedTest1 (Required for the REST API only, and available only in
versions 6.0 and later of the Voltage SecureData Server)
The Alphanumeric and SSN formats are pre-configured on all Voltage SecureData Servers, but
if you are not using the dataprotection Voltage SecureData Server, you must configure the
latter three formats above.
Navigate to the Data Protection Settings > Format Settings tab on the Management
Console if you need to create these formats.
cc-sst-6-4 Format
Click the Create Credit Card Format link to open the Add new Credit Card format page.
Configure the new credit card format using the following settings:
DATE-ISO-8601 Format
Click the Create Date Format link to open the Add new Date format page. Configure the new
date format using the following settings:
AlphaExtendedTest1
To create an extended alphabet format for use with the REST API, click the Create Variable-
length String Format link to open the Add new Variable Length String format page.
Configure the new variable-length string format using the following settings:
• Alphabet:
0x41-0x5A,0x61-0x7A,0xC0,0xC2,0xC4,0xC6-0xCB,0xCE-0xCF,0xD4,0xD6,
0xD9,0xDB-0xDC,0xDF,0xE0-0xE2,0xE4,0xE6-0xEB,0xEE-0xEF,0xF4,0xF6,
0xF9,0xFB-0xFC,0x130-0x131,0x11E-0x11F,0x15E-0x15F
Identity
The REST API limits access to data protection and access operations by using an identity to
determine the level of authorization.
NOTE: Unlike the REST API, the Simple API only supports authentication, without different
levels of authorization.
Encryption formats can be used by multiple identities, but each tokenization format is bound to
a single identity. A Voltage SecureData administrator specifies the identity when creating a
tokenization format, and then sets up authorization rules for all identities that are to be used for
authorization. The identity is a required cryptId setting in the configuration file vsconfig.xml.
For more information, see "Authentication and Authorization Overview" (page 3-1) and
"Identity" (page 3-17).
The rules, which are configured using the Web Service > Identity Authorization tab of the
Management Console, control which operations can be performed when using a specific set of
authentication credentials and a matching identity.
• Protect - Permits a user to encrypt or tokenize data. Without permission to Protect, the
user can perform decryption and de-tokenization actions only.
For example, if you need to both tokenize and de-tokenize data, with the ability to see the
plaintext values for the de-tokenized data, the Voltage SecureData administrator must enable
an identity authorization rule that authorizes both Protect and Full Access for the identity
bound to the tokenization format. When you run the Voltage SecureData for Hadoop software,
you must use authentication credentials that meet the criterion for matching this rule.
NOTE: If an identity authorization rule grants masked access, some or all of the plaintext
values in the output will display mask characters instead of the plaintext values. If an identity
authorization rule specifies no access, the plaintext values will not be displayed at all.
Authentication
By default, the Voltage SecureData for Hadoop Developer Templates use the shared secret
authentication method that is already configured on the dataprotection Voltage
SecureData Server. If you are not using that Voltage SecureData Server, you must configure the
shared secret or LDAP (username/password) authentication method in both the Key
Management > Authentication tab and the Web Service > Identity Authorization tab in the
Management Console.
NOTE: The default values in the configuration file vsauth.xml use an identity of
[email protected] and a Shared Secret authentication method with the secret value of
voltage123. If you are using different values on your Voltage SecureData Server, you must
update the configuration file vsauth.xml with those values.
Contact your Voltage SecureData administrator or see the Voltage SecureData Administrator
Guide for details about configuring an identity and authentication method.
Authentication verifies that users running protect and access operations have identified
themselves to the Voltage SecureData Server with valid credentials. Two types of
authentication can be used:
• Shared Secret - An arbitrary string (used like a password) that is shared between the
Voltage SecureData for Hadoop software and the Voltage SecureData Server. A
Voltage SecureData administrator enters this string when creating the authentication
method, and communicates its value to users of client applications.
• LDAP (username/password) - A username and password that the Voltage SecureData
Server validates against an LDAP directory.
The Voltage SecureData Server verifies that the authentication credentials used in the protect
or access operations are valid for at least one of the authentication methods available for that
Voltage SecureData Server. Authentication information must be included in the configuration
file vsauth.xml. For more information, see "Authentication and Authorization Overview" (page
3-1).
You must ensure that at least one authentication method is configured in both the Key
Management > Authentication tab and the Web Service > Identity Authorization tab in the
Management Console.
NOTE: After changing the settings in the Management Console, navigate to the System tab
and click the Deploy button.