Voltage SecureData Developer Templates for Hadoop 5.0 (July 2022 Update)
Integration Guide
July 2022
Legal notices
© Copyright 2011, 2014, 2016-2020, 2022 Micro Focus or one of its affiliates.
The only warranties for products and services of Micro Focus and its affiliates and licensors (“Micro Focus”) are as may be set forth in the
express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional
warranty. Micro Focus shall not be liable for technical or editorial errors or omissions contained herein. The information contained herein is
subject to change without notice.
Except as specifically indicated otherwise, this document contains confidential information and a valid license is required for possession, use
or copying. If this work is provided to the U.S. Government, consistent with FAR 12.211 and 12.212, Commercial Computer Software,
Computer Software Documentation, and Technical Data for Commercial Items are licensed to the U.S. Government under vendor's standard
commercial license.
Contents
Building the DataStream Developer Templates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-19
Build Prerequisites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-20
Editing the Properties in the Parent Maven POM File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-20
Build Steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-24
Chapter 3: Common Infrastructure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-1
Authentication and Authorization Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-1
Kerberos Delegation Token HDFS Location . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-3
Configuration Step Summary for Kerberos Authentication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-3
Additional Kerberos Steps on the KDC Computer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-4
Configuration Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-5
Domain Name . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-6
Hostname . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-7
Simple API Install Path . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-7
Simple API Policy URL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-8
Simple API Cache Type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-9
Simple API File Cache Path . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-10
Simple API Short FPE Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-10
Web Service Batch Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-11
REST Hostname . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-11
Authentication/Authorization Failure on Access Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . 3-12
Product Name . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-14
Product Version . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-15
Component-Specific Designator for Client ID . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-17
Identity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-17
API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-18
CryptId Name . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-19
Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-20
CryptId AuthId Name . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-20
Translator Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-21
Translator Initialization Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-22
CryptId Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-22
Component-Specific Designator for Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-23
Field Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-23
Field Name . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-24
CryptId Name for Field . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-25
Delegation Token HDFS Path . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-26
Authentication/Authorization Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-26
Shared Secret . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-28
Username . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-29
Password . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-30
AuthId Name . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-31
XML Configuration Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-32
vsconfig.xml . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-33
High-Level Elements in vsconfig.xml . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-33
Attribute Values in vsconfig.xml . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-37
vsauth.xml . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-38
High-Level Elements in vsauth.xml . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-39
Element and Attribute Values in vsauth.xml . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-40
vs<component>.xml . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-41
High-Level Elements in vs<component>.xml . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-42
Attribute Values in vs<component>.xml . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-43
Java Properties Configuration Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-45
vsnifi.properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-45
Specifying the Location of the XML Configuration Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-47
-D Generic Option to Specify a Property Value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-47
Config-Locator Properties File Packaged as a JAR File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-48
Precedence When Checking for XML Configuration File Locations . . . . . . . . . . . . . . . . . . . . 3-50
Other Approaches to Providing Configuration Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-52
Shared Integration Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-54
Utility and Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-55
Package Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-55
Multiple Developer Template TrustStores - Background and Usage . . . . . . . . . . . . . . . . 3-56
Authentication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-56
Package Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-57
Common Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-57
Package Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-57
Package Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-58
Hadoop Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-60
Package Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-61
Package Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-62
Shared Code for the DataStream Developer Templates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-63
Package Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-64
Package Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-64
Data Conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-65
Package Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-65
Package Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-66
Data Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-67
Package Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-67
Package Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-67
Cryptographic Abstraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-68
Package Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-69
Package Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-69
Using Old Versions of Other Voltage SecureData Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-71
Shared Sample Data for the Hadoop Developer Templates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-73
plaintext.csv . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-73
creditscore.csv . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-74
Common Procedures for the Hadoop Developer Templates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-75
Common HDFS Procedures for the Hadoop Developer Templates . . . . . . . . . . . . . . . . . . . . 3-75
Creating a Home Directory in HDFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-75
Loading Hadoop Developer Template Files into HDFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-76
Loading Updated Configuration Files into HDFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-77
Common Procedures for Working with Kerberos Authentication . . . . . . . . . . . . . . . . . . . . . . 3-78
Prerequisites for Using Kerberos Authentication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-78
The Hadoop Developer Templates Delegation Token Scripts . . . . . . . . . . . . . . . . . . . . . . 3-79
Getting Your Kerberos Ticket Granting Ticket . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-83
Getting And Storing Your Kerberos Delegation Token . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-83
Run Your Hadoop Developer Template Job or Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-84
Optional Destruction of the Delegation Token . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-84
Kerberos Authentication When Beeline/HiveServer2 Impersonation is Disabled . . . . 3-84
Logging and Error Handling in the Hadoop Developer Templates . . . . . . . . . . . . . . . . . . . . . . . 3-87
Handling Empty and Net-Empty Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-88
Known Limitations of the Developer Templates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-91
Hadoop Developer Template Code Needs a Full CSV Parser . . . . . . . . . . . . . . . . . . . . . . . . . . 3-91
Additional Verification of Converter and Translator Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-91
Additional Robustness for Java Properties Configuration File Parsing . . . . . . . . . . . . . . . . . 3-91
Chapter 4: MapReduce Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-1
Integration Architecture of the MapReduce Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-1
Batch Processing for MapReduce . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-3
Configuration Settings for the MapReduce Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-4
Updating Field Settings for the MapReduce Developer Template . . . . . . . . . . . . . . . . . . . . . . 4-5
Running the MapReduce Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-6
Chapter 5: Hive Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-1
History of Hive Support in the Hive Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-2
Hive Developer Template 3.1 and Earlier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-2
Hive Developer Template 3.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-2
Hive Developer Template 4.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-3
Hive Developer Template 4.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-3
Hive Developer Template 4.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-4
Hive Developer Template 5.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-4
Different Types of Hive UDFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-4
Hive UDFs for Formatted Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-5
Hive UDFs for Binary Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-6
Special Considerations When Using the Binary Hive UDFs . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-8
Integration Architecture of the Hive Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-12
Java Classes for the Hive UDFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-13
Creating and Calling Hive UDFs from the Hive Prompt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-14
Limitations of the Hive UDFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-15
Hive UDFs Work on One Data Value at a Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-15
Hive UDF Failures with Literal Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-16
Hive UDF Failures with HortonWorks HDP 2.2 and later . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-18
Major Changes in Hive 3.0 and the Script Changes They Required . . . . . . . . . . . . . . . . . 5-18
No Batch Processing for Hive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-19
Integration with Apache Impala . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-19
Limitations and Special Requirements When Using Impala . . . . . . . . . . . . . . . . . . . . . . . . . 5-20
Configuration Settings for the Hive Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-21
Example of Updating Settings in the HIVE Section . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-22
Running the Hive Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-23
Setting Up to Run the Hive Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-24
Edit the Hive Scripts to Replace <username> Placeholder . . . . . . . . . . . . . . . . . . . . . . . . . . 5-24
Copy the Required JAR Files to Your Data Nodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-25
Create the Hive Tables for the Hive Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-25
Enable Impersonation for Beeline and Remote Hive Queries . . . . . . . . . . . . . . . . . . . . . . . . 5-26
Running Queries Locally From a Node Within Your Hadoop Cluster . . . . . . . . . . . . . . . . . . . 5-27
Running a JOIN Query Using a Script . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-27
Running a Binary Data Query Using a Script . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-28
Running a Simple HiveQL Query Interactively . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-28
Running Hive Queries Using the Beeline Shell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-29
Running a Hive Query from a Remote Computer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-30
Creating Permanent Hive UDFs Using the Hive Command Line . . . . . . . . . . . . . . . . . . . . 5-31
Running a Remote Query Using JDBC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-31
Running a Remote Query Using ODBC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-38
Using the Generic Hive UDFs When Impersonation is Disabled . . . . . . . . . . . . . . . . . . . . . . . . 5-40
Running Queries in the Context of Hive LLAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-43
Running Queries Using Apache Impala . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-44
Copy JAR Files to the Nodes Running the Impala Daemon . . . . . . . . . . . . . . . . . . . . . . . . . 5-44
Preparing Protected Data to Load into Impala Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-44
Create the Impala Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-46
Create the Impala UDFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-47
Run Impala Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-48
Drop Impala UDFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-48
Chapter 6: Sqoop Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-1
Integration Architecture of the Sqoop Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-2
Batch Processing for Sqoop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-3
Configuration Settings for the Sqoop Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-4
Example of Updating Settings in the SQOOP Section . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-5
Running the Sqoop Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-6
Load Sample Data into MySQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-7
Generate an ORM JAR File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-7
Import and Protect Your Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-9
Display the Protected Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-10
Advanced Sqoop Import Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-10
Non-Batched Version of the Sqoop Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-12
Determine Whether the Sqoop Integration Supports a Sqoop Import Option . . . . . . . 6-13
Chapter 7: Spark Integration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-1
Integration Architecture of the Spark Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-3
RDD and Dataset Variants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-4
RDD and Dataset Driver Functionality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-4
RDD and Dataset Processor Functionality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-5
UDF-Based Variants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-8
DataFrame, Spark SQL, and HiveUDF Driver Functionality . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-8
DataFrame, Spark SQL, and HiveUDF Processor Functionality . . . . . . . . . . . . . . . . . . . . . . 7-11
Logging and Error Handling in the Spark Developer Template . . . . . . . . . . . . . . . . . . . . . . . . 7-12
Configuration Settings for the Spark Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-12
Alternative Approaches to Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-14
Sample Data for the Spark Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-14
Running the Spark Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-14
Run-Time Prerequisites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-15
Changing the Input and Output Locations and Filenames on HDFS . . . . . . . . . . . . . . . . . . . 7-15
Steps to Run the Spark Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-16
Hadoop Distribution Dependencies for Running the Sample Jobs . . . . . . . . . . . . . . . . . . . . . 7-18
Spark Script Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-19
run-spark-prepare-job . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-19
run-spark-protect-rdd-job . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-20
run-spark-protect-dataset-job . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-20
run-spark-protect-dataframe-job . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-20
run-spark-protect-sql-job . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-20
run-spark-protect-hive-job . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-20
run-spark-access-rdd-job . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-21
run-spark-access-dataset-job . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-21
run-spark-access-dataframe-job . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-21
run-spark-access-sql-job . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-21
run-spark-access-hive-job . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-21
run-pyspark-protect-dataframe-job . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-22
run-pyspark-protect-sql-job . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-22
run-pyspark-protect-hive-job . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-22
run-pyspark-access-dataframe-job . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-23
run-pyspark-access-sql-job . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-23
run-pyspark-access-hive-job . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-23
update-spark-config-files-in-hdfs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-24
TrustStores Used by the Spark Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-24
Using the Spark Web Server . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-24
Limitations of the Spark Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-26
Chapter 8: NiFi Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-1
Quick Start Using the Provided NiFi Workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-2
Exercising the SecureDataExample Workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-6
Using the Workflow With a Different Voltage SecureData Server . . . . . . . . . . . . . . . . . . . . . . . 8-7
Integration Architecture of the NiFi Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-8
Processor Classes for the NiFi Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-8
Configuration Classes for the NiFi Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-9
Logging and Error Handling in the NiFi Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . 8-10
Configuration Settings for the NiFi Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-10
Configuring the Properties of the NiFi SecureDataProcessor . . . . . . . . . . . . . . . . . . . . . . . . . . 8-12
Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-12
Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-13
Identity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-14
Auth Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-15
SharedSecret . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-15
Username . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-15
Password . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-15
API Type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-16
SecureDataProcessor Property Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-16
SecureDataProcessor Relationship Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-17
Sample Data for the NiFi Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-17
Adding the SecureDataProcessor to a Blank Workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-18
Limitations and Simplifications of the NiFi Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . 8-20
Chapter 9: StreamSets Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-1
Quick Start Using the Provided StreamSets Pipelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-2
Exercising the Sample Pipelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-4
Using the Pipelines with a Different Voltage SecureData Server . . . . . . . . . . . . . . . . . . . . . . . . 9-6
Integration Architecture of the StreamSets Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . 9-7
Processor Classes for the StreamSets Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-8
Configuration Classes Used by the StreamSets Developer Template . . . . . . . . . . . . . . . . . . . 9-9
Logging and Error Handling in the StreamSets Developer Template . . . . . . . . . . . . . . . . . . . 9-9
Configuration Settings for the StreamSets Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . 9-10
Configuring the Settings of the Voltage SDProcessor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-11
Operation Type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-11
Config Location . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-11
Field to Process and CryptId Pairs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-11
Sample Data for the StreamSets Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-11
Adding the Voltage SDProcessor to a Blank Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-12
Creating a New Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-12
Adding and Configuring an Origin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-13
Adding and Configuring the Voltage SDProcessor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-15
Adding and Configuring a Destination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-16
Previewing the Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-18
Running the Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-20
Limitations of the StreamSets Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-21
Chapter 10: Kafka Connect Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-1
Integration Architecture of the Kafka Connect Developer Template . . . . . . . . . . . . . . . . . . . . . . 10-2
Transformation Classes for the Kafka Connect Developer Template . . . . . . . . . . . . . . . . . . . 10-3
Configuration Classes Used by the Kafka Connect Developer Template . . . . . . . . . . . . . . . 10-4
Kafka Connect Developer Template Transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-4
Logging in the Kafka Connect Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-5
Configuration Settings for the Kafka Connect Developer Template . . . . . . . . . . . . . . . . . . . . . . . 10-5
Kafka Connect Java Properties Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-5
connect-standalone-worker.properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-5
connect-file-source-protect.properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-6
connect-file-sink-access.properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-7
connect-file-sink.properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-8
Kafka Connect Developer Template XML Configuration Files . . . . . . . . . . . . . . . . . . . . . . . . . 10-9
Configuration File Location . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-10
Sample Data for the Kafka Connect Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-11
Running the Kafka Connect Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-12
Run-Time Prerequisites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-12
Editing the Distribution-Specific Run-Time Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-12
kafkaBrokerList . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-12
kafkaServerPropsFile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-12
kafkaBinDir . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-13
Steps to Run the Kafka Connect Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-13
Variations on Running the Kafka Connect Developer Template . . . . . . . . . . . . . . . . . . . 10-15
Script Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-16
create-kafka-topic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-16
delete-kafka-topic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-16
run-kafka-connect-protect-transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-16
vsdistrib.properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-17
Limitations of the Kafka Connect Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-17
Chapter 11: Kafka-Storm Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-1
Integration Architecture of the Kafka-Storm Developer Template . . . . . . . . . . . . . . . . . . . . . . . . 11-2
Kafka Producer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-3
Protect and Access Bolts, and Storm Topology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-3
Configuration Classes Used by the Kafka-Storm Developer Template . . . . . . . . . . . . . . . . . 11-4
Overview of the Storm Bolt Lifecycle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-5
Logging and Error Handling in the Kafka-Storm Developer Template . . . . . . . . . . . . . . . . . 11-6
Configuration Settings for the Kafka-Storm Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . 11-7
Alternative Approaches to Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-8
Sample Data for the Kafka-Storm Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-10
Running the Kafka-Storm Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-10
Run-Time Prerequisites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-11
Editing the Distribution-Specific Run-Time Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-11
kafkaBrokerList . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-11
kafkaServerPropsFile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-11
kafkaBinDir . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-12
hdfsOutDir . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-12
Steps to Run the Kafka-Storm Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-12
Running the Kafka Console-Producer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-15
Script Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-16
create-kafka-topic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-16
delete-hdfs-files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-16
delete-kafka-topic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-17
delete-storm-topology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-17
run-kafka-console-producer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-17
run-kafka-sample-producer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-17
run-storm-topology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-18
vsdistrib.properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-18
Simplifications of the Kafka-Storm Developer Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-18
Chapter 12: Troubleshooting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-1
Hadoop Build Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-2
GLib Error During Build . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-2
NiFi Build Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-2
Calling the Simple API From More Than One NAR File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-2
Issues Running Hadoop Jobs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-3
Simple Queries using Hive UDFs Fail on Some Hadoop Distributions . . . . . . . . . . . . . . . . . . 12-3
Queries Using Hive UDFs Fail with Literal Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-4
Hive Script Changes Required When Using Hive 3.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-4
Hive Queries Fail in Kerberized Clusters When kinit Is Not Performed . . . . . . . . . . . . . . . . . 12-4
Binary Hive UDFs Fail Due to Data Being Too Large for the REST API . . . . . . . . . . . . . . . . 12-4
Failure to Copy JAR Files to the hive/lib Directory on All Data Nodes . . . . . . . . . . . . . . . . . . 12-4
MapReduce Jobs Fail with NoClassDefFoundError . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-5
Sqoop Steps (Including codegen) Fail with DB Driver Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-5
Sqoop codegen Command Fails with Streaming Result Set Error . . . . . . . . . . . . . . . . . . . . . . 12-6
Sqoop Jobs Fail with ORM Class Method Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-7
Data Protection Operations Fail with TLS Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-8
Simple API Operations for Dates Fail with VE_ERROR_GENERAL Error . . . . . . . . . . . . . . . 12-9
Simple API Operations Fail with Authentication Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-9
Simple API Operations Fail with Library Load Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-9
Simple API Operations Fail with Network Connection Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-9
Developer Templates Cannot Find the Configuration Files in HDFS . . . . . . . . . . . . . . . . . 12-10
Unable to Load libvibesimplejava.so . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-10
Remote Queries Using Hive UDFs Fail on BigInsight 4.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-10
Hadoop Job Error: “Container exited with a non-zero exit code 134” . . . . . . . . . . . . . . . . 12-11
Hadoop Job Not Failing when Invalid Auth Credentials Used for Access . . . . . . . . . . . . . 12-11
Hadoop Job Fails with REST on Older Voltage SecureData Servers . . . . . . . . . . . . . . . . . 12-11
Hadoop Job Fails with Specific REST API Error Code and Message . . . . . . . . . . . . . . . . . . 12-11
Sqoop Import Job: REST API Error UNSUPPORTED_CODEPOINT . . . . . . . . . . . . . . . . . . 12-12
Hadoop Job Tasks Fail When Voltage SecureData Server is Overloaded . . . . . . . . . . . . 12-12
Error When Using Spark (Spark1) Instead of Spark2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-12
Appendix A: Voltage SecureData Server Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . A-1
Formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-1
cc-sst-6-4 Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-2
DATE-ISO-8601 Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-3
AlphaExtendedTest1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-3
Identity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-4
Authentication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-5
1 Introduction to the Developer Templates
This introduction provides a broad description of the Voltage SecureData Developer
Templates, a brief description of each of the reference implementations they contain, and their
intended use. It also describes the organization of the remainder of this document.
What are the Developer Templates?
Originally, the Developer Templates were strictly the Hadoop Developer Templates, providing
reference implementations of integrating Voltage SecureData APIs with several different
Hadoop components (MapReduce, Hive, and Sqoop). In subsequent releases, these reference
implementations have been enhanced and more reference implementations have been added.
The additional reference implementations include Hadoop components (such as Spark and
Impala) as well as similar open source solutions that don’t necessarily involve a Hadoop cluster
(such as NiFi, Kafka, Storm, and StreamSets).
The Developer Templates include the following:
• Java 8 source code that shows how calls to the Voltage SecureData Simple API and
REST API can be integrated into:
• Hadoop jobs, including MapReduce, Hive, Impala, Sqoop, and Spark, known
collectively as the Hadoop Developer Templates.
NOTE: The Spark integration is unique in that it includes both Scala and
Python source code.
• Stream-oriented workflows that use NiFi, StreamSets, Kafka Connect, and Kafka
and Storm working together, known collectively as the DataStream Developer
Templates.
• Configuration files that provide the information required for the demonstrated protect
and access operations.
• Sample data to demonstrate the protect and access operations in the reference
integrations.
• Where appropriate, scripts to set up and run the various Hadoop and
DataStream reference integrations.
• Remote Hive query infrastructure directories and files that let you simulate a Business
Intelligence (BI) tool conducting a Hive query from a computer outside your Hadoop
cluster.
• Support files for the various Hadoop and DataStream Developer Template reference
integrations.
• This document, the Voltage SecureData Developer Templates for Hadoop Integration
Guide, which provides detailed descriptions of, and instructions for using, the Developer
Templates.
The Developer Templates provide the following reference integrations:
• MapReduce: Protects plaintext and accesses ciphertext in CSV files between a source
HDFS location and a target HDFS location.
• Hive: UDFs that access columns of ciphertext in a Hive table using a HiveQL query.
• Impala: UDFs that access columns of ciphertext in an Impala table using a SQL query.
• Sqoop: Protects plaintext as it is imported from a relational database table into HDFS;
• Spark: Uses several different Spark data representation and query technologies to
protect plaintext and access ciphertext in CSV files between a source HDFS location and
a target HDFS location. The Spark Developer Template includes reference
implementations in both Scala and Python (using the PySpark libraries).
• NiFi: Uses a NiFi processor to protect plaintext and access ciphertext flowing through it.
• Kafka-Storm: Protects plaintext read from a Kafka topic using a Storm bolt and writes
the result to a file in HDFS.
These reference integrations use the Simple API and REST API to protect and access
Personally Identifiable Information (PII) using Format-Preserving Encryption (FPE). They also
use the REST API to protect and access Payment Card Industry (PCI) data using Secure
Stateless Tokenization™ (SST).
NOTE: Although the Developer Templates use FPE and SST as described above, these
protection technologies are not strictly tied to PII and PCI in this manner.
These reference integrations have been tested on several Hadoop distributions. For more
information, see the Voltage SecureData Developer Templates for Hadoop Version 5.0 Release
Notes.
Intended Use of the Developer Templates
In some cases, and for some aspects of the provided source code, the Developer Templates
demonstrate best practices when using the Voltage SecureData APIs, providing source code
that you can re-purpose more or less as is in your production solutions. In other cases and
aspects, the Developer Templates employ simplified techniques appropriate for demonstration
code that will need to be reworked for use in production solutions.
That said, the source code provided is meant to serve only as a starting point in the
development of production solutions. You must determine how best to adapt this code for your
own uses. You will need to customize this code to perform different types of protect and access
operations, on different data types and formats, and even perhaps use different underlying
libraries or adapt it to work with different technologies, such as Hadoop’s Pig or Flume.
For example, the Developer Templates code uses the Apache HttpClient 4.x library as its REST
client library, but your production implementation can use a different client-side library or
tool. The Voltage SecureData Server provides a REST-based web service, so you can use any
client library that sends and receives the appropriate type of messages.
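To make that point concrete, the following minimal Java sketch shows the general shape of such
a call using Apache HttpClient 4.x (assuming version 4.3 or later). The host name, endpoint path,
and JSON body shown here are placeholders only and are not the actual Voltage SecureData
REST API; consult the REST API documentation for your Voltage SecureData Server for the real
URLs, request format, and authentication parameters.

    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpPost;
    import org.apache.http.entity.ContentType;
    import org.apache.http.entity.StringEntity;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;

    public class RestClientSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder host, path, and body; substitute the values required by your
            // Voltage SecureData Server and the operation you are performing.
            String url = "https://your-securedata-server.example.com/placeholder/rest/endpoint";
            String requestJson = "{ \"placeholder\": \"request body\" }";

            try (CloseableHttpClient client = HttpClients.createDefault()) {
                HttpPost post = new HttpPost(url);
                post.setEntity(new StringEntity(requestJson, ContentType.APPLICATION_JSON));

                try (CloseableHttpResponse response = client.execute(post)) {
                    // Print the HTTP status line and the raw response body.
                    System.out.println(response.getStatusLine());
                    System.out.println(EntityUtils.toString(response.getEntity()));
                }
            }
        }
    }

Any HTTP client that can exchange the same messages over TLS, in any language, could be
substituted for HttpClient without affecting the server side.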
How this Document is Organized
• Chapter 2, “Install and Build the Developer Templates” - This chapter provides
instructions for installing and building all of the Developer Templates, as well as
providing information about the version requirements for supporting software.
This chapter also provides information about other common aspects of the Developer
Templates, such as the requirements for configuration files they use, an overview of the
sample data shared by the Hadoop Developer Templates, common HDFS procedures
for the Hadoop Developer Templates, logging and error handling in the Hadoop
Developer Templates, and known limitations across all of the Developer Templates.
In general, detailed information about the Developer Templates that is not specific to an
individual template can be found in this chapter.
• Chapter 5, “Hive Integration” - This chapter provides information that is specific to the
Hive reference integration, including its architecture, its configuration, and how to run it
using the provided sample data. It also provides information about the Impala reference
integration.
• Chapter 6, “Sqoop Integration” - This chapter provides information that is specific to the
Sqoop reference integration, including its architecture, its configuration, and how to run
it using the provided sample data.
• Chapter 7, “Spark Integration” - This chapter provides information that is specific to the
Spark reference integration, including its architecture, its configuration, and how to run it
using the provided sample data.
• Chapter 8, “NiFi Integration” - This chapter provides information that is specific to the
NiFi reference integration, including its architecture, its configuration, and how to build,
deploy, and run the NiFi Developer Template using the provided ready-to-run NiFi
workflow. It also explains how to integrate the NiFi processor that demonstrates Voltage
SecureData functionality into a NiFi workflow from scratch.
• Chapter 10, “Kafka Connect Integration” - This chapter provides information that is
specific to the Kafka Connect reference integration, including its architecture, its
configuration, and how to build and run the Kafka Connect Developer Template using
the provided Kafka source and sink connector transformations.
2 Install and Build the Developer Templates
This chapter describes the required related software as well as how to install and build the
Voltage SecureData Developer Templates for Hadoop on Linux.
This section describes the other software required to use the Developer Templates.
NOTE: See the Voltage SecureData Simple API Installation Guide for instructions for
installing the Simple API.
For the Hadoop Developer Templates, install the Simple API in the same directory on all data
nodes in your Hadoop cluster.
For the Datastream Developer Templates, install the Simple API in a directory on the
computer(s) where your data stream workflow will execute.
The XML configuration file vsconfig.xml and the NiFi configuration file
vsnifi.properties use the configuration setting "Simple API Install Path" (page 3-7) to
specify the Simple API installation directory. If you use these files’ default installation
directory /opt/voltage/simpleapi as your Simple API installation directory, you can
leave this configuration setting unchanged. Otherwise, you will need to change it in one or
both of these files, depending on which Developer Template(s) you are using.
You may want to run one or more of the sample programs provided with the Simple API in
order to independently verify that your Simple API installation is working correctly. If you do,
do so as the user under which you will be running these Developer Templates. For more
information about running the Simple API sample programs, see the Voltage SecureData
Simple API Developer Guide.
In order to use the Hadoop Developer Templates for production, you must have the Voltage
SecureData Server, version 5.8.2 or later, potentially including a license to use the SST
technology.
Using version 6.0 or later of the Voltage SecureData Server and version 5.0 or later of the
Simple API is recommended. This will enable you to protect and access extended character sets
using either the REST API, available in version 6.0 or later of the Voltage SecureData Server, or
the Simple API, available in version 5.0 or later.
Java Requirements
You must have the following versions of Java and Apache Maven to build and run the Hadoop
Developer Templates:
NOTE: The NiFi Developer Template requires the use of Maven 3.1.0 or later.
• Java 8
Hadoop Requirements
To use the Hadoop Developer Templates, you must already have a Hadoop cluster configured.
This version of the Hadoop Developer Templates has been tested on several versions of the
following Hadoop distributions:
• MapR
Refer to the latest Release Notes for the list of Hadoop distribution versions with which this
version of the Hadoop Developer Templates has been tested. Also note that this version of the
Hadoop Developer Templates is likely to continue to work with earlier versions of these
Hadoop distributions, with which earlier versions of the Hadoop Developer Templates were
tested.
For more information about Amazon Web Services, see https://aws.amazon.com.
The following sections provide an overview of the steps required, but do not attempt to provide
complete instructions:
• Setting Up to Run the Hadoop Developer Templates on Amazon EMR (page 2-5)
Setting up the Amazon EMR environment involves acquiring the Amazon Web Services
(AWS) prerequisites for Amazon EMR, configuring your Amazon EMR network, and
creating your Amazon EMR cluster. The high-level steps for accomplishing this are outlined
below. For detailed instructions, refer to Amazon’s on-line documentation.
Perform the following steps to acquire the AWS prerequisites required for using
Amazon EMR:
NOTE: An Amazon S3 bucket is used to store cluster log files and output data.
Because of Hadoop requirements, S3 bucket names used with Amazon EMR
have the following constraints:
• Must contain only lowercase letters (a-z), digits (0-9), periods (.),
and hyphens (-)
If you already have an S3 bucket that meets the criteria specified above, it can
be used. Otherwise create a new S3 bucket with a name that meets these
criteria.
3. Create and download an Amazon Elastic Compute Cloud (Amazon EC2) key pair
(as a .pem file).
The first step in setting up the EMR cluster is to configure the Network Configuration
settings, which include the following steps:
Within AWS's EMR Advanced Options for cluster creation, there are four steps for
configuring your EMR cluster:
1. (Step 1: Software and Steps) Configure your EMR cluster software by choosing
Hadoop, Hive, Spark, Sqoop, Tez, and ZooKeeper, noting that version numbers
for each component may vary.
NOTE: The Spark Developer Template requires that your EMR cluster include
Spark2 (and only Spark2).
You can also add any other components that your scenario requires.
2. (Step 2: Hardware) Configure the virtual hardware for your EMR cluster by
setting the number of nodes of each type (Master, Core, and Task), and their
associated characteristics.
3. (Step 3: General Cluster Settings) Set your general EMR cluster options, such as
its name, logging and debugging characteristics, and so on.
4. (Step 4: Security) Set the security options for your EMR cluster, such as the EC2
key pair and security group created previously.
5. Click Create cluster to create your EMR cluster with the characteristics you have
configured.
AWS will provision and create your EMR cluster, moving through the Bootstrapping and
Starting states until it reaches the Waiting state, at which point the cluster's Summary
tab reflects the characteristics you have configured.
The Hardware tab will show the Master and Core nodes (and optional Task nodes) that
you have configured.
In order to run the Hadoop Developer Templates on Amazon EMR, you first need to be able
to remotely log into your Amazon EMR nodes. To do so, you will need your Amazon EC2
key pair (the third AWS prerequisite mentioned above, which was used when setting up the
security in Step 4 of creating your Amazon EMR cluster) and the public ID/DNS of the
relevant node(s):
ssh -i <location_of_keypair_pem_file> <public_id@public_DNS_name>
You created and downloaded your Amazon EC2 key pair in an earlier step. To find the
public ID/DNS of the relevant node, click on its ID in the cluster Hardware tab, as shown
above.
Some of the installation steps for the Hadoop Developer Templates require that you be
logged in as the root user. To do so on an Amazon EMR node, several additional steps are
required. After remotely logging into the relevant node, perform these steps:
Allow root login and password authentication by uncommenting the following lines
and setting values to yes:
PermitRootLogin yes
PasswordAuthentication yes
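For example, the following sketch makes these changes and restarts the SSH service; it assumes the standard sshd configuration file location (/etc/ssh/sshd_config) and a systemd-based image:

sudo sed -i 's/^#\?PermitRootLogin .*/PermitRootLogin yes/' /etc/ssh/sshd_config
sudo sed -i 's/^#\?PasswordAuthentication .*/PasswordAuthentication yes/' /etc/ssh/sshd_config
sudo systemctl restart sshd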
After completing these steps, log in as the root user on the specified Amazon EMR nodes in
order to install the prerequisites and the Hadoop Developer Templates themselves. Follow
these high-level steps:
1. Install Apache Maven on the Master node or the node where the Hadoop Developer
Templates will be installed by A) following the instructions on the Apache Maven
Web site (https://maven.apache.org/install.html), or B) using the
Amazon Machine Images (AMI) repositories.
2. On all of your Amazon EMR nodes, add a new user and set the corresponding
password:
useradd awsuser
NOTE: You can use another valid username. If you do, substitute that username
as appropriate in the steps that follow.
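For example, the password for the new user can be set as follows (a sketch; substitute your chosen username if it is different):

passwd awsuser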
3. On all of your Amazon EMR nodes, install version 8 of the Java Development Kit
(JDK8) and set the environment variable JAVA_HOME in the bash profile
(~/.bash_profile) for both the root user and for the awsuser user created above.
NOTE: JDK8 may already be installed and the environment variable JAVA_HOME
may already be set. Use the following commands to check:
java -version
echo $JAVA_HOME
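If JAVA_HOME is not already set, entries such as the following can be added to ~/.bash_profile for both users (a sketch; the JDK installation path shown is illustrative and varies by image):

export JAVA_HOME=/usr/lib/jvm/java-1.8.0
export PATH=$JAVA_HOME/bin:$PATH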
4. Install MySQL on one of the Core nodes so that the Sqoop queries will work.
6. Install the Simple API on all Amazon EMR nodes. For detailed step-by-step
instructions, see the Voltage SecureData Simple API Developer Guide.
7. Install the Hadoop Developer Templates as the user awsuser on the Master node
or one of the Core nodes (from where you intend to run the Hadoop Developer
Templates). For detailed step-by-step instructions, see "Installing the Developer
Templates" (page 2-7).
8. Build the Hadoop Developer Templates as the user awsuser on the same node
where you installed them. For detailed step-by-step instructions, see "Building the
Developer Templates" (page 2-10).
9. Create an HDFS directory for user awsuser and change its ownership:
su - hdfs
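For example, after switching to the hdfs user, the directory can be created and its ownership changed with commands such as the following (a sketch; /user/awsuser is the conventional HDFS home directory for the awsuser user):

hdfs dfs -mkdir -p /user/awsuser
hdfs dfs -chown awsuser:awsuser /user/awsuser
exit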
After completing these steps, you are ready to run the Hadoop Developer Templates
samples on Amazon EMR. For information about running a particular Developer Template,
see its corresponding Running section in the template-specific chapters of this guide. For
example:
This section provides instructions for installing the Developer Templates and includes a
description of the corresponding installation payload.
The Developer Templates are packaged in an archive file with a name of the following form:
voltage-hadoop-src-<version>-<build_id>.sh.tar
4. Read through the license agreement and enter y to accept the license.
A message displays the default location in which the Developer Templates will be
installed. This is a subdirectory that includes the version number and package name.
5. Enter Y to choose the default location, or enter n to install the Developer Templates into
the directory in which you copied the .tar package file.
<install_dir> Folder
This folder contains the Maven parent POM file (pom.xml) that controls the build process for
the two multi-module Maven projects below it:
• The Hadoop Developer Templates (MapReduce, Hive and Impala, and Sqoop).
NOTE: The Spark Developer Template, because it is written in Scala and Python, is built
using a different tool: Simple Build Tool (sbt). For more information, see "Building the Spark
Developer Template" (page 2-17).
<install_dir>/bin Folder
This directory contains numerous scripts, including HQL and SQL scripts and scripts specific to
Kerberos authentication, used to run the Hadoop Developer Template samples for MapReduce,
Hive, Impala, and Sqoop.
<install_dir>/clientsamples Folder
This directory hierarchy contains a variety of files for building and executing a Hive query using
a JDBC driver and an ODBC driver on Windows.
For more information about the contents of this directory hierarchy, see Chapter 5, “Hive
Integration”.
<install_dir>/common-crypto Folder
This directory hierarchy contains the Java source code and Javadoc package files for the
infrastructure common to the Developer Templates, including the REST client and build
support infrastructure for the inclusion of the Simple API.
This directory contains a sub-directory named simpleapi, initially empty except for a README
file. This sub-directory is where the Maven build directives and several scripts expect to find the
required Simple API JAR file, vibesimplejava.jar. You will need to copy this file here.
<install_dir>/config Folder
This directory contains the XML configuration files for the Hadoop Developer Templates:
vsconfig.xml and vsauth.xml. It also contains their corresponding schema files in the sub-
directory schema.
<install_dir>/configlocator Folder
This directory contains the Java Properties configuration file config-locator.properties,
which is used in one of the alternative schemes for specifying the location of the standard XML
configuration files vsconfig.xml and vsauth.xml.
For more information about this alternative scheme, see "Config-Locator Properties File
Packaged as a JAR File" (page 3-48).
<install_dir>/dev-templates-src Folder
This directory contains the source code for the Hadoop Developer Templates other than the
Spark Developer Template (MapReduce, Hive, Impala, and Sqoop), including the Hadoop-
specific authentication and configuration code they share.
<install_dir>/eula Folder
This directory contains the End-User License Agreement to which you are bound, in the file
Micro_Focus_EULA.txt.
<install_dir>/sampledata Folder
This directory contains the sample data text files for the Hadoop Developer Templates,
including the sample data file encoded_binary.csv, which contains Base64-encoded binary
data for demonstrating the Hive and Impala binary UDFs.
<install_dir>/spark Folder
This directory hierarchy contains the sub-directories and files specific to the Spark Developer
Template, including the file build.sbt, which specifies the Scala build settings for Spark
Developer Template, and the sub-directories bin, config, lib, project, sampledata, and src that
contain the scripts, a Spark-specific XML configuration file, information and data for building
and running the Spark Developer Template samples, and its Scala and Python source code.
<install_dir>/stream Folder
This directory contains the Maven build files (pom.xml and build_project.sh) for all of the
Datastream Developer Templates (StreamSets, NiFi, Kafka Connect, and Kafka-Storm).
NOTE: You must edit a variety of properties in the pom.xml file prior to building the
Datastream Developer Templates, such as to specify the Kafka and Storm versions being
used by your Hortonworks or Cloudera distribution.
<install_dir>/stream/kafka_connect Folder
This directory hierarchy contains the files and sub-directories specific to the Kafka Connect
Developer Template, including build files, scripts, XML configuration files, sample data, and Java
source code.
<install_dir>/stream/kafka_storm Folder
This directory hierarchy contains the files and sub-directories specific to the Kafka-Storm
Developer Template, including build files, scripts, XML configuration files, sample data, and Java
source code.
<install_dir>/stream/nifi Folder
This directory hierarchy contains the files and sub-directories specific to the NiFi Developer
Template, including build files, scripts, XML configuration files, sample data, a workflow
template, and Java source code.
<install_dir>/stream/stream_common Folder
This directory hierarchy contains the files and sub-directories shared by many of the
Datastream Developer Templates (StreamSets, Kafka Connect, and Kafka-Storm), including
build files and Java source code.
<install_dir>/stream/streamsets_processor Folder
This directory hierarchy contains the files and sub-directories specific to the StreamSets
Developer Template, including build files, scripts, XML configuration files, sample data and
pipelines, resources, and Java source code.
• Building the MapReduce, Hive, and Sqoop Developer Templates (page 2-11)
Use the Hadoop Developer Templates source code and build infrastructure to generate the
JAR files to be copied to your Hadoop environment for the MapReduce, Hive, and Sqoop
Developer Templates. The build instructions in this section are divided into the following
categories, each addressed in its own sub-section:
• Editing the Properties in the Parent Maven POM File (page 2-11)
Build Prerequisites
The following software packages must be installed and available at compile-time in order to
build the Java-based Hadoop Developer Templates:
NOTE: The NiFi Developer Template, which is built as part of the Datastream
Developer Templates, requires the use of Maven 3.1.0 or later. For more information,
see "Building the Datastream Developer Templates" (page 2-19).
Element: repo.id
This element provides a descriptive name of the remote repository, such as hortonworks
or cloudera, depending on the Hadoop distribution source you are using.
Element: repo.url
This element provides the full URL of the remote repository from which the relevant
dependency JAR files will be pulled. Standard repository URLs for Hortonworks, Cloudera,
MapR, and EMR, respectively, are as follows (where appropriate, shown on two lines to
improve readability):
• http://repo.hortonworks.com/content/groups/public
• https://repository.cloudera.com/artifactory/cloudera-repos
• https://repository.mapr.com/nexus/content/groups/mapr-public/
• https://<s3-endpoint>/<region-ID>-emr-artifacts/
<emr-release-label>/repos/maven/
Where:
<s3-endpoint> is the Amazon S3 endpoint of the region for the repository. For
example: s3.us-west-1.amazonaws.com.
An example of the full URL is the following (shown on two lines to improve
readability):
https://s3.us-west-1.amazonaws.com/
us-west-1-emr-artifacts/emr-5.30.0/repos/maven
NOTE: This EMR repository does not contain the artifacts required to
successfully build the Spark and Sqoop templates. See the Release Notes for
work-arounds.
Element: hadoop.annotations.version
This element provides the version of the Apache Hadoop Annotations library that you want
to use.
Element: hadoop.common.version
This element provides the version of the Apache Hadoop Common library that you want to
use.
Element: hadoop.mapreduce.version
This element provides the version of the Apache Hadoop MapReduce library that you want
to use.
Element: hive.exec.version
This element provides the version of the Apache Hive Exec library that you want to use.
Element: sqoop.version
This element provides the version of the Apache Sqoop library that you want to use.
Element: maven.local.dir
This element provides the name of the directory that is set as the local Maven repository
(by default, ${user.home}/.m2/repository).
NOTE: Some newer Hadoop distributions have changed such that they no longer use
the deprecated Cloudera package for their Sqoop-generated ORM classes. This Maven
build process will dynamically adapt to the use of either the older, deprecated Cloudera
Sqoop package or the newer Apache Sqoop package. For more information, see "Support
for Newer Apache Sqoop 1.x Versions" (page 2-16).
<hadoop.annotations.version>2.7.0-mapr-1808</hadoop.annotations.version>
<hadoop.common.version>2.7.0-mapr-1808</hadoop.common.version>
<hadoop.mapreduce.version>2.7.0-mapr-1808</hadoop.mapreduce.version>
<hive.exec.version>2.3.6-mapr-1912</hive.exec.version>
<sqoop.version>1.4.7-mapr-1904</sqoop.version>
<maven.local.dir>/home/testuser/.m2/repository</maven.local.dir>
Build Steps
After you have edited the properties in the root POM file, follow the steps in this section to build
the MapReduce, Hive, and Sqoop Developer Templates.
NOTE: In the following steps, <install-dir> is the directory in which you installed the
Developer Templates.
1. Ensure that Maven and javac are in your path, that JAVA_HOME is set to the
installation location of JDK 8, and that wsimport is available as part of that JDK
installation.
2. Copy the file vibesimplejava.jar from a Simple API installation to the Hadoop
Developer Templates installation:
From: <simpleapi-install-dir>/lib
To: <install-dir>/common-crypto/simpleapi
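For example, assuming the default Simple API installation directory (a sketch):

cp /opt/voltage/simpleapi/lib/vibesimplejava.jar <install-dir>/common-crypto/simpleapi/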
3. Change directory (cd) to the directory in which you installed the Developer Templates
(<install_dir>).
NOTE: Because install is set as the default goal in the pom.xml file in this
directory, no parameters are required. You could also run either of the following two
commands with the same result:
mvn install
mvn clean install (removes all relevant JAR files before building)
When the build completes successfully, the following JAR files will have been added to the bin
directory of your build location:
• voltage-hadoop-<version>.jar
All of the core classes and their main dependencies, such as the
vsrestclient.jar and its JSON and HTTP Client library dependencies, are
packaged into a single JAR file, simplifying the commands used to run the
various Hadoop jobs and queries. Such a combined JAR file is often referred to
as an uber JAR file or fat JAR file.
This main Hadoop Developer Templates uber JAR file is built with the
appropriate version number in its filename; for example, in the first version that
adopted this convention, the filename was voltage-hadoop-4.0.0.jar. The
Hadoop Developer Templates also follow the common convention of creating a
symbolic link (symlink) to this versioned filename using an unversioned symlink
name (see below). This allows the various scripts to remain the same from
version to version, picking up the latest version of the JAR file by referring to the
unversioned symlink name.
For simplicity, in the remainder of this document, the unversioned symlink name,
voltage-hadoop.jar, is used to refer to the corresponding versioned uber
JAR file.
• voltage-hadoop.jar
• voltage-hadoop-core.jar
In addition to the default goal, the Maven build also generates the Javadoc
documentation in the javadocs subdirectory of the main build location.
5. (Optional) If you want to run the Hadoop Developer Templates on another Hadoop
node, copy the following subdirectories to the location where you will run the Hadoop
jobs:
• bin
• config
• sampledata
The newest releases of Sqoop 1.x, such as Sqoop 1.4.7 included in Hortonworks Data Platform
(HDP) 3.0, no longer reference the deprecated Cloudera package or its classes. Instead, both
its core Sqoop JAR file and any ORM classes it generates use the Apache package and its
classes. Without accommodating modifications, the Sqoop Developer Template will fail to build
on Hadoop distributions with this change.
To solve this build issue without sacrificing backward compatibility, the Hadoop Developer
Templates’ Maven build infrastructure detects which Sqoop package (Cloudera or Apache) is
used by the core Sqoop JAR file in the compile-time classpath and dynamically adjusts the
import directives in the Sqoop Developer Template’s Java source files. If the old Cloudera
package is being used, import directives for its classes are used; otherwise, import directives
for the same classes in the new Apache packages are used. This approach allows the build
process to complete successfully regardless of which Sqoop package is used in a particular
Hadoop distribution.
Note that it is possible to build the Sqoop Developer Template successfully but still have
runtime issues related to incompatible Sqoop packages. If you specify the wrong Sqoop version
using the sqoop.version property element in the root-level POM.xml file, the Sqoop
Developer Template integration classes in that JAR file may not import and use the appropriate
Sqoop packages for the runtime version of Sqoop on your Hadoop cluster. If this happens,
despite the build having succeeded, the Sqoop import job can fail at runtime with one of the
following errors, depending on exactly how the compile-time and runtime Sqoop jars are
mismatched:
• If you build by including the newer Apache Sqoop JAR file at compile-time but try to use
a Cloudera (old) Sqoop JAR file at runtime (for example, sqoop-1.4.7.x.jar used to
build, but using sqoop-1.4.6.x.jar on an HDP 2.6.x cluster), you will get the
following error:
java.lang.ClassCastException: com.voltage.securedata.hadoop.sqoop.SqoopImportProtector
cannot be cast to com.cloudera.sqoop.lib.SqoopRecord
• If you build by including the older Cloudera Sqoop JAR file at compile-time but try to
use an Apache (new) Sqoop JAR file at runtime (for example, sqoop-1.4.6.x.jar
used to build, but using sqoop-1.4.7.x.jar on an HDP 3.0 cluster), you will get the
following error:
java.lang.ClassNotFoundException: com.cloudera.sqoop.lib.SqoopRecord
If you encounter either of the above exceptions when running the Sqoop import job using the
JAR file voltage-hadoop.jar, ensure that you have specified the correct Sqoop version
(using the sqoop.version property element in the root-level POM.xml file) and then rebuild
the Hadoop Developer Templates. As described above, the Maven build infrastructure will
examine the Sqoop JAR file in the local Maven repository and then modify the Sqoop
Developer Template Java source code so that the correct matching classes are imported for
their use.
Use the Spark Developer Template Scala and Python source code and build infrastructure to
generate the JAR files to be copied to your Hadoop environment for this template. The build
instructions in this section are divided into the following categories, each addressed in its own
sub-section:
Build Prerequisites
The following software must be installed in the build environment:
• Scala
• sbt
NOTE: The Spark Developer Template was developed using version 2.11.8 of Scala and
version 1.1.0 of sbt. sbt pulls in any project dependencies automatically from the
org.apache.spark Maven repository. The file <install-dir>/spark/build.sbt
specifies the Scala version to use along with the library versions that are compatible with
that version of Scala. You can change the version of Scala to a different version as long as
you also adjust the versions of the dependent libraries accordingly. The versions listed in the
build.sbt file, as delivered, are:
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "2.0.0"
libraryDependencies += "org.apache.spark" % "spark-hive_2.11" % "2.0.0"
libraryDependencies += "org.scala-lang" % "scala-reflect" % "2.11.8"
libraryDependencies += "org.scala-lang" % "scala-library" % "2.11.8"
Build Steps
After installing the Hadoop Developer Templates package by following the steps in "Installing
the Developer Templates" (page 2-7), confirming the build prerequisites, and making any
version changes in the file build.sbt, follow these steps to build the Spark Developer
Template:
1. Install the Simple API on the Hadoop nodes on which you will be running Spark. You
must install the same version of the Simple API, in the same location, on all nodes in your
Hadoop cluster.
2. Edit the configuration files. If you are initially running the Spark sample using the
dataprotection Key Server hosted by Micro Focus Data Security, you should only
need to edit the installPath attribute of the simpleAPI element in the following file:
<install-dir>/config/vsconfig.xml
Set this property to the path where you installed the Simple API on the Spark worker
nodes in the previous step.
When you begin to run the Spark sample and your own Spark jobs using your own
Voltage SecureData Server and Key Server, you will need to make additional changes to
the XML configuration files vsconfig.xml and vsauth.xml as well as to the Spark-
specific configuration file vsspark-rdd.xml.
3. Build the other Hadoop Developer Templates by following the steps in "Building the
MapReduce, Hive, and Sqoop Developer Templates" (page 2-11).
This script copies all of the necessary CryptoFactory and related JAR files to the
folder <install-dir>/spark/lib and, if necessary, copies the configuration files and
plaintext CSV file used by the Spark sample to HDFS.
The following directory will be created regardless of whether the sbt build command is
successful:
<install-dir>/spark/target
And if the sbt build command is successful, a JAR file with the following name will be placed in
a sub-directory of the target directory:
<install-dir>/spark/target/scala-<version>/
spark-<version>-SNAPSHOT.jar
This JAR file will contain the compiled Scala code to be used by the scripts that run the Spark
sample job.
NOTE: sbt downloads dependencies from remote repositories on the Internet, so your Spark
build computer must have Internet access. If you do not have Internet access from your
Hadoop cluster, you can build the Spark project on a different computer with Internet access
and then transfer the resulting project and target directories to the computer on which
you will run the Spark sample or your custom Spark job.
• <install_dir>/stream/kafka_storm
• <install_dir>/stream/nifi
• <install_dir>/stream/streamsets_processor
The build process for the Java-based Datastream Developer Templates is managed as a multi-
module Maven project. Under this model, the Project Object Model (POM) file pom.xml in the
top-level directory, which contains the Maven build directives, is defined as the parent POM of
the pom.xml files in each of these sub-directories.
Use the Datastream Developer Templates source code and build infrastructure to generate the
JAR files to be copied to your runtime environment for the StreamSets, NiFi, Kafka Connect, and
Kafka-Storm Developer Templates. The build instructions in this section are divided into the
following categories, each addressed in its own sub-section:
• Editing the Properties in the Parent Maven POM File (page 2-20)
This section provides information about the prerequisites for building the Datastream
Developer Templates, defining properties as required before initiating the Maven build, and
then using Maven to build one or more of these Developer Templates.
Build Prerequisites
The following software packages must be installed and available at compile-time in order to
build the Kafka Connect Developer Template:
NOTE: The NiFi Developer Template requires the use of Maven 3.1.0 or later.
Property: repo.id
This property provides a descriptive name of the remote repository, such as hortonworks,
depending on the distribution source you are using.
If you provide an alternative Maven repository using the repo.url property, provide a
descriptive name for that repository as the value of this property.
Property: repo.url
This property provides the full URL of the remote repository from which the relevant
dependency JAR files will be pulled. For example:
https://repo.hortonworks.com/content/groups/public
https://repository.cloudera.com/artifactory/cloudera-repos
If no value is provided for this property (its value is left as placeHolderValue), the default
Maven repository will be used. To use a different Maven repository for the Datastream
Developer Template you are building, provide the URL of the alternative repository as the
value of this property.
Property: kafka.artifact.id
This property provides the artifact ID for Kafka. Use the artifact id in the following JAR
filename in the libs directory of the Kafka installation location:
<kafka-install-location>/libs/<kafka-artifact-id>-<version>.jar
The property value is the <kafka-artifact-id> part of the filename above, which starts
with the name kafka_. For example:
kafka_2.10
Provide a value for this property only when you are building the Kafka Connect Developer
Template and/or the Kafka-Storm Developer Template. Otherwise, leave this property set
to placeHolderValue.
Property: kafka.version
This property provides the version number of the relevant Kafka dependency JAR files. Use
the version number in the following JAR filename in the libs directory of the Kafka
installation location:
<kafka-install-location>/libs/<kafka-artifact-id>-<version>.jar
The property value is the <version> part of the filename above. For example:
0.10.0.2.5.3.0-37
Provide a value for this property only when you are building the Kafka Connect Developer
Template and/or the Kafka-Storm Developer Template. Otherwise, leave this property set
to placeHolderValue.
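For example, both values can be read directly from the Kafka JAR filename (a sketch):

ls <kafka-install-location>/libs/kafka_*.jar

A filename such as kafka_2.10-0.10.0.2.5.3.0-37.jar yields the artifact ID kafka_2.10 and the version 0.10.0.2.5.3.0-37.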
Property: storm.version
This property provides the version number of the relevant Storm dependencies and JAR
files. Use the version number in the following JAR filename in the lib directory of the Storm
installation location:
<storm-install-location>/lib/storm-core-<version>.jar
The property value is the <version> part of the filename above. For example:
1.0.1.2.5.3.0-37
Provide a value for this property only when you are building the Kafka-Storm Developer
Template. Otherwise, leave this property set to placeHolderValue.
Property: storm.kafka.client.version
This property provides the version number of the KafkaSpout dependencies and JAR files,
which is usually the same as the storm.version value. If you cannot find a JAR file with a
name of the form storm-kafka-client-<version>.jar in your Storm installation, use
the same value as for the property storm.version, described above.
The property value is the <version> part of the filename above. For example:
1.0.1.2.5.3.0-37
If using the same value as for the property storm.version does not work, use the
following links to look up the correct value:
• HDP: http://repo.hortonworks.com/content/groups/public/org/apache/storm/storm-kafka-client/
• CDH: https://mvnrepository.com/artifact/org.apache.storm/storm-kafka-client
Provide a value for this property only when you are building the Kafka-Storm Developer
Template. Otherwise, leave this property set to placeHolderValue.
Property: storm.hdfs.version
This property provides the version number of the HdfsBolt dependencies and JAR files,
which is usually the same as the storm.version value. Use the version number in the
following JAR filename in the contrib (HDP) or external (CDH) directory of the Storm
installation location:
<storm-install-location>/contrib/storm-hdfs/
storm-hdfs-<version>.jar
- or -
<storm-install-location>/external/storm-hdfs/
storm-hdfs-<version>.jar
The property value is the <version> part of the filename above. For example:
1.0.1.2.5.3.0-37
Provide a value for this property only when you are building the Kafka-Storm Developer
Template. Otherwise, leave this property set to placeHolderValue.
Property: nifi.api.version
This property provides the version number of the NiFi API dependencies and JAR files. Use
the version number in the JAR filename nifi-api-<version>.jar within the NiFi
installation location. For example:
Provide a value for this property only when you are building the NiFi Developer Template.
Otherwise, leave this property set to placeHolderValue.
Property: nifi.utils.version
This property provides the version number of the NiFi Utilities dependencies and JAR files.
Use the version number in the JAR filename nifi-utils-<version>.jar within the
NiFi installation location. For example:
Provide a value for this property only when you are building the NiFi Developer Template.
Otherwise, leave this property set to placeHolderValue.
NOTE: You must make sure to specify the appropriate versions of these dependencies, from
the appropriate repository, depending on your specific environment. If you specify a
repository URL and/or set of dependency versions that do not match your environment, the
Maven build may still succeed but, for example, the generated Storm topology JAR file may
later fail to run properly. If you encounter unexpected runtime failures, make sure you built
the relevant Datastream Developer Template(s) using the correct repository and
dependencies for your environment.
Build Steps
After installing the Voltage SecureData Developer Templates for Hadoop package, confirming
that build prerequisites are met, and correctly editing the property values at the top of the
parent POM file pom.xml, build one or more of the Datastream Developer Templates by
following these steps:
NOTE: In the following steps, <install-dir> is the directory in which you installed the
Developer Templates.
1. Install the Simple API on the computer(s) on which you are running one or more of the
Datastream Developer Templates. If you are running a cluster, such as a cluster of Storm
workers, install the same version of the Simple API in the same location on all nodes in
the cluster.
2. Copy the file vibesimplejava.jar from a Simple API installation to the Developer
Templates installation on each computer on which it has been installed:
From: <simpleapi-install-dir>/lib
To: <install-dir>/common-crypto/simpleapi
4. Confirm that you have correctly edited the properties at the top of the parent Maven
build file pom.xml in this directory to specify the relevant values for your environment
and the Datastream Developer Templates you intend to build and use. For more
information, see "Editing the Properties in the Parent Maven POM File" (page 2-20).
5. Run the script build_project, which simplifies the Maven command line, once for each
Datastream Developer Template that you want to build, or run it once to build them all
at once:
NOTE: This parameter has a leading colon (:) character because it identifies
an artifactId in a POM.xml file two levels down. The parameters for
building the other Datastream Developer Templates identify modules defined
in the top-level POM.xml file.
These build commands always rebuild the Voltage SecureData code on which they all
depend, starting with the shared code in <install-dir>/common-crypto, then the
common code for the Datastream Developer Templates (<install_dir>/stream/
stream_common), and finally the code for the individual Datastream Developer
Template(s). The results, including the dependency JAR files for each Datastream
Developer Template, are packaged as shown for that Developer Template (usually as a
single uber JAR file):
Build Notes
• By default, when building the StreamSets Developer Template, several tests are run
using built-in data and the dataprotection Key Server hosted by Micro Focus Data
Security. These tests also assume that the Simple API has been installed in the directory
/opt/voltage/simpleapi. If this Key Server is not available, or if the Simple API is
installed in a different directory, the build (and test) process will fail.
To build the StreamSets Developer Template without running the associated tests,
include the extra command line parameter -DskipTests in the call to the
build_project script:
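For illustration only, such a call might look like the following, where <streamsets-module> is a placeholder for the StreamSets module identifier defined in the parent pom.xml (not a value documented here):

./build_project <streamsets-module> -DskipTests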
NOTE: Any extra parameters specified when calling the build_project script will
be passed, as is, to the Maven command line within it.
• Storm topologies are generally run using a single JAR file. Because of this, the Maven
build directives file pom.xml uses the Maven Shade Plugin to produce an uber-jar file
that contains all of the relevant dependencies packaged together. This includes the
Kafka and Storm libraries as well as all their dependencies and sub-dependencies,
transitively. For this reason, the resulting uber-jar file vs-kafka-storm-1.0.jar is
relatively large: approximately 60 MB. For reference and comparison, the classes for the
actual Kafka-Storm Developer Template integration code are originally packaged into
the JAR file original-vs-kafka-storm-1.0.jar, which at about 15 KB, is much
smaller. The required dependencies account for the large difference in these file sizes.
In contrast, the uber JAR file for the Kafka Connect Developer Template (vs-kafka-
connect-1.0.jar) is considerably smaller due to fewer downloaded dependencies
(and sub-dependencies).
• During the build process, Maven downloads dependencies from remote repositories on
the Internet. This means that the computer on which you are running the Maven build
process must have Internet access. If your target environment does not have Internet
access, you can run the Maven build on a different computer that does have Internet
access and then transfer the resulting files to your target environment.
• When the Simple API is called from multiple processors packaged in different NAR files,
JNI ClassLoader issues arise. In NiFi, each individual NAR file is loaded by a different
child ClassLoader, causing errors related to an attempt to load the Simple API more
than once.
NOTE: The Simple API for Java library uses Java Native Interface (JNI) for optimal
performance, with the core cryptographic operations written in the C language. While
this maximizes cryptographic processing speed, it requires that the native library be
loaded from a single ClassLoader, for use by all calling code running in the JVM.
The first NAR file to run will successfully load the Simple API native library, but a
subsequent processor to run, if it is in a different NAR file built as described above, will
fail with an UnsatisfiedLinkError.
The solution for this issue is to build your NAR files such that the supporting JARs,
including the Simple API JAR file vibesimplejava.jar, are not packaged within the
NAR file itself. Instead, the supporting JAR files will be copied to the bin folder along
with the less comprehensive NAR file. All of these files can then be copied to the
directory <nifi-install-location>/lib. This works because any JAR files in this
directory are loaded once by a shared ClassLoader in the main NiFi parent classpath.
To build the NiFi Developer Template NAR file without including the supporting JAR
files within it, run the build command with the parameters -P and external:
./build_project :vs-nifi-processors -P external
After this build command completes, copy the multiple NAR files (such as vs-nifi-
processors.nar) and all of the supporting JAR files (including
vibesimplejava.jar) from the NiFi Developer Template directory
<install-dir>/stream/nifi/bin to the directory <nifi-install-
location>/lib on the NiFi server. After you start (or restart) your NiFi server, the
supporting JAR files, and the Simple API native library, will be loaded once for shared
use.
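For example (a sketch; it assumes all of the NAR and supporting JAR files in the bin directory are needed):

cp <install-dir>/stream/nifi/bin/*.nar <nifi-install-location>/lib/
cp <install-dir>/stream/nifi/bin/*.jar <nifi-install-location>/lib/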
NOTE: Even if you build the NAR file with the supporting JAR files packaged
internally in the bundled-dependencies directory, you can still copy the individual
supporting JAR files to the NiFi server’s lib directory so that they will be loaded
properly for shared use by SecureData processors in different NAR files. This will
avoid the UnsatisfiedLinkError problem, but will inefficiently include redundant
JAR files in multiple NAR files, making them larger than necessary.
3 Common Infrastructure
The Voltage SecureData for Hadoop Developer Templates share common infrastructure. Some
of this sharing is across all of the Developer Templates, including those that are explicitly
related to Hadoop (MapReduce, Hive, Sqoop, and Spark), NiFi, and Kafka-Storm. Other aspects
of the shared infrastructure are designed only for the templates that run in the context of a
Hadoop cluster; for example, the code that accesses the configuration information in the XML
configuration files shared by the MapReduce, Hive, Sqoop, and Spark templates.
• Shared Sample Data for the Hadoop Developer Templates (page 3-73)
• Logging and Error Handling in the Hadoop Developer Templates (page 3-87)
The Hadoop Developer Templates support four methods of authentication, three of which also
provide authorization (to derive a cryptographic key using the specified identity):
• LDAP Plus Shared Secret - This method provides authentication using a shared secret
and authorization using LDAP (generally through LDAP group membership).
• Shared Secret - This method provides authentication using a shared secret and does
not provide user-level authorization.
Corresponding configuration on the Voltage SecureData Server for your chosen method of
authentication (and authorization, if any) is required.
NOTE: Kerberos authentication requires several special configuration steps, including steps
on the computer running the Kerberos Key Distribution Center (KDC). For more information,
see the following subsection, "Configuration Step Summary for Kerberos Authentication"
(page 3-3).
The Developer Templates that run in the context of a Hadoop cluster (MapReduce, Hive,
Sqoop, and Spark) use the following configuration settings, specified in the XML configuration
file vsauth.xml, to provide default and cryptId-specific authentication/authorization
information:
The StreamSets Developer Template uses the following configuration settings, specified in the
XML configuration file vsauth.xml, to provide default and cryptId-specific authentication/
authorization information:
The Kafka Connect Developer Template and the Kafka-Storm Developer Template use the
following configuration settings, specified in the XML configuration file vsauth.xml, to
provide default and cryptId-specific authentication/authorization information:
The NiFi Developer Template uses the following configuration settings on the Properties tab
of the Configure Processor dialog box of the NiFi processor SecureDataProcessor to
provide authentication/authorization information for that processor:
NOTE: The username cannot contain a colon character (:). If it does, authentication will
likely fail. This is because the username and password are combined when communicating
with the Voltage SecureData Key Server, using a colon character as a separator.
When using Kerberos authentication, the shared Hadoop Developer Template code will use the
calling user’s Kerberos ticket granting ticket (TGT) to request a delegation token from the
Voltage SecureData Server, storing the returned token in this location in HDFS with file
permissions set appropriately. The Hadoop job tasks (for MapReduce, Hive, Sqoop, and so on)
running on the individual data nodes in the cluster read that token from this location in HDFS
and use it to request cryptographic keys for local (Simple API) protect or access operations or
to make remote (REST API) protect or access requests to the Voltage SecureData Server.
1. On the computer running the KDC, use the kadmin.local command to add the
Kerberos service principal for the Voltage SecureData Server cluster hostname and
export the service principal to a file named kms.keytab, as follows:
kadmin.local
kadmin.local: addprinc HTTP/voltage-pp-0000.<district-domain>@<Kerberos-realm>
kadmin.local: ktadd -k kms.keytab HTTP/voltage-pp-0000.<district-domain>@<Kerberos-realm>
kadmin.local: exit
NOTE: The service principal name of the Voltage SecureData Server, used in the two
kadmin.local commands above, is not the full URL for the Voltage SecureData
Server. The service principal name must match the form used above AND must
exactly match the value entered into the Principal Name field in the System >
Kerberos page in the Voltage SecureData Management Console. For example:
HTTP/[email protected]
2. As the root user, securely copy the files kms.keytab and /etc/krb5.conf from the
KDC computer to a location from which you can use the Management Console to upload
them.
IMPORTANT: The first of these files, kms.keytab, contains sensitive information that
must be kept secure. The second file, /etc/krb5.conf, defines the Kerberos realm,
which is not sensitive.
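For example, from the KDC computer (a sketch; it assumes kms.keytab was written to the current working directory, and the destination host and path are placeholders):

scp kms.keytab /etc/krb5.conf <user>@<admin-workstation>:<upload-directory>/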
1. On the System > Kerberos page, upload the two files from the previous step to the
Management Console.
2. On the System > Kerberos page, enter the service principal name in the Principal
Name field exactly as it was specified in the kadmin.local commands above, and
then click the Save Settings button.
3. On the Key Management > Authentication page, configure one or more LDAP
Authentication Methods for the Voltage SecureData Key Server. This is what will be
used to authorize users for specific identities for key requests from the Simple API.
4. Also on the Key Management > Authentication page, check the Enable Kerberos
Authentication for Key Requests checkbox to enable Kerberos authentication for key
requests from the Simple API (after which they will be authorized using the methods
configured in the previous step).
6. Also on the Web Service > Identity Authorization page, check the Enable Kerberos
Authentication for Web Service REST Requests checkbox to enable Kerberos
authentication for REST API calls (after which they will be authorized using the rules
configured in the previous step).
7. Deploy the settings from the Management Console to the host(s) in the Voltage
SecureData Server cluster.
IMPORTANT: The Kerberos protocol uses tickets with timestamps. In order for Kerberos
authentication to function properly, the Hadoop nodes, the KDC and the Voltage SecureData
Server must have synchronized system clocks. If the clock times on these computers drift too
far apart, Kerberos authentication may fail, resulting in error messages being logged in the
Voltage SecureData Key Server debug.log file.
Micro Focus Data Security strongly recommends that you use the Network Time Protocol
(NTP) to keep the clock times synchronized on the relevant computers.
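For example, on systemd-based systems, the synchronization status can be checked on each node with a command such as the following (a sketch; tooling varies by operating system):

timedatectl status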
Configuration Settings
This section provides detailed information about the configuration settings available for the
Hadoop Developer Templates through the use of two different types of configuration files:
• The XML configuration files used by the Hadoop Developer Templates that run in the
context of a Hadoop cluster (MapReduce, Hive, Sqoop, and Spark). For more
information, see "XML Configuration Files" (page 3-32).
• The Java Properties configuration files used by the NiFi Developer Template and the
Kafka-Storm Developer Template. For more information, see "Java Properties
Configuration Files" (page 3-45).
This section provides information about the relevant configuration settings in a generic way,
while also identifying how this information is provided for the different Hadoop Developer
Templates. The configuration settings fall into two classes, as follows:
• Sometimes relevant to the XML configuration files (MapReduce, Hive, Sqoop, and
Spark), the Java Properties configuration files (NiFi and Kafka-Storm), and even to NiFi
user interface choices, the direct settings that identify Voltage SecureData resources
and that provide relevant information needed during cryptographic operations. An
example of the former is the Voltage SecureData Server that you are using. Examples of
the latter include the data protection format and the identity for cryptographic key
generation required for a particular protect or access operation.
• Relevant only to the XML configuration files (MapReduce, Hive, Sqoop, and Spark), the
XML infrastructure settings that are used in the structuring of your XML configuration
files and which are referenced elsewhere within the XML itself and sometimes as UDF
parameter values that identify groupings of configuration settings. Examples of this type
of configuration setting include the names given to sets of cryptographic settings
(known as cryptIds) and names given to sets of authentication/authorization settings
(known as authIds).
Domain Name
Use the Domain Name (direct) setting to specify the security district domain name of the
relevant Voltage SecureData Server. This is the part of the Voltage SecureData Server
hostname without the voltage-pp-0000 prefix.
This value, when specified, is used to construct three other optional configuration settings of
which the domain name is a part:
• REST Hostname:
voltage-pp-0000.<domainName>
If any of these optional configuration settings are specified individually, those settings will be
used instead of settings constructed from the domain name setting.
In the XML configuration file vsconfig.xml, this setting is specified using the domainName
attribute (and its value) of the secureDataServer element. The default version of the
configuration file vsconfig.xml provided with the Hadoop Developer Templates specifies
the following demonstration security district domain, hosted by Micro Focus:
dataprotection.voltage.com
NOTE: In the XML configuration file vsconfig.xml, you must specify either a domain name
(as described here) or a hostname (as described in the following section), but not both. If
both are specified, the hostname will be used and the domain name will be ignored.
To see this setting in the context of the XML configuration file vsconfig.xml, see “Attribute
Values in vsconfig.xml” (page 3-37).
Hostname
Use the Hostname (direct) setting to specify the full server hostname of the relevant Voltage
SecureData Server. This typically includes the standard voltage-pp-0000 prefix, required
when making requests to the Voltage SecureData Key Server.
This value, when specified, is used to construct three other optional configuration settings of
which the hostname is a part:
• REST Hostname:
<hostName>
If any of these optional configuration settings are specified individually, those settings will be
used instead of settings constructed from the hostname setting.
In the XML configuration file vsconfig.xml, this setting is specified using the hostName
attribute (and its value) of the secureDataServer element. For example, you could change
the default version of the configuration file vsconfig.xml provided with the Hadoop
Developer Templates to use this alternate approach to specify the full hostname of the
demonstration security district domain hosted by Micro Focus:
voltage-pp-0000.dataprotection.voltage.com
NOTE: In the XML configuration file vsconfig.xml, you must specify either a hostname (as
described here) or a domain name (as described in the previous section), but not both. If
both are specified, the hostname will be used and the domain name will be ignored.
To see this setting in the context of the XML configuration file vsconfig.xml, see “Attribute
Values in vsconfig.xml” (page 3-37).
In the XML configuration file vsconfig.xml, this setting is specified using the installPath
attribute (and its value) of the simpleAPI element.
In the Java Properties configuration file vsnifi.properties (NiFi), this setting is specified
using the property name simpleapi.install.path (and its value).
In both types of configuration files, the default value for this setting is:
/opt/voltage/simpleapi
If you have installed the Simple API in a different directory on the data nodes in your Hadoop
cluster, you must change this configuration setting accordingly.
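For example, one way to make this change in place is with a command such as the following (a sketch; it assumes the attribute appears exactly as shipped and that /custom/simpleapi is your installation directory):

sed -i 's#installPath="/opt/voltage/simpleapi"#installPath="/custom/simpleapi"#' <install-dir>/config/vsconfig.xml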
NOTE: Calls to the Web Service API (the REST API), at least when using the provided Web
Service clients, such as the calls to the REST API in the various Developer Templates, also
use the trustStore directory from the Simple API install location to initialize the trusted
root certificates used when connecting to the remote Web Service server. Therefore, you
must specify this attribute value even if you are not using the Simple API.
To see this setting in the context of the XML configuration file vsconfig.xml, see “Attribute
Values in vsconfig.xml” (page 3-37).
To see this setting in the context of the Java Properties configuration file
vsnifi.properties, see “vsnifi.properties” (page 3-45).
In the XML configuration file vsconfig.xml, this setting is optional and may be specified
using the policyUrl attribute (and its value) of the simpleAPI element. When this setting is
not specified, the client policy file URL is constructed using either the domainName or the
hostName attribute value of the secureDataServer element, as follows:
https://voltage-pp-0000.<domainName>/policy/clientPolicy.xml
or
https://<hostName>/policy/clientPolicy.xml
In the XML configuration file vsconfig.xml, as shipped for use with the Hadoop Developer
Templates, a default Simple API Policy URL is constructed from the specified domain name:
dataprotection.voltage.com
Use this setting if your client policy file URL cannot be constructed from the domainName or
the hostName attribute value, whichever one you supply.
In the Java Properties configuration file vsnifi.properties, this setting is specified using
the name simpleapi.policy.url (and its value). In this Java Properties configuration file, as
shipped, the value for this setting is (shown on two lines for improved readability):
https://voltage-pp-0000.dataprotection.voltage
.com/policy/clientPolicy.xml
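For example, if your client policy file were served from a host that does not follow either constructed form (the host shown below is hypothetical), the setting might be sketched as follows:

In vsconfig.xml:
    <simpleAPI installPath="/opt/voltage/simpleapi"
               policyUrl="https://policy.example.com/policy/clientPolicy.xml" />

In vsnifi.properties:
    simpleapi.policy.url=https://policy.example.com/policy/clientPolicy.xml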
To see this setting in the context of the XML configuration file vsconfig.xml, see “Attribute
Values in vsconfig.xml” (page 3-37).
To see this setting in the context of the Java Properties configuration file
vsnifi.properties, see “vsnifi.properties” (page 3-45).
• none - Do not cache downloaded items. This setting is not recommended for
production environments.
Nevertheless, using file-based caching on the local file system of each data
node may still result in fewer network interactions with the Voltage
SecureData Server. New processes launched on a given data node will be
able to use (file-based) cached cryptographic keys, which is not the case
when in-memory caching is used.
In the XML configuration file vsconfig.xml, this setting is optional and may be specified
using the cacheType attribute (and its value) of the simpleAPI element.
In the Java Properties configuration file vsnifi.properties, this setting is specified using
the name simpleapi.cache.type (and its value).
In both types of configuration files, the default value for this setting is memory (note that
although it is not required, in the provided XML configuration file vsconfig.xml, the default
value of memory is specified explicitly). For more information about the in-memory and file-
based caching modes used by the Simple API, see the Voltage SecureData Simple API
Developer Guide.
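For example, the shipped default of in-memory caching can be made explicit as follows (substituting the value none would disable caching, which is not recommended for production environments):

In vsconfig.xml:
    <simpleAPI installPath="/opt/voltage/simpleapi" cacheType="memory" />

In vsnifi.properties:
    simpleapi.cache.type=memory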
To see this setting in the context of the XML configuration file vsconfig.xml, see “Attribute
Values in vsconfig.xml” (page 3-37).
To see this setting in the context of the Java Properties configuration file
vsnifi.properties, see “vsnifi.properties” (page 3-45).
In the XML configuration file vsconfig.xml, this setting is optional and may be specified
using the fileCachePath attribute (and its value) of the simpleAPI element.
In the Java Properties configuration file vsnifi.properties, this setting is optional and
may be specified using the name simpleapi.file.cache.path (and its value).
In both types of configuration files, if this setting is not specified when file-based caching is
being used, the default location for caching is a subdirectory named cache directly subordinate
to the Simple API’s installation location (<installDir>/cache). If this setting is specified
when file-based caching is not being used, its value is ignored. For more information about this
setting’s path value when file-based caching is being used, see the Voltage SecureData Simple
API Developer Guide.
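For example, assuming file-based caching has been enabled using the Simple API Cache Type setting, a cache directory (the path shown is hypothetical) might be specified as follows:

In vsconfig.xml:
    <simpleAPI installPath="/opt/voltage/simpleapi"
               fileCachePath="/var/opt/voltage/cache" />

In vsnifi.properties:
    simpleapi.file.cache.path=/var/opt/voltage/cache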
CAUTION: If file-based caching is configured for the Simple API, the user account under
which the job is running must have permissions to write to (and create, if necessary) the
specified caching directory. Make sure to set the directory permissions appropriately, based
on the user account(s) under which your Hadoop jobs will be run. Incorrect permissions can
cause errors, such as VE_ERROR_FILE_CREATE_DIR, when the job attempts to create the
Simple API LibraryContext object.
To see this setting in the context of the XML configuration file vsconfig.xml, see “Attribute
Values in vsconfig.xml” (page 3-37).
To see this setting in the context of the Java Properties configuration file
vsnifi.properties, see “vsnifi.properties” (page 3-45).
considered the lower limit of being cryptographically secure are allowed to be protected and
accessed anyway. For more information about this behavior, see the Voltage SecureData
Simple API Developer Guide.
In the XML configuration file vsconfig.xml, this setting is optional and may be specified
using the shortFPE attribute (and its value) of the simpleAPI element.
In the Java Properties configuration file vsnifi.properties, this setting is optional and may
be specified using the name simpleapi.shortfpe (and its value).
In both types of configuration files, if this setting is not specified, its default value is false.
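For example, to explicitly allow the protection and access of short FPE values (matching the behavior of earlier releases described in the important note that follows), the setting might be sketched as follows:

In vsconfig.xml:
    <simpleAPI installPath="/opt/voltage/simpleapi" shortFPE="true" />

In vsnifi.properties:
    simpleapi.shortfpe=true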
IMPORTANT: In previous versions of the Hadoop Developer Templates, this value was not
configurable without changing the Developer Templates source code. Further, its default
value in the source code allowed the protection and access of short FPE values (equivalent
to a setting of true in version 4.1 and later). Using these different default settings on the
same data could cause protection and access operations that previously succeeded to fail
after upgrading.
To see this setting in the context of the XML configuration file vsconfig.xml, see “Attribute
Values in vsconfig.xml” (page 3-37).
To see this setting in the context of the Java Properties configuration file
vsnifi.properties, see “vsnifi.properties” (page 3-45).
In the XML configuration file vsconfig.xml, this setting is optional and may be specified
using the batchSize attribute (and its value) of the webService element. If this setting is not
specified, its default value is 2000.
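For example, a larger batch size (the value 5000 is purely illustrative) might be specified as follows:

    <webService batchSize="5000" />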
NOTE: Not all Hadoop Developer Templates support batching. And even for those
templates that do support batching, this configuration setting is currently only used by the
Sqoop Developer Template and the RDD and Dataset variants of the Spark Developer
Template. Although other Hadoop Developer Templates, such as MapReduce and NiFi, also
perform batching, their batch sizes are specified directly in the source code.
To see this setting in the context of the XML configuration file vsconfig.xml, see “Attribute
Values in vsconfig.xml” (page 3-37).
REST Hostname
Use the REST Hostname (direct) setting to specify the full hostname of the REST server that
will be used for REST API operations.
In the XML configuration file vsconfig.xml, this setting is optional and may be specified
using the restHostName attribute (and its value) of the webService element. When this
setting is not specified, the REST hostname is either constructed from the domainName
attribute value or taken directly from the hostName attribute value of the
secureDataServer element, as follows:
voltage-pp-0000.<domainName>
or
<hostName>
In the XML configuration file vsconfig.xml, as shipped for use with the Hadoop Developer
Templates, a default REST hostname is constructed from the specified domain name:
voltage-pp-0000.dataprotection.voltage.com
Use this setting if your REST hostname cannot be constructed from the domainName attribute
value or is not the same as the hostName attribute value, whichever one you supply.
In the Java Properties configuration file vsnifi.properties, this setting is required when
the REST API is used and is specified using the name rest.hostname (and its value). Its
default setting in this Java Properties file is:
voltage-pp-0000.dataprotection.voltage.com
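For example, a REST hostname that cannot be constructed from the other settings (the host shown is hypothetical) might be specified as follows:

In vsconfig.xml:
    <webService restHostName="rest-gateway.example.com" />

In vsnifi.properties:
    rest.hostname=rest-gateway.example.com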
To see this setting in the context of the XML configuration file vsconfig.xml, see “Attribute
Values in vsconfig.xml” (page 3-37).
To see this setting in the context of the Java Properties configuration file
vsnifi.properties, see “vsnifi.properties” (page 3-45).
Depending on your purposes, transparently returning the protected value (rather than an
error) in this case may still allow analytics to be performed. When this behavior is enabled and
an authentication/authorization error is detected during access, the operation returns the
protected value instead of throwing an exception. Also, the trapped and ignored exception is
logged as a warning in the Hadoop job logs for auditing/debugging purposes. This logging
allows you to see whether this behavior was triggered for the cryptographic API call(s).
When this behavior is enabled, how the Developer Template code determines when an
applicable authentication/authorization failure has occurred depends on the API being used, as
follows:
• REST API - For authentication failures, when the HTTP status code 401
(UNAUTHORIZED) is returned. For authorization failures, when the HTTP status code
403 (FORBIDDEN) is returned, except when the error code in the JSON response is
30635 (TOKENIZATION_IDENTITY_MISMATCH).
When any of these error conditions occur during access, the setting of this configuration value
is checked to determine the resulting behavior:
• When set to true, the error is trapped and logged as a warning, and the protected
value is returned as the result of the access operation.
• When set to false, the authentication/authorization error is logged and the Hadoop
job fails with an exception.
In the XML configuration file vsconfig.xml, this setting is optional and may be specified
using the returnProtectedValueOnAccessAuthFailure attribute (and its value) of the
general element.
In the Java Properties configuration file vsnifi.properties, this setting is optional and may
be specified using the name return.protected.value.on.access.auth.failure (and
its value).
In both types of configuration files, if this setting is not specified, its default value is false.
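For example, to enable the return-protected-value-on-auth-failure behavior, the setting might be sketched as follows:

In vsconfig.xml:
    <general returnProtectedValueOnAccessAuthFailure="true" />

In vsnifi.properties:
    return.protected.value.on.access.auth.failure=true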
CAUTION: Sometimes an internal server error, which can occur when an LDAP server is
unavailable, will be represented to the client as an authentication/authorization failure that is
indistinguishable from a failure to actually authenticate/authorize the provided username/
password/identity. When the return-protected-value-on-auth-failure behavior is enabled,
the protected value is returned as a successful result without any indication of the true
nature of the internal server error, even in the Hadoop job logs. You will need to determine
whether this behavior is acceptable for your expected use of the data and the likelihood of
this type of error based on your LDAP redundancy and network stability.
To see this setting in the context of the XML configuration file vsconfig.xml, see “Attribute
Values in vsconfig.xml” (page 3-37).
To see this setting in the context of the Java Properties configuration file
vsnifi.properties, see “vsnifi.properties” (page 3-45).
Product Name
Use the Product Name (direct) setting to specify a custom product name for one or more
Hadoop Developer Template client types, depending upon the context in which it is specified.
The specified custom product name, if any, or the default product name, is included in requests
to the Voltage SecureData Server when using the Simple API to request cryptographic keys
and when making the REST API requests. This information, whether the default value or
(especially) a customized value, can be useful for tracking and reporting purposes because it is
logged for these types of requests. Details for the relevant APIs are as follows:
• Simple API:
Registered with the Simple API as the Product field of the Client Identifier, and
included in cryptographic key requests.
NOTE: The Simple API has the following character restrictions on the product name:
Uppercase and lowercase letters, digits, and the following additional characters:
. ( ) { } [ ] - _
• REST API:
Concatenated with the product version with a slash character (/) between them and
included as the User-Agent field in the REST request header:
User-Agent=<product_name>/<product_version>
In the XML configuration file vsconfig.xml, this setting is optional. It may be specified at two
levels:
• Globally, for all Hadoop components, using the product attribute of the clientIds
element.
• For a specific Hadoop component, using the product attribute of a clientId element.
Any setting specified for a specific Hadoop component will override the global setting, if
present.
In the XML configuration files with names of the form vs<component>.xml, this setting is
optional and may be specified for the Hadoop component indicated by the filename using the
product attribute of the clientId element. Any setting specified in this way will override any
corresponding settings in the XML configuration file vsconfig.xml.
Its default setting at the global level in the XML configuration file vsconfig.xml, as shipped,
is the same as its default setting when no value is specified (either globally or for a specific
Hadoop component):
SecureData-Hadoop-Dev-Templates
In the Java Properties configuration file vsnifi.properties, this setting is optional and may
be specified using the name product.name (and its value). This setting is commented out in
this Java Properties file, resulting in its respective default value (NiFi Developer
Template) being used in Simple API and REST API requests.
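For example, a customized global product name (the name shown is hypothetical and uses only the allowed characters) might be sketched as follows:

In vsconfig.xml:
    <clientIds product="Acme-Claims-Pipeline" />

In vsnifi.properties:
    product.name=Acme-Claims-Pipeline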
These event logs are subsequently processed by the event aggregator on the Voltage
SecureData Server. For more information about event logging and aggregation, see the
Voltage SecureData Server product documentation.
To see this setting in the context of the XML configuration files vsconfig.xml and
vs<component>.xml, see “Attribute Values in vsconfig.xml” (page 3-37) and “Attribute
Values in vs<component>.xml” (page 3-43), respectively.
To see this setting in the context of the Java Properties configuration file
vsnifi.properties, see “vsnifi.properties” (page 3-45).
Product Version
Use the Product Version (direct) setting to specify a custom product version for one or more
Hadoop Developer Template client types, depending upon the context in which it is specified.
The specified custom product version, if any, or the default product version, is included in
requests to the Voltage SecureData Server when using the Simple API to request cryptographic
keys and when making the REST API requests. This information, whether the default value or
(especially) a customized value, can be useful for tracking and reporting purposes because it is
logged for these types of requests. Details for the relevant APIs are as follows:
• Simple API:
Registered with the Simple API as the Version field of the Client Identifier, and
included in cryptographic key requests.
NOTE: The Simple API has the following character restrictions on the product
version:
Uppercase and lowercase letters, digits, and the following additional characters:
. ( ) { } [ ] - _
• REST API:
Concatenated with the product name with a slash character (/) between them and
included as the User-Agent field in the REST request header:
User-Agent=<product_name>/<product_version>
In the XML configuration file vsconfig.xml, this setting is optional. It may be specified at two
levels:
• Globally, for all Hadoop components, using the version attribute of the clientIds
element.
• For a specific Hadoop component, using the version attribute of a clientId element.
Any setting specified for a specific Hadoop component will override the global setting, if
present.
In the XML configuration files with names of the form vs<component>.xml, this setting is
optional and may be specified for the Hadoop component indicated by the filename using the
version attribute of the clientId element. Any setting specified in this way will override any
corresponding settings in the XML configuration file vsconfig.xml.
In the Java Properties configuration file vsnifi.properties, this setting is optional and may
be specified using the name product.version (and its value). This setting is commented out
in this Java Properties file, resulting in no value being provided as the version in
Simple API and REST API requests.
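For example, a customized product version (the value shown is hypothetical) might be sketched as follows:

In vsconfig.xml:
    <clientIds product="Acme-Claims-Pipeline" version="2.3.1" />

In vsnifi.properties:
    product.version=2.3.1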
NOTE: These event logs are subsequently processed by the event aggregator on the
Voltage SecureData Server. For more information about event logging and aggregation, see
the Voltage SecureData Server product documentation.
To see this setting in the context of the XML configuration files vsconfig.xml and
vs<component>.xml, see “Attribute Values in vsconfig.xml” (page 3-37) and “Attribute
Values in vs<component>.xml” (page 3-43), respectively.
To see this setting in the context of the Java Properties configuration file
vsnifi.properties, see “vsnifi.properties” (page 3-45).
• mr - Define one or both custom client ID values for the MapReduce Developer
Template
• hive - Define one or both custom client ID values for the Hive Developer Template
• sqoop - Define one or both custom client ID values for the Sqoop Developer Template
• spark - Define one or both custom client ID values for the Spark Developer Template
If you set a custom component-level product name and/or product version for a particular
Hadoop component, the value(s) you set will override both A) any global product name and/or
product version you set using the clientIds element, and B) the default values that are
automatically provided by the Hadoop Developer Templates (SecureData-Hadoop-Dev-
Templates and <product_version>, respectively).
In the XML configuration file vsconfig.xml, this setting is required in all component-specific
client identifiers and must be specified using the component attribute (and its value) of each
clientId element. Each such clientId element must have one of the four unique names
provided above.
In the XML configuration file vsconfig.xml, as shipped, no individual client identifiers are
defined.
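If you did want to define one, a hypothetical component-specific client identifier for the Hive Developer Template, overriding any global values, might be sketched as follows:

    <clientIds>
        <clientId component="hive"
                  product="Acme-Hive-Protect"
                  version="2.3.1" />
    </clientIds>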
To see this setting in the context of the XML configuration file vsconfig.xml, see “Attribute
Values in vsconfig.xml” (page 3-37).
Identity
Use the Identity (direct) setting to specify an identity for use when deriving cryptographic
keys, including for authorization purposes where appropriate.
In the context of cryptIds, the specified identity can be used with one or more cryptIds,
depending upon whether it is specified as the default identity or a cryptId-specific identity. In
the XML configuration file vsconfig.xml, this cryptId setting is optional at the global level
and at the level of individual cryptIds, but it must be specified at one or both of the
following levels:
• For all cryptIds, using the defaultIdentity attribute of the cryptIds element.
• For an individual cryptId, using the identity attribute of a cryptId element. Any
setting specified for an individual cryptId will override the global setting, if present.
At the global level in the XML configuration file vsconfig.xml, as shipped, this configuration
setting is set as [email protected], which allows the Hadoop Developer Templates to run
successfully using the public-facing Voltage SecureData Server dataprotection, hosted by
Micro Focus.
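For example, a global default identity with a per-cryptId override (the override identity shown is hypothetical) might be sketched as follows:

    <cryptIds defaultIdentity="[email protected]">
        <cryptId name="ssn" format="ssn"
                 identity="[email protected]" />
    </cryptIds>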
In the context of the NiFi Developer Template, the identity is specified interactively on the
Properties tab of the Configure Processor dialog box for the SecureDataProcessor, and
serves as the identity for cryptographic operations performed by that NiFi processor.
In the context of the Kafka-Storm Developer Template, the identity is specified in the Java
Properties configuration file vsauth.properties and serves as a global identity, used for all
cryptographic key derivations for that template. The name portion of the name/value pair is
auth.identity and the default identity provided in this configuration file is:
auth.identity = [email protected]
To see this setting in the context of the XML configuration file vsconfig.xml, see “Attribute
Values in vsconfig.xml” (page 3-37).
To see this setting in the context of the Java Properties configuration file
vsauth.properties, see “Specifying the Location of the XML Configuration Files” (page 3-
47).
API
Use the API (direct) setting to optionally specify the Voltage SecureData API that you want to
perform one or more cryptographic operations. The choices are:
• simpleapi - Use the Simple API for the relevant cryptographic operation(s)
• rest - Use the REST API for the relevant cryptographic operation(s)
In the context of cryptIds, the specified API can be used with one or more cryptIds, depending
upon whether it is specified as the default API or at the level of individual cryptIds:
• For all cryptIds, using the defaultApi attribute of the cryptIds element.
• For an individual cryptId, using the api attribute of a cryptId element. Any setting
specified for an individual cryptId will override the global setting, if present.
If this optional setting is not specified at either level, cryptIds will default to using the Simple
API.
In the XML configuration file vsconfig.xml, as shipped, this configuration setting is set as
simpleapi at the global level.
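For example, a configuration that defaults to the Simple API but uses the REST API for a cryptId associated with an SST format (which can only be used with the REST API) might be sketched as follows:

    <cryptIds defaultIdentity="[email protected]" defaultApi="simpleapi">
        <cryptId name="cc" format="cc-sst-6-4" api="rest" />
    </cryptIds>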
In the context of the NiFi Developer Template, the API is specified interactively (using a drop-
down box) on the Properties tab of the Configure Processor dialog box for the
SecureDataProcessor, and serves as the API choice for cryptographic operations performed
by that NiFi processor.
In the context of the Kafka-Storm Developer Template, the API defaults to the Simple API, but
can be changed by specifying the REST API as an optional fourth parameter (rest) to the
Kafka-Storm script run-storm-topology.
To see this setting in the context of the XML configuration file vsconfig.xml, see “Attribute
Values in vsconfig.xml” (page 3-37).
CryptId Name
Use the CryptId Name (XML infrastructure) setting to specify a name for an individual cryptId.
The provided name will generally appear in the following contexts, thereby identifying the
cryptId to be used when protecting or accessing an item of sensitive data:
• The value of the cryptId attribute for one or more field elements in the XML
configuration file vsconfig.xml.
• The value of the cryptId attribute for one or more field elements in XML
configuration files with names of the form vs<component>.xml.
• A parameter to a UDF in the Hive Developer Template or a UDF variant of the Spark
Developer Template.
In the XML configuration files vsconfig.xml, this setting is required in all individually
specified cryptIds and must be specified using the name attribute (and its value) of each
cryptId element. Each such cryptId element must have a unique name.
In the XML configuration file vsconfig.xml, as shipped, several individual cryptIds are
defined with names that include alpha, date, cc, and ssn.
This setting is not relevant to either the NiFi Developer Template or the Kafka-Storm Developer
Template.
To see this setting in the context of the XML configuration file vsconfig.xml, see “Attribute
Values in vsconfig.xml” (page 3-37).
Format
Use the Format (direct) setting to specify the name of a data protection format to be used
when performing the relevant cryptographic operation. The format name identifies the name of
an FPE or SST format that has been centrally configured using the Voltage SecureData
Management Console.
NOTE: Only the REST API can be used in conjunction with SST formats.
In the context of cryptIds, the specified format is used to protect or access any field associated
with that cryptId. In the XML configuration file vsconfig.xml, this setting is required in all
individually specified cryptIds and must be specified using the format attribute (and its value)
of each cryptId element. As shipped, the XML configuration file vsconfig.xml specifies
cryptIds with several different formats, including Alphanumeric, DATE-ISO-8601, cc-sst-
6-4, and ssn.
NOTE: When specifying a format for a cryptId, note that the format name AES is reserved for
IBSE/AES encryption using the binary Hive UDFs (and, potentially, future AES support in
other Hadoop Developer Templates) and must not be used by your Voltage SecureData
administrator when defining formats using the Management Console. This applies across all
of the Hadoop Developer Templates, including the NiFi Developer Template and the Kafka-
Storm Developer Template.
In the context of the NiFi Developer Template, the format is specified interactively on the
Properties tab of the Configure Processor dialog box for the SecureDataProcessor, and
specifies the data protection format for cryptographic operations performed by that NiFi
processor.
In the context of the Kafka-Storm Developer Template, the format is specified as the required
third parameter to the Kafka-Storm script run-storm-topology.
To see this setting in the context of the XML configuration file vsconfig.xml, see “Attribute
Values in vsconfig.xml” (page 3-37).
In the XML configuration file vsconfig.xml, this setting is optional for individually specified
cryptIds and may be specified using the authId attribute (and its value) of each cryptId
element. If not provided, the authentication/authorization performed in association with this
cryptId will be according to the authentication method and credentials provided in the
authDefault element of the configuration file vsauth.xml.
In the XML configuration file vsconfig.xml, as shipped, several individual cryptIds are
defined without authId attributes, therefore using the default authentication/authorization
information in the configuration file vsauth.xml.
For more information about using the configuration file vsauth.xml to configure
authentication methods and credentials for the Hadoop Developer Templates, see "vsauth.xml"
(page 3-38).
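For example, a cryptId that references a named authId defined in the configuration file vsauth.xml (the name ssn-auth is a hypothetical placeholder), rather than the default authentication/authorization information, might be sketched as follows:

    <cryptId name="ssn" format="ssn" authId="ssn-auth" />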
This setting is not relevant to either the NiFi Developer Template or the Kafka-Storm Developer
Template.
To see this setting in the context of the XML configuration file vsconfig.xml, see “Attribute
Values in vsconfig.xml” (page 3-37).
Translator Class
Use the Translator Class (direct) setting to specify the name of a custom Java translator class
that you can use to translate the data before and/or after the Voltage SecureData
cryptographic processing is performed. You can write and configure your own custom
translator class to perform specific pre- and/or post-processing or you can use one of the
translator classes provided with the Hadoop Developer Templates.
In the XML configuration file vsconfig.xml, this setting is optional for individually specified
cryptIds and may be specified using the translatorClass attribute (and its value) of each
cryptId element. If not provided, no pre- and/or post-processing is performed.
In the XML configuration file vsconfig.xml, as shipped, several individual cryptIds are
defined without translatorClass attributes, therefore not performing any pre- and/or post-
processing.
This setting is not relevant to either the NiFi Developer Template or the Kafka-Storm Developer
Template.
NOTE: If you are using a version of the Simple API prior to 4.3 to protect and access date
formats, you will need to use the included translator class
LegacySimpleAPIDateTranslator for the relevant cryptIds. This is necessary because
older versions of the Simple API did not handle date formats directly, and required pre- and
post-processing using a special internal date/time syntax. By default, this version of the
Hadoop Developer Templates assumes a newer version of the Simple API and does not use
this translator class. For more information about translator classes, see "Data Translation"
(page 3-67).
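For example, assuming an older Simple API version as described in the note above, a date cryptId might reference the included translator class (shown here by its simple class name; the exact class name to specify is covered in the Data Translation section):

    <cryptId name="date" format="DATE-ISO-8601"
             translatorClass="LegacySimpleAPIDateTranslator" />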
To see this setting in the context of the XML configuration file vsconfig.xml, see “Attribute
Values in vsconfig.xml” (page 3-37).
In the XML configuration file vsconfig.xml, this setting is optional for individually specified
cryptIds and may be specified using the translatorInitData attribute (and its value) of
each cryptId element. If not provided, no initialization data will be provided to the specified
translator class, if any.
NOTE: If this value is specified, it will be passed to the init(String initData) method
in the constructed instance of the class specified using the corresponding
translatorClass attribute. This value can be used to initialize the translator instance with
additional advanced configuration settings. For more information about how you would use
it in your translator class implementations, see the definition of method init(String
initData) in the interface StringTranslator and the abstract class
BaseStringTranslator.
In the XML configuration file vsconfig.xml, as shipped, several individual cryptIds are
defined without translatorClass or translatorInitData attributes, therefore not
performing any pre- and/or post-processing.
This setting is not relevant to either the NiFi Developer Template or the Kafka-Storm Developer
Template.
To see this setting in the context of the XML configuration file vsconfig.xml, see “Attribute
Values in vsconfig.xml” (page 3-37).
CryptId Description
Use the CryptId Description (XML infrastructure) setting to specify a description of the
purpose of, or use case for, this cryptId. The description value is for informational purposes only
and has no effect on protect or access behavior.
In the XML configuration file vsconfig.xml, this setting is optional for individually specified
cryptIds and may be specified using the description attribute (and its value) of each
cryptId element.
In the XML configuration file vsconfig.xml, as shipped, several individual cryptIds are
defined without description attributes.
This setting is not relevant to either the NiFi Developer Template or the Kafka-Storm Developer
Template.
To see this setting in the context of the XML configuration file vsconfig.xml, see “Attribute
Values in vsconfig.xml” (page 3-37).
• mr - Define field mappings by CSV column index for the MapReduce Developer
Template
• sqoop - Define field mappings by database column name for the Sqoop Developer
Template
• spark - Define field mappings by index for the Resilient Distributed Dataset (RDD)
and Dataset variants of the Spark Developer Template
The Hadoop Developer Template components that use User Defined Functions (UDFs)
explicitly pass the name of the relevant cryptId as a UDF parameter, thereby eliminating the
need to supply field mappings using a fields element in either the global XML configuration
file vsconfig.xml or in a component-specific XML configuration file (such as vsspark-
rdd.xml). As of this release, such components include the Hive Developer Template and the
DataFrame, Spark SQL, and HiveUDF variants of the Spark Developer Template.
In the XML configuration file vsconfig.xml, this setting is required in all individually specified
sets of fields and must be specified using the component attribute (and its value) of each
fields element. Each such fields element must have one of the three unique names
defined above.
In the XML configuration file vsconfig.xml, as shipped, sets of fields are defined for the
MapReduce (mr) and Sqoop (sqoop) Developer Templates.
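For example, an index-based field mapping for the MapReduce Developer Template, referencing cryptIds defined elsewhere in the same file (the indexes shown are hypothetical), might be sketched as follows:

    <fieldMappings>
        <fields component="mr">
            <field index="0" cryptId="ssn" />
            <field index="3" cryptId="cc" />
        </fields>
    </fieldMappings>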
To see this setting in the context of the XML configuration file vsconfig.xml, see “Attribute
Values in vsconfig.xml” (page 3-37).
Field Index
Use the Field Index (direct) setting to specify the index of a field/column that is subject to a
protect or access operation when the corresponding Hadoop Developer Template is run.
Field index values are integers and are zero-based (0 is the index of the first column or field).
Every field element in the XML configuration file vsconfig.xml must include a field index
or field name (see "Field Name" on page 3-24), but not both.
Field indexes are used for the MapReduce Developer Template and the RDD and Dataset
variants of the Spark Developer Template.
In the XML configuration file vsconfig.xml, this setting is conditionally required in all
individually specified fields for the relevant Hadoop components and can be specified using the
index attribute (and its value) of each field element.
In the relevant component-specific XML configuration files with names of the form
vs<component>.xml, this setting is required and must be specified for the Hadoop
component indicated by the filename using the index attribute (and its value) of each field
element.
Do not specify field mappings (including field indexes) for the same Hadoop component using
both the XML configuration file vsconfig.xml and a component-specific XML configuration
file with a name of the form vs<component>.xml (if field mappings are specified in both files,
the field mappings in the component-specific XML configuration file will be used).
In the XML configuration file vsconfig.xml, as shipped, a set of fields is defined for the
MapReduce (mr) Developer Template that uses field indexes.
In the XML configuration file vsspark-rdd.xml, as shipped, a set of fields is defined for the
RDD and Dataset variants of the Spark Developer Template that uses field indexes.
To see this setting in the context of the XML configuration files vsconfig.xml and
vs<component>.xml, see “Attribute Values in vsconfig.xml” (page 3-37) and “Attribute
Values in vs<component>.xml” (page 3-43), respectively.
Field Name
Use the Field Name (direct) setting to specify the name of a database column that is subject to
protect operations during Sqoop Developer Template import processing (such as when the
component attribute of the enclosing fields element in the XML configuration file
vsconfig.xml is set to sqoop).
Every field element in the XML configuration file vsconfig.xml must include a field name
or field index (see "Field Index" on page 3-23), but not both.
In the XML configuration file vsconfig.xml, this setting is conditionally required in all
individually specified fields for the Sqoop Developer Template and can be specified using the
name attribute (and its value) of each field element.
In the component-specific XML configuration file vssqoop.xml, if used, this setting is required
and must be specified for the Sqoop Developer Template using the name attribute (and its
value) of each field element.
Do not specify field mappings (including field names) for the same Hadoop component using
both the XML configuration file vsconfig.xml and the XML configuration file vssqoop.xml
(if the same field mapping is specified in both files, the field mapping in the component-specific
XML configuration file will be used).
In the XML configuration file vsconfig.xml, as shipped, the set of fields defined for the Sqoop
Developer Template uses field names to specify the database columns subject to protect
operations during Sqoop import processing.
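For example, a name-based field mapping for the Sqoop Developer Template (the column names shown are placeholders) might be sketched as follows:

    <fields component="sqoop">
        <field name="SSN" cryptId="ssn" />
        <field name="CREDIT_CARD" cryptId="cc" />
    </fields>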
To see this setting in the context of the XML configuration files vsconfig.xml and
vs<component>.xml, see “Attribute Values in vsconfig.xml” (page 3-37) and “Attribute
Values in vs<component>.xml” (page 3-43), respectively.
In the XML configuration file vsconfig.xml, this setting is required for each field to be
cryptographically processed and can be specified using the cryptId attribute (and its value)
of each field element.
Do not specify field mappings (which include a cryptId name) for the same Hadoop component
using both the XML configuration file vsconfig.xml and a component-specific XML
configuration file.
As shipped, the XML configuration file vsconfig.xml defines the fields subject to
cryptographic processing, including the relevant cryptIds defined in this same configuration file,
for the MapReduce Developer Template and the Sqoop Developer Template.
As shipped, the XML configuration file vsspark-rdd.xml defines the fields subject to
cryptographic processing, including the relevant cryptIds defined in the XML configuration file
vsconfig.xml, for the RDD and Dataset variants of the Spark Developer Template.
To see this setting in the context of the XML configuration files vsconfig.xml and
vs<component>.xml, see “Attribute Values in vsconfig.xml” (page 3-37) and “Attribute
Values in vs<component>.xml” (page 3-43), respectively.
In the XML configuration file vsauth.xml, as shipped, this configuration setting specifies the
relative HDFS path voltage/config, which resolves to the absolute path /user/
<username>/voltage/config. The relevant user (the user running the Hadoop jobs and
the vsk* scripts) must have appropriate write permissions to create that directory, if necessary,
and to write files into that directory. Due to the sensitive nature of the Kerberos delegation
tokens stored in this directory, your security considerations may dictate that you restrict
permissions on this directory and the files it contains to as few users as possible.
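For example, the shipped relative HDFS path is specified as follows:

    <kerberos delegationTokenHdfsPath="voltage/config" />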
To see this setting in the context of the XML configuration file vsauth.xml, see “Element and
Attribute Values in vsauth.xml” (page 3-40).
Authentication/Authorization Method
Use the Authentication/Authorization Method (direct) setting to specify the method by
which authentication and authorization will be performed with the Voltage SecureData Server
during cryptographic operations. Valid values for this setting are:
• Kerberos
• SharedSecret
• UserPassword
vsconfig.xml and vsauth.xml for all Developer Templates other than NiFi
In the context of the cryptIds defined in the configuration file vsconfig.xml and authIds
defined in the configuration file vsauth.xml, use the Authentication/Authorization
Method (direct) setting to specify:
- and/or -
When Kerberos is used anywhere in your configuration, you must also specify an HDFS
directory path using the delegationTokenHdfsPath attribute of the kerberos element
in the configuration file vsauth.xml. Wherever Kerberos is specified as the value of the
authMethod attribute, you must not include the sharedSecret, username, or password
subordinate elements.
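For example, a sketch of the relevant entries in vsauth.xml when Kerberos is the default method, assuming the shipped delegation token path, might look like this:

    <kerberos delegationTokenHdfsPath="voltage/config" />
    <authDefault authMethod="Kerberos" />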
Wherever SharedSecret is specified as the value of the authMethod attribute, you must
include only the sharedSecret subordinate element (do not include the username or
password subordinate elements).
Wherever UserPassword is specified as the value of the authMethod attribute, you must
include only the username and password subordinate elements (do not include the
sharedSecret subordinate element), the values of which specify, respectively, either:
If you choose SharedSecret, you must also provide the shared secret as the value of the
SharedSecret property on the Properties tab.
If you choose UserPassword, you must also provide the username and password as the
values of the Username and Password properties, respectively, on the Properties tab.
To see this setting in the context of the XML configuration file vsauth.xml, see “Element and
Attribute Values in vsauth.xml” (page 3-40).
Shared Secret
Use the Shared Secret (direct) setting to specify, when applicable, the shared secret to be used
to authenticate cryptographic operations with the Voltage SecureData Server.
In the context of the authIds defined in the configuration file vsauth.xml, include the
sharedSecret element and its value when the authMethod attribute of the containing
authDefault and/or authId element is set to SharedSecret:
<authDefault authMethod="SharedSecret">
<sharedSecret>shared_secret</sharedSecret>
</authDefault>
and/or
<authIds>
<authId name="authId_name" authMethod="SharedSecret">
<sharedSecret>shared_secret</sharedSecret>
</authId>
</authIds>
In the context of the Kafka-Storm Developer Template’s Java Properties configuration file
vsauth.properties, include the auth.sharedSecret property when the auth.method
property is set to SharedSecret:
auth.method = SharedSecret
auth.sharedSecret = <actual shared secret>
In the context of the NiFi Developer Template, on the Properties tab of the Configure
Processor dialog box for the SecureDataProcessor, interactively set the shared secret as
the value of the SharedSecret property on the Properties tab when SharedSecret is set in the
Auth Method drop-down box.
To see this setting in the context of the XML configuration file vsauth.xml, see “Element and
Attribute Values in vsauth.xml” (page 3-40).
To see this setting in the context of the Java Properties configuration file
vsauth.properties, see “Specifying the Location of the XML Configuration Files” (page 3-
47).
Username
Use the Username (direct) setting to specify, when applicable, the username to be used to
authenticate cryptographic operations with the Voltage SecureData Server. When the
username is set to {LOCALUSER}, LDAP + Shared Secret authentication is being used and the
accompanying password must specify a valid shared secret. When the username is set to
anything else, Username and Password authentication is being used and the accompanying
password must specify the (usually LDAP) password for the specified user.
In the context of the authIds defined in the configuration file vsauth.xml, include the
username element and its value when the authMethod attribute of the containing
authDefault and/or authId element is set to UserPassword:
<authDefault authMethod="UserPassword">
<username>username_or_{LOCALUSER}</username>
<password>password_or_sharedsecret</password>
</authDefault>
and/or
<authIds>
<authId name="authId_name_here" authMethod="UserPassword">
<username>username_or_{LOCALUSER}</username>
<password>password_or_sharedsecret</password>
</authId>
</authIds>
In the context of the Kafka-Storm Developer Template’s Java Properties configuration file
vsauth.properties, include the auth.username property when the auth.method
property is set to UserPassword:
auth.method = UserPassword
auth.username = <username_or_{LOCALUSER}>
auth.password = <password_or_sharedsecret>
In the context of the NiFi Developer Template, on the Properties tab of the Configure
Processor dialog box for the SecureDataProcessor, interactively set the username or
{LOCALUSER} as the value of the Username property on the Properties tab when
UserPassword is set in the Auth Method drop-down box.
To see this setting in the context of the XML configuration file vsauth.xml, see “Element and
Attribute Values in vsauth.xml” (page 3-40).
To see this setting in the context of the Java Properties configuration file
vsauth.properties, see “Specifying the Location of the XML Configuration Files” (page 3-
47).
Password
Use the Password (direct) setting to specify, when applicable, the password or shared secret to
be used to authenticate cryptographic operations with the Voltage SecureData Server. When
the username is set to {LOCALUSER}, LDAP + Shared Secret authentication is being used and
this password setting must specify a valid shared secret. When the username is set to anything
else, Username and Password authentication is being used and this password setting must
specify the (usually LDAP) password for the specified user.
In the context of the authIds defined in the configuration file vsauth.xml, include the
password element and its value when the authMethod attribute of the containing
authDefault and/or authId element is set to UserPassword:
<authDefault authMethod="UserPassword">
<username>username_or_{LOCALUSER}</username>
<password>password_or_sharedsecret</password>
</authDefault>
and/or
<authIds>
<authId name="authId_name_here" authMethod="UserPassword">
<username>username_or_{LOCALUSER}</username>
<password>password_or_sharedsecret</password>
</authId>
</authIds>
In the context of the Kafka-Storm Developer Template’s Java Properties configuration file
vsauth.properties, include the auth.password property when the auth.method
property is set to UserPassword:
auth.method = UserPassword
auth.username = <username_or_{LOCALUSER}>
auth.password = <password_or_sharedsecret>
In the context of the NiFi Developer Template, on the Properties tab of the Configure
Processor dialog box for the SecureDataProcessor, interactively set the password or
shared secret as the value of the Password property on the Properties tab when
UserPassword is set in the Auth Method drop-down box.
To see this setting in the context of the XML configuration file vsauth.xml, see “Element and
Attribute Values in vsauth.xml” (page 3-40).
To see this setting in the context of the Java Properties configuration file
vsauth.properties, see “Specifying the Location of the XML Configuration Files” (page 3-
47).
AuthId Name
Use the required AuthId Name (XML infrastructure) setting to specify a name for an authId, as
specified using the authId elements defined within the authIds element. Specify a unique
name for each authId element using its required name attribute (and its value), thereby
creating a named authentication/authorization method that can be referenced within a cryptId
specification using the authId attribute of the cryptId element. This mechanism is used to
optionally associate an individually defined and named authentication/authorization method
and its associated credentials with one or more cryptIds.
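For example, a named authId (the name ssn-auth and the credentials are placeholders) and a cryptId in the configuration file vsconfig.xml that references it might be sketched as follows:

In vsauth.xml:
    <authIds>
        <authId name="ssn-auth" authMethod="SharedSecret">
            <sharedSecret>shared_secret</sharedSecret>
        </authId>
    </authIds>

In vsconfig.xml:
    <cryptId name="ssn" format="ssn" authId="ssn-auth" />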
To see this setting in the context of the XML configuration file vsauth.xml, see “Element and
Attribute Values in vsauth.xml” (page 3-40).
Beginning with version 4.1, the Voltage SecureData for Hadoop Developer Templates that run
in the context of a Hadoop cluster (MapReduce, Hive, Sqoop, and Spark) use the following XML
configuration files instead of the Java Properties configuration files used in previous releases:
• vsconfig.xml
• vsauth.xml
• vs<component>.xml
The information in these configuration files corresponds to the information provided in the
corresponding Java Properties files (vsconfig.properties, vsauth.properties, and
vs<component>.properties) used by these Hadoop Developer Templates in previous
releases, with several important differences, summarized here:
• Information used by specific types of protect and access operations, such as a data
protection format, an identity for cryptographic key generation, and so on, are now
grouped together with an identifying name as a cryptId. This is similar to the alias
grouping concept used in previous versions of the Hive Developer Template and UDF
variants of the Spark Developer Template. CryptIds provide a mechanism for grouping
this type of information, required for all protect and access operations, for use by all of
the relevant Hadoop components.
• Fields subject to protect and access operations are identified using field mappings,
either by:
• Index, such as for a CSV file in the MapReduce Developer Template, or the non-
UDF variants of the Spark Developer Template, or by:
• Name, such as for a database column name in the Sqoop Developer Template.
A field mapping associates the index or name with a corresponding cryptId, which
provides the information necessary to protect or access the type of data in that field.
The XML configuration files used by the Developer Templates are always validated using their
corresponding XSD schema files: vsconfig.xsd, vsauth.xsd, and vscomponent.xsd. The
schema files are usually located in a sub-directory config/schema within the directory for a
particular Developer Template. For example, the XSD schema files for the Kafka Connect
Developer Template are located in the following directory:
<install_dir>/stream/kafka_connect/config/schema
The remainder of this section provides detailed information about these three types of XML
configuration files, in separate sub-sections:
vsconfig.xml
The XML configuration file vsconfig.xml is capable of providing all of the non-
authentication/authorization configuration information for the Hadoop Developer Templates
that run in the context of a Hadoop cluster (MapReduce, Hive, Sqoop, and Spark). In addition to
providing global and component-specific configuration information needed by these templates,
it also introduces the concept of a cryptId (pronounced “cryp-tid”), which encapsulates the
information required to perform protect and access operations on a particular field of data.
Finally, this XML configuration file captures the information needed to associate data fields for
each of the different types of templates with a corresponding cryptId, providing information
about the format of that data, the identity to use when retrieving a cryptographic key for
protecting or accessing that data, the Voltage SecureData API to use, and so on.
The following two sub-sections, High-Level Elements in vsconfig.xml and Attribute Values in
vsconfig.xml, provide detailed information about the XML structure used in this configuration
file and how the configuration information is provided as attribute values, respectively.
NOTE: The schema definition for the configuration file vsconfig.xml is in the following file:
<install_dir>/config/schema/vsconfig.xsd
<vs:configuration schemaVersion="1"
    xmlns:vs="https://www.voltage.com/sd/config"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <secureDataServer attributes only />
    <simpleAPI attributes only />
    <webService attributes only />
    <general attributes only />
    <clientIds optional attributes >
        <clientId attributes only />
    </clientIds>
    <cryptIds optional attributes >
        <cryptId attributes only />
    </cryptIds>
    <fieldMappings>
        <fields one required attribute>
            <field attributes only />
        </fields>
    </fieldMappings>
</vs:configuration>
The remainder of this section provides a summary of the configuration information in each of
the elements of the high-level XML structure of the configuration file vsconfig.xml.
secureDataServer Element
Use the secureDataServer element to specify the relevant Voltage SecureData Server
using one or the other (but not both) of its two attributes: domainName or hostName.
NOTE: If both domainName and hostName are specified, the hostname will be used and
the domain name will be ignored.
If you provide all three of the following configuration settings, which are otherwise built
using the attribute setting(s) of this element, this element is not needed and may be left out
of the XML configuration file vsconfig.xml:
• Simple API Policy URL (page 3-8) - The value of the policyUrl attribute of the
simpleAPI element.
• REST Hostname (page 3-11) - The value of the restHostName attribute of the
webService element.
simpleAPI Element
Use the simpleAPI element to provide configuration information related to the use of the
Simple API: the required attribute installPath and several optional attributes for
controlling Simple API behavior: policyUrl, cacheType, fileCachePath, and
shortFPE.
NOTE: Because the Hadoop Developer Templates use a CryptoFactory class (and
other associated classes) to cache the Simple API LibraryContext and Crypto
instances for re-use, any changes to settings that affect the LibraryContext instance,
such as the install path and the cache and short FPE settings, require that the
CryptoFactory instance be reinitialized.
In the case of Hadoop jobs that launch a new JVM every time they run, such as
MapReduce, Sqoop, and Spark, no additional steps are needed: starting those jobs will
create a new CryptoFactory instance using the new settings in the configuration files
(XML or Java Properties). However, for other Hadoop Developer Templates that use long-
running services that have already initialized the CryptoFactory instance (and thus
the underlying LibraryContext instance), those services will need to be restarted if
you change these Simple API settings in the configuration. Specifically, this includes the
following integrations/jobs:
webService Element
Use the webService element to provide optional configuration information related to the
use of the REST API: the attribute batchSize and the REST API hostname attribute
restHostName.
general Element
Use the general element to provide optional configuration information that applies to all
of the APIs, which is presently the single attribute that controls behavior when an access
operation fails its authentication/authorization step:
returnProtectedValueOnAccessAuthFailure
Use the clientIds and clientId elements to provide optional configuration information
for sending a customized product name and version in requests to the Voltage SecureData
Server when using the Simple API to request cryptographic keys and when making the
REST API requests.
<clientIds product="global product description"
version="global product version">
<clientId component="component designator"
product="component-specific product description"
version="component-specific product version" />
<clientId ... />
</clientIds>
Use the cryptIds and cryptId elements to provide named sets of format/identity/
authentication information, for reference (either directly or through field mappings) when
performing cryptographic processing on specific data values.
The enclosing cryptIds element allows for the definition of default values for the identity
and the API, while the subordinate cryptId elements allow those choices to be overridden,
as well as allowing for the definition of an identifying name for the cryptId, the associated
data protection format, a reference to the associated authentication information, the name
and initialization data of the associated translator Java class, if any, and an optional
informational description.
<cryptIds defaultIdentity="default identity for key derivation"
defaultApi="default API to use">
<cryptId name="cryptId name"
format="data protection format"
identity="identity for key derivation"
authId="authId name"
api="API to use"
translatorClass="translator class name"
translatorInitData="translator initialization data"
description="informational description" />
<cryptId ... />
</cryptIds>
Use the fieldMappings, fields, and field elements to provide information about
which fields/columns are subject to cryptographic operations for a subset of the Hadoop
Developer Template components (MapReduce, Sqoop, and the non-UDF variants of Spark).
The enclosing fieldMappings element contains one or more fields elements, the
component attribute(s) of which identify the relevant Hadoop Developer Template
component for which the enclosed field elements define the fields/columns subject to
cryptographic processing.
Each fields element contains one or more field elements, each of which identifies a
field/column, either by index (using its index attribute and value) or by name (using its
name attribute and value). In both cases, the corresponding cryptId, which provides
information about how to protect and access the specified field/column, is identified using
the cryptId attribute and its value (which maps to the value of the name attribute of the
relevant cryptId).
<fieldMappings>
<fields component="component designator">
<field index="field index, when appropriate"
name="field name, when appropriate"
cryptId="cryptId name" />
<field ... />
</fields>
</fieldMappings>
<cryptIds defaultIdentity="Identity"
          defaultApi="API">
   <cryptId name="CryptId Name"
            format="Format"
            identity="Identity"
            authId="CryptId AuthId Name"
            api="API"
            translatorClass="Translator Class"
            translatorInitData="Translator Initialization Data"
            description="CryptId Description" />
   <cryptId ... />
</cryptIds>
<fieldMappings>
<fields component="Component-Specific Designator for Fields">
<field index="Field Index"
or
name="Field Name"
cryptId="CryptId Name for Field" />
<field ... />
</fields>
<fields ... />
</fieldMappings>
</vs:configuration>
vsauth.xml
The configuration file vsauth.xml provides the authentication/authorization configuration
information for the Hadoop Developer Templates that run in the context of a Hadoop cluster
(MapReduce, Hive, Sqoop, and Spark). It can provide individually named sets of authentication/
authorization information (the method and its associated credentials) that can be referenced
by cryptIds in the XML configuration file vsconfig.xml as well as a default set of
authentication/authorization information for use by cryptIds that do not reference an
individually named set. Finally, for when Kerberos authentication/authorization is used, it
provides a way to specify the HDFS directory where Kerberos delegation tokens will be stored.
The following two sub-sections, High-Level Elements in vsauth.xml and Element and Attribute
Values in vsauth.xml, provide detailed information about the XML structure used in this
configuration file and how the configuration information is provided as element and attribute
values, respectively.
NOTE: The schema definition for the configuration file vsauth.xml is in the following file:
<install_dir>/config/schema/vsauth.xsd
<kerberos delegationTokenHdfsPath="HDFS path" />
The remainder of this section provides a summary of the configuration information in each of
the elements of the high-level XML structure of the configuration file vsauth.xml.
kerberos Element
Use the kerberos element to specify the HDFS directory in which Kerberos delegation
tokens issued by the Voltage SecureData Server for Kerberos-authenticated users will be
stored. This element is required whenever any of the authMethod attributes (default or
otherwise) is set to Kerberos; otherwise it is ignored.
Use the authDefault element and its credential subordinate elements to provide a default
authentication/authorization method to be used when a particular cryptId does not include
an authId attribute. Depending on the chosen authentication/authorization method, there
will either be zero (Kerberos), one (SharedSecret), or two (UserPassword)
subordinate elements expected.
<authDefault authMethod="authentication/authorization method">
   No subordinate elements
   or
   <sharedSecret>shared secret</sharedSecret>
   or
   <username>username</username>
   <password>password</password>
</authDefault>
Use the authIds element and its subordinate (one or more) authId elements (and their
credential subordinate elements, when applicable) to provide a set of named
authentication/authorization method/credential pairings. Each authId element defines a
name attribute, an authMethod attribute, and depending on the value of the latter
attribute, zero (Kerberos), one (SharedSecret), or two (UserPassword) subordinate
elements. The value of the name attribute may be provided as the value of the authId
attribute of one or more cryptId elements in the configuration file vsconfig.xml.
<authIds>
<authId name="authId name"
authMethod="authentication/authorization method">
No subordinate elements
or
<sharedSecret>shared secret</sharedSecret>
or
<username>username</username>
<password>password</password>
</authId>
<authId ... />
</authIds>
</authDefault>
<authIds>
<authId name="AuthId Name"
authMethod="Authentication/Authorization Method"/>
No subordinate elements
or
<sharedSecret>Shared Secret</sharedSecret>
or
<username>Username</username>
<password>Password</password>
</authId>
<authId ... />
</authIds>
</vs:authentication>
vs<component>.xml
The set of XML configuration files with names of the form vs<component>.xml provide an
alternative mechanism for providing certain types of component-specific configuration
information for the Hadoop Developer Templates that run in the context of a Hadoop cluster
(MapReduce, Hive, Sqoop, and Spark). Valid values for <component> are the following:
• hive - Provide the clientId element for the Hive Developer Template in the
configuration file vshive.xml. Note that because the relevant field and
cryptId are specified as UDF parameters, the fields element is not
relevant to the Hive Developer Template and should not be specified in
the configuration file vshive.xml.
• sqoop - Provide the clientId and/or fields elements for the Sqoop
Developer Template in the configuration file vssqoop.xml.
• spark-rdd - Provide the clientId and/or fields elements for the RDD and
Dataset variants of the Spark Developer Template in the configuration
file vsspark-rdd.xml.
• spark-udf - Provide the clientId element for the UDF variants of the Spark
Developer Template in the configuration file vsspark-udf.xml. Note
that because the relevant field and cryptId are specified as UDF
parameters, the fields element is not relevant to the UDF variants of
the Spark Developer Template and should not be specified in the
configuration file vsspark-udf.xml.
Use this alternative mechanism when you prefer to provide component-specific settings in the
smaller component-specific XML configuration files instead of providing them together in the
larger shared XML configuration file vsconfig.xml. If you do provide one or more
component-specific XML configuration files for relevant Hadoop components, do not also
provide clientId and/or fields elements for those components in the shared XML
configuration file vsconfig.xml. For example, if you provide a fields element in the
component-specific XML configuration file vsmr.xml, do not provide a fields element in the
shared XML configuration file with its component attribute set to mr. If you do so by mistake,
the settings in component-specific XML configuration files will override the same settings in the
shared XML configuration file vsconfig.xml.
NOTE: The schema definition for XML configuration files with names of the form
vs<component>.xml is in the following file:
<install_dir>/config/schema/vscomponent.xsd
<clientId attributes only />
<fields>
   <field attributes only />
</fields>
</vs:componentConfiguration>
The remainder of this section provides a summary of the configuration information in each of
the elements of the high-level XML structure of configuration files with names of the form
vs<component>.xml.
clientId Element
Use the clientId element to provide optional configuration information for sending a
customized product name and/or version in requests to the Voltage SecureData Server
when using the Simple API to request cryptographic keys and when making the REST API
requests.
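As an illustrative sketch only, assuming the same product and version attributes used by the clientId elements in the shared configuration file vsconfig.xml (the component is implied by the component-specific file itself), the element might look as follows:

<clientId product="component-specific product description"
          version="component-specific product version" />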
Use the fields and field elements to provide optional configuration information about
which fields/columns are subject to cryptographic operations for a subset of the relevant
Hadoop Developer Template components (MapReduce, Sqoop, and the non-UDF variants
of Spark).
The relevant component-specific XML configuration files may contain a single fields
element that defines the fields/columns subject to cryptographic processing for the
component associated with the component-specific XML configuration file in which it
appears.
The fields element contains one or more field elements, each of which identifies a field/
column, either by index (using its index attribute and value) or by name (using its name
attribute and value). In both cases, a corresponding cryptId in the shared configuration file
vsconfig.xml, which provides information about how to protect and access the specified
field/column, is identified using the cryptId attribute and its value (which maps to the
value of the name attribute of the relevant cryptId).
<fields>
<field index="field index, when appropriate"
or
name="field name, when appropriate"
cryptId="cryptId name" />
<field ... />
</fields>
As delivered, the component-specific approach for XML configuration files is illustrated for the
RDD and Dataset variants of the Spark Developer Template in the XML configuration file
vsspark-rdd.xml. This file defines fields 7, 8, 9, and 10 as subject to cryptographic
processing using the cryptIds alpha, date, cc, and ssn, respectively (defined in the shared
configuration file vsconfig.xml):
<fields>
<field index = "7" cryptId = "alpha"/>
<field index = "8" cryptId = "date"/>
<field index = "9" cryptId = "cc"/>
<field index = "10" cryptId = "ssn"/>
</fields>
NOTE: In previous releases of the Hadoop Developer Templates, there were two different
component-specific Java Properties configuration files for the Spark Developer Template:
vsspark-rdd.properties and vsspark-udf.properties. The latter of these was
used to define aliases (previously, aliases served the same role that cryptIds do now: named
bundles of format/identity/auth information.) UDF calls in both the Hive Developer
Template and the UDF variants of the Spark Developer Template now specify cryptId names
as a UDF parameter instead of specifying a component-specific alias name. This direct
specification of the cryptId (always configured in the shared configuration file
vsconfig.xml for use by all Hadoop components) as a UDF parameter eliminates the
need for ever specifying fields for these UDF-based components, either in the shared
configuration file vsconfig.xml or in the relevant component-specific configuration files
(vshive.xml or vsspark-udf.xml).
Java Properties Configuration Files
The NiFi Developer Template uses a single Java Properties configuration file:
NOTE: The information required for protect and access operations, including cryptographic
settings such as the data protection format and the identity, as well as the authentication
method and credentials, is entered in the NiFi user interface for the
SecureDataProcessor.
The Java Properties configuration file used by the NiFi Developer Template must conform to
the following requirements:
• Any parameter name and value pairs spanning multiple lines must use the backslash
character (\) to indicate line continuation.
• Lines beginning with a hash character (#) are treated as comments and not processed.
• The first parameter value in most of the configuration files provides a version number
for the configuration file. For example, the first parameter in the configuration file
vsauth.properties is auth.config.version, with its value set to 1:
auth.config.version = 1
For more information about configuring the NiFi Developer Template, see "Processor Classes
for the NiFi Developer Template" (page 8-8) and "Configuring the Properties of the NiFi
SecureDataProcessor" (page 8-12).
The remainder of this section provides detailed information about this Java Properties
configuration file.
vsnifi.properties
The Java Properties configuration file vsnifi.properties, used by the NiFi Developer
Template, contains the same general configuration values as the XML configuration file
vsconfig.xml used by the other Developer Templates. It contains extensive comments that
explain each value. Without those comments, the available configuration values are shown
below, with blue links to the generic explanation of each value in the section "Configuration
Settings" (page 3-5).
config.version = 1
simpleapi.policy.url = Simple API Policy URL
simpleapi.install.path = Simple API Install Path
simpleapi.cache.type = Simple API Cache Type
simpleapi.file.cache.path = Simple API File Cache Path
simpleapi.shortfpe = Simple API Short FPE Behavior
rest.hostname = REST Hostname
product.name = Product Name
product.version = Product Version
return.protected.value.on.access.auth.failure =
Authentication/Authorization Failure on Access Behavior
The remaining configuration values, related to protect or access operations, such as identifying
the cryptographic operation as either protect or access, the data protection format, the identity,
and so on, and including authentication/authorization information, are provided for the NiFi
Developer Template on the Properties tab of the Configure Processor dialog box for the
relevant SecureDataProcessor. For more information, see "Configuration Settings for the NiFi
Developer Template" (page 8-10).
NOTE: Do not change the value of the config.version property. Leave it set to 1.
Specifying the Location of the XML Configuration Files
The Hadoop Developer Templates support three methods for using the XML configuration
files vsconfig.xml, vsauth.xml, and Hadoop-component-specific configuration files with
names of the form vs<component>.xml:
• For any of the Hadoop components (MapReduce, Hive, Sqoop, and Spark), modify these
XML configuration files as necessary in the local file system directory <install_dir>/
config and then use the script copy-sample-data-to-hdfs (and possibly the
script run-spark-prepare-job) to copy those files to the HDFS directory /user/
<user>/voltage/config.
• Alternatively, specify one or more XML configuration file locations using well-known
property names with the -D generic option for Hadoop commands. This alternative
method is recommended only for the yarn command used to start the MapReduce
Developer Template. For more information, see "-D Generic Option to Specify a Property
Value" (page 3-47).
• Alternatively, specify one or more configuration file locations using well-known property
names in the Java Properties file config-locator.properties, packaged into the
JAR file vsconfig.jar. This alternative method can be used for any of the Hadoop
components. For more information, see "Config-Locator Properties File Packaged as a
JAR File" (page 3-48).
CAUTION: One of the important steps that occurs in the script copy-sample-data-to-
hdfs is the setting of permissions for the primary Hadoop Developer Template XML
configuration files after they are copied to HDFS. The permissions are set such that only the
relevant Hadoop user can read these files. Because it contains sensitive credentials as
plaintext, this is particularly important for the authentication/authorization configuration file,
normally named vsauth.xml. If you choose to use XML configuration files at a different
location, you must take similar measures to ensure that only authorized users can read them.
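For example, if you copy the configuration files to a different HDFS location yourself instead of using the script, commands along the following lines (the target path is illustrative, matching the example path used later in this section) copy the files and restrict read access to the owning user:

hdfs dfs -mkdir -p /apps/mapred/voltage/config
hdfs dfs -put vsconfig.xml vsauth.xml /apps/mapred/voltage/config/
hdfs dfs -chmod 600 /apps/mapred/voltage/config/vsauth.xml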
-DVOLTAGE_CONFIG_FILE=/apps/mapred/voltage/config/vsconfig.xml
The code in class HDFSConfigLoader will look for the following three names, case-sensitive,
each associated with one of the three primary configuration files, shown here with their default
names:
Expected Name            Value Specifies:

VOLTAGE_CONFIG_FILE      Full path to the general XML configuration file.
                         Normally: vsconfig.xml

VOLTAGE_AUTH_FILE        Full path to the authentication/authorization XML
                         configuration file.
                         Normally: vsauth.xml

VOLTAGE_COMP_FILE        Full path to the component-specific XML
                         configuration file, if any.
                         Normally: vs<component>.xml
To use this method of specifying alternate HDFS locations for the configuration files used for
the MapReduce Developer Template, provide two extra -D parameters to the yarn command
in script files such as run-mr-protect-job and run-mr-access-job. For example:
yarn jar ... \
-DVOLTAGE_CONFIG_FILE=/apps/mapred/voltage/config/vsconfig.xml \
-DVOLTAGE_AUTH_FILE=/apps/mapred/voltage/config/vsauth.xml \
-libjars ...
For information about the precedence when checking for various methods of specifying
configuration file locations, see "Precedence When Checking for XML Configuration File
Locations" (page 3-50).
The configuration loader code in class HDFSConfigLoader automatically looks for this
optional properties file in the job classpath, and if found, reads the alternate configuration file
locations from that file.
3. Set values for one or more of the following property names in order to specify an
alternate location for the corresponding primary Hadoop Developer Template XML
configuration file:
Expected Property Name     Property Value Specifies:

config.hdfs.location       Full path to the general XML configuration file.
                           Normally: vsconfig.xml

auth.hdfs.location         Full path to the authentication/authorization
                           XML configuration file.
                           Normally: vsauth.xml

comp.hdfs.location         Full path to the component-specific XML
                           configuration file, if any.
                           Normally: vs<component>.xml
4. Save your changes and close the Java Properties file config-locator.properties.
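As an illustration only, using the property names from the table above and HDFS paths patterned on the example path used earlier in this section, a completed config-locator.properties file might look as follows:

# Alternate HDFS locations for the primary XML configuration files
config.hdfs.location = /apps/mapred/voltage/config/vsconfig.xml
auth.hdfs.location = /apps/mapred/voltage/config/vsauth.xml
# Uncomment if a component-specific configuration file is also used:
# comp.hdfs.location = /apps/mapred/voltage/config/vsspark-rdd.xml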
This Maven POM file builds the JAR file vsconfig.jar for reference by the Hadoop
jobs in the directory <install_dir>/configlocator/target. It also copies this
JAR file to the directories <install_dir>/bin and <install_dir>/spark/lib.
7. Reference the JAR file vsconfig.jar when running a Hadoop job or when defining a
UDF. Specifically, depending on the type of the Hadoop job, either update the
-libjars line to include this JAR file or add this JAR file to the using line, as follows:
MapReduce:
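The exact yarn command depends on your script; as a sketch patterned on the yarn example shown earlier in this chapter, the change amounts to appending vsconfig.jar to the -libjars list:

yarn jar ... \
    -libjars ...,vsconfig.jar \
    ...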
Sqoop:
sqoop import \
-libjars ...,vsconfig.jar \
--username $DATABASE_USERNAME \
-P \
--connect jdbc:mysql://$DATABASE_HOST/$DATABASE_NAME \
--table $TABLE_NAME \
--jar-file voltage-hadoop.jar \
--class-name com.voltage.securedata.hadoop.sqoop.Sqoo... \
--target-dir voltage/protected-sqoop-import
Hive:
You will need to copy the JAR file vsconfig.jar to the specified location in HDFS.
You should also include the JAR file vsconfig.jar with the other two JAR files
(vibesimplejava.jar and voltage-hadoop.jar) that you must manually copy
into the appropriate hive/lib directory (the parent classpath for the Hive service) on
all data nodes of your Hadoop cluster. For more information, see “Failure to Copy JAR
Files to the hive/lib Directory on All Data Nodes” (page 12-4).
1. The HDFS location specified in the Hadoop job configuration as one of the following
variables:
2. The HDFS location specified as a -D system property with one of the following names:
3. The HDFS location specified as an environment variable with one of the following
names:
4. The HDFS location specified as a property in the Java Properties file config-
locator.properties, packaged in the JAR file vsconfig.jar, with one of the
following names:
Other Approaches to Providing Configuration Settings
This approach is similar to the approach taken in the samples, but using a local file
system rather than HDFS. The disadvantage is that you will likely need a configuration
management tool to manage the copying of this file to all of the data nodes.
You will need a way to distribute the required authentication/authorization credentials to the
data nodes running jobs that use SecureData operations. You might decide to use some of the
examples described in this section, or you might decide to use a completely different approach,
depending on your specific integration use-case and Hadoop environment.
For example, even if you use a configuration file approach, this does not necessarily have to be
formatted as a Java Properties file. You might decide to use XML or some other syntax for
specifying configuration settings in one or more files.
Shared Integration Architecture
This section describes the integration architecture shared by the Hadoop Developer
Templates. Some of the Java code that implements this architecture is shared by all of the
Hadoop Developer Templates and some of it is used only by the Hadoop Developer Templates
that operate within the context of a Hadoop cluster.
The following Java packages are part of this shared integration architecture:
These classes provide some general purpose utility and support classes that can be
useful in any of the templates.
These classes provide functionality for accessing configuration information that is
common across the templates.
These classes provide functionality for converting data between non-string data types
and the string formats expected by the SecureData APIs.
These classes provide functionality for translating data to and from the string formats
expected by the SecureData APIs.
These classes provide an abstraction layer for the Voltage SecureData APIs (the Simple
API and the REST API) used by the Hadoop Developer Templates for cryptographic
processing.
Package Contents
This package defines the following set of classes in .java source files of the same name:
• Base64 - This class provides methods for Base64 encoding and decoding.
• FileUtils - This class provides methods for reading from a text file, which is used
when reading from configuration files.
• Sanitizer - This class provides a method for sanitizing log messages, which is useful
for mitigating Log Forging security vulnerabilities.
NOTE: This class is used to scrub all log messages to remove any newline/tab
characters that may have come from user-provided input. While this functionality
mitigates the forging of illegitimate log messages by ending the current log line
and starting a new one, it does not provide any protection against other types of
logging attacks. These other types of logging attacks, which include cross-site
scripting (XSS), SQL injection, and so on, must be mitigated, as necessary, by any
downstream processes that read and/or display the logs. For example, the
Hadoop log viewer Web UI automatically performs protection against XSS by
escaping any HTML/Javascript characters in the log messages before rendering
them in HTML responses.
If your job logs are being processed by one or more custom downstream systems,
make sure those systems perform the necessary mitigation (escaping, scrubbing,
and so on) as appropriate to their context(s).
On Linux platforms, the Simple API uses OpenSSL for secure connections to the Key
Server. OpenSSL uses a trustStore directory that contains trusted root certificates as
individual .pem files, each of which has a hash-named symbolic link that allows for fast
lookups.
Clients using the REST API must configure their local TLS transport mechanism to trust
the required root certificates. Exactly how this is done depends on which REST library
you are using. As delivered, the template code for the REST API uses the Apache HTTP
Client library, which uses the JVM truststore for establishing root certificate trust.
Using this approach, adding a new trusted root certificate for use in the Developer
Templates is exactly the same as just adding a new trusted root certificate for the
Simple API. After the new certificate is added to the Simple API trustStore directory
(and that directory is re-hashed for use by the Simple API), the new certificate is
automatically added to the JVM truststore whenever any of the Developer Template
samples are executed.
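For example, on a Linux data node you might add a new root certificate and regenerate the hash-named links with commands along these lines (the trustStore path is illustrative, and older OpenSSL releases provide the c_rehash script instead of the openssl rehash subcommand):

cp MyRootCA.pem /opt/voltage/simpleapi/trustStore/
openssl rehash /opt/voltage/simpleapi/trustStore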
Authentication
The following Java package and its associated Java source code provides a thin wrapper class
for requesting a Kerberos authentication/delegation token from the Voltage SecureData Server:
Package Contents
This package defines the following class in a .java source file of the same name:
Common Configuration
The following Java package and its associated Java source code provide classes to read and
parse the configuration settings from several different types of configuration files used by the
Developer Templates:
Package Contents
This package defines the following set of classes and enumerations in .java source files of the
same name:
• FilepathBuilder - This utility class provides a method for constructing file paths.
• UserInfo - This interface defines a uniform mechanism for retrieving the current user
in different operating contexts (generic Hadoop versus Hive).
Package Usage
This package provides functionality to read a set of configuration settings that are typically
required by the Voltage SecureData APIs regardless of whether they are used in the context of
Hadoop or not. These settings can come from a variety of sources, such as XML configuration
files on HDFS, as used by the Hadoop Developer Templates, or from a Java Properties file on
the local file system, as used by the NiFi and Kafka-Storm templates.
The XML and Java Properties configuration files processed by the classes in this package must
conform to the relevant XSD files and the requirements specified in "Java Properties
Configuration Files" (page 3-45), respectively.
When first running the sample jobs in the Hadoop Developer Templates, you can leave all the
settings at their default values in the configuration files, except for possibly customizing the
location where you installed the Simple API on the data nodes, as follows:
XML: <simpleAPI installPath="/path/to/simpleapi/location" />
or
Java Properties: simpleapi.install.path = /path/to/simpleapi/location
You may also want to change the configuration settings for your own Voltage SecureData
Server, after first trying out the jobs against the public-facing Voltage SecureData Server
dataprotection, hosted by Micro Focus.
This package will read and populate in-memory versions of the following configuration settings,
required by all of the Developer Templates:
• Configuration Version
• REST Hostname
XML configuration file (vsconfig.xml) and Java Properties configuration file
(vsnifi.properties).
• Product Name
XML configuration files (vsconfig.xml and/or vs<component>.xml) and Java
Properties configuration file (vsnifi.properties).
• Product Version
XML configuration files (vsconfig.xml and/or vs<component>.xml) and Java
Properties configuration file (vsnifi.properties).
• Identity
XML configuration file (vsconfig.xml) and the Properties tab of the Configure
Processor dialog box of the SecureDataProcessor NiFi Processor.
• API
XML configuration file (vsconfig.xml) and the Properties tab of the Configure
Processor dialog box of the SecureDataProcessor NiFi Processor.
• Format
XML configuration file (vsconfig.xml) and the Properties tab of the Configure
Processor dialog box of the SecureDataProcessor NiFi Processor.
• Translator Class
XML configuration files only (vsconfig.xml and/or vs<component>.xml).
• Field Index
XML configuration files only (vsconfig.xml and/or vs<component>.xml).
• Field Name
XML configuration files only (vsconfig.xml and/or vs<component>.xml).
• Authentication/Authorization Method
XML configuration file (vsauth.xml) and the Properties tab of the Configure
Processor dialog box of the SecureDataProcessor NiFi Processor.
• Shared Secret
XML configuration file (vsauth.xml) and the Properties tab of the Configure
Processor dialog box of the SecureDataProcessor NiFi Processor.
• Username
XML configuration file (vsauth.xml) and the Properties tab of the Configure
Processor dialog box of the SecureDataProcessor NiFi Processor.
• Password
XML configuration file (vsauth.xml) and the Properties tab of the Configure
Processor dialog box of the SecureDataProcessor NiFi Processor.
NOTE: The configuration settings described in section "Configuration Settings" on page 3-5
also include a number of XML infrastructure settings that are used within the XML files only
to categorize that configuration information and link it together in useful ways. These
settings include: Component-Specific Designator for Client ID, CryptId Name, CryptId AuthId
Name, CryptId Description, Component-Specific Designator for Fields, CryptId Name for
Field, and AuthId Name.
Hadoop Configuration
The following Java package and its associated Java source code provide classes to extend
several common configuration classes for reading and storing configuration properties that are
specific to the shared configuration files used by the Hadoop Developer Templates
(MapReduce, Hive, Sqoop, and Spark), as well as for reading those configuration files from
HDFS:
NOTE: The NiFi Developer Template extends the same common configuration classes with
its own specific configuration classes. For more information, see "Processor Classes for the
NiFi Developer Template" (page 8-8).
Package Contents
This package defines the following set of classes in .java source files of the same name:
This class wraps the call to the populate method of the HadoopConfigPopulator
class in code that reads several Java Properties files from a specified or default location
in HDFS. This approach allows for the addition of other types of configuration loaders
that load configuration data from different input sources; reading from HDFS is just one
example approach demonstrated in the Hadoop Developer Templates.
The class HDFSConfigLoader provides alternative ways to specify the locations of the
Hadoop configuration files:
• Using another configuration file that contains the locations of the primary
configuration files in HDFS, packaged into a well-known JAR file.
For more information about these alternative methods, see "Specifying the Location of
the XML Configuration Files" (page 3-47).
• ReleaseInfo - This class provides methods for retrieving release information from
the Java Properties file vsrelease.properties. This information is logged when
configuration information is loaded and used in the REST API’s User Agent field.
• vsmr.xml - Optionally used to specify clientId and field configuration information
for the MapReduce Developer Template that can also be specified in the shared
configuration file vsconfig.xml.
• vshive.xml - Optionally used to specify clientId configuration information for the
Hive Developer Template that can also be specified in the shared configuration file
vsconfig.xml.
NOTE: In version 4.2, the field configuration information for the RDD and Dataset variants of
the Spark Developer Template uses a component-specific configuration file. The field
configuration information for the MapReduce Developer Template and the Sqoop Developer
Template is provided in the shared configuration file vsconfig.xml. Beginning with
version 4.1, the UDF-based Developer Templates, including the Hive Developer Template
and the UDF variants of the Spark Developer Template, no longer require field configuration
information in a configuration file. Instead, they specify a cryptId to be used when protecting
or accessing the field to which they are applied.
Package Usage
The Hadoop Developer Templates (MapReduce, Hive, Sqoop, and Spark) share a pair of XML
configuration files and optionally use individual configuration files for each individual Hadoop
component (vsspark-rdd.xml being the only example included):
The implementation that uses these configuration files provides an example of the types of
configuration information that your data nodes will need in order to be able to use one or more
of the three Voltage SecureData APIs while running their jobs. It also serves as an example of
one possible approach to making this configuration information available to your data nodes.
This package will read and populate the common configuration settings (see "Common
Configuration" (page 3-57)) and also an in-memory version of the following configuration
setting, required when Kerberos authentication is used with the Hadoop-specific Developer
Templates:
All of these Developer Templates are located together in the following directory:
<install-dir>/stream
Package Contents
This package defines the following set of classes in .java source files of the same name:
• StreamConfigLoader - This class provides methods for constructing the full paths
to the configuration files (and for legacy situations, for getting the configuration settings
from a Java Properties file into the in-memory container classes).
Package Usage
This package provides functionality common to the DataStream Developer Templates:
StreamSets, Kafka Connect, and Kafka-Storm. At this time, this functionality is used to read the
set of configuration settings that are required by the underlying Voltage SecureData APIs. As
shipped, these settings come from XML configuration files on the local file system but, to support
legacy Kafka-Storm installations, they can also come from Java Properties configuration files on
the local file system.
The XML and Java Properties configuration files processed by the classes in this package must
conform to the relevant XSD files and the requirements specified in "Java Properties
Configuration Files" (page 3-45), respectively.
When first running the sample jobs and pipelines in the DataStream Developer Templates, you
can leave all the settings at their default values in the configuration files, except for possibly
customizing the location where you installed the Simple API on the data nodes, as follows:
XML: <simpleAPI installPath="/path/to/simpleapi/location" />
or
Java Properties: simpleapi.install.path = /path/to/simpleapi/location
You may also want to change the configuration settings for your own Voltage SecureData
Server, after first trying out the jobs and pipelines against the public-facing Voltage SecureData
Server dataprotection, hosted by Micro Focus.
This package will read the specified configuration files on the local file system and populate in-
memory versions of the configuration settings described in "Common Configuration" (page 3-
57).
Data Conversion
The following Java package and its associated Java source code provide classes to convert
between specific data type objects (such as dates and doubles) and generic strings:
Package Contents
This package defines the following set of classes and interfaces in .java source files of the
same name:
• BigDecimalConverter, DoubleConverter,
FloatingPointNumberConverter, IntegerConverter, LongConverter,
and NumberConverter - These classes provide a class hierarchy for the classes used
to convert various Java numeric data types to and from the string data expected by the
SecureData APIs.
• LegacySimpleAPIDateConverter,
LegacySimpleAPIDateTimeConverter, and
LegacySimpleAPITimeOnlyConverter - These classes provide functionality to
convert between Java date/time objects and the date/time string format expected by
pre-4.3 versions of the Simple API.
NOTE: The DataConverterFactory class, as provided, does not use these classes.
They are only present to support legacy pre-4.3 Simple API date/time operations. If
needed, you will need to change the code in the DataConverterFactory class to
use one or more of the appropriate Legacy* classes when using a pre-4.3 version of
the Simple API.
Package Usage
Because the SecureData APIs accept input only as strings, non-string input data sometimes
needs to be changed to a string format before passing it to a SecureData API. The classes in
this package are used to convert from non-string data types to strings prior to data protection
processing, and then to convert the data protection results from strings back to the specific
data types.
These classes are used for the Sqoop Developer Template integration, where the fields in the
database table have specific non-string data types. See the Developer Template code and
Javadocs for the DataConverter interface for more details.
When the Sqoop import integration runs, it automatically determines the data type of the fields
being protected, using the data types of the corresponding getter methods in the generated
object-relational mapping (ORM) class. For more information about how a generated ORM
class is used in the Sqoop template integration, see "Integration Architecture of the Sqoop
Template" (page 6-2). After the data types are determined, the Sqoop template integration
maps any non-string data types to an appropriate DataConverter implementation class,
which performs the specific conversions to and from the corresponding string formats.
For example, a date field would have a specific converter to convert the Java Date object to a
formatted string value, which is needed before calling either the Simple API or the REST API.
After the string result is returned by the relevant API, it is converted back into a Java Date
object for Sqoop to write as output by the ORM class.
NOTE: In most cases, the data type conversion is straightforward. However, there are
situations where it gets more complicated, such as date processing in pre-4.3 versions of
the Simple API. Versions of the Simple API prior to 4.3 do not work on formatted date/time
input values directly, and require translation into an internal syntax of the following form:
<year>:<month>:<day>:<hour>:<minute>:<second>:::
For more information about the stricter date formatting requirements in older versions of the
Simple API, see the Simple API Release Notes for version 4.2 or earlier.
The mapping from the API in use (Simple API or REST API) and a specific data type (such as
Date) to the corresponding DataConverter implementation class is encapsulated in the
DataConverterFactory class. Specific integrations such as the Sqoop template
integration can call this factory to request the appropriate converter implementation for the
field being processed.
The concrete implementation classes (with class names of the form <datatype>Converter)
implement the methods convertToString and convertFromString to convert between
the specific object type and its corresponding string representation.
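The following Java sketch illustrates that general shape. The interface and method signatures shown here are assumptions for illustration only; consult the DataConverter interface and its Javadocs in the Developer Template code for the actual contract.

// Hypothetical shape of the DataConverter contract described above; the
// interface shipped with the Developer Templates may differ.
interface DataConverter {
    String convertToString(Object value);
    Object convertFromString(String value);
}

// Example converter following the <datatype>Converter naming pattern.
class ExampleIntegerConverter implements DataConverter {

    // Convert the typed value to the string form expected by the SecureData APIs.
    @Override
    public String convertToString(Object value) {
        return value == null ? null : value.toString();
    }

    // Convert the protected/accessed string result back to the typed value.
    @Override
    public Object convertFromString(String value) {
        return value == null ? null : Integer.valueOf(value);
    }
}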
Because the Sqoop integration knows the data types of the input fields/columns, this
conversion is performed automatically, without requiring any custom settings in the Hadoop
configuration file vsconfig.xml.
CAUTION: Some of the converter classes provided by the converter package (such as the
converter classes for converting from and back to different numeric data types) are not
exercised by the Sqoop integration, as delivered. If you choose to use them, be sure to test
them thoroughly before you deploy them.
Data Translation
The following Java package and its associated Java source code provide classes to translate
between an input string format and the string format expected by the SecureData APIs:
Package Contents
This package defines the following set of classes and interfaces in .java source files of the
same name:
Package Usage
The translator package provides classes for translating between different string
representations of data before and after processing by a Voltage SecureData API. An input
value to be processed may need to be pre-processed into a different string format before a
Voltage SecureData API is invoked, and the data protection results post-processed back into
the original string format.
NOTE: For converting between different data types and their corresponding string
representations, as required by the Sqoop template import integration, see "Data
Conversion" (page 3-65).
When a Hadoop Developer Template integration cannot automatically determine the data
types of individual values, such as the string inputs in HDFS processed by the MapReduce
Developer Template, they may require translation before and/or after they are protected or
accessed. In such cases, you must configure the appropriate translator implementation class in
the Hadoop configuration file vsconfig.xml so that the correct translation will be performed.
The Developer Template code also shows an advanced option to initialize the translator with
optional custom settings. While this advanced option is not needed for the sample data and
configuration provided with the Developer Templates, the Javadocs and the Developer
Template code itself show how this is implemented. An example of how this custom
initialization could be used would be to update the configuration settings to allow full date/time
(with a time granularity of one second) processing when using the Simple API. For details, see
the method init in the class LegacySimpleAPIDateTranslator.
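For example, a cryptId that attaches such a translator might be sketched as follows, using the same placeholder style as the vsconfig.xml examples earlier in this chapter (the class name would be the fully qualified name of the translator implementation, and the initialization data is optional):

<cryptId name="cryptId name"
         format="data protection format"
         identity="identity for key derivation"
         authId="authId name"
         api="API to use"
         translatorClass="fully qualified translator class name"
         translatorInitData="optional translator initialization data"
         description="informational description" />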
NOTE: In the case of Sqoop, an explicit translator is often not required because the data type
of the table field/column automatically determines the data conversion to perform, as
described in "Data Conversion" (page 3-65). However, if you are storing date values as a
string (VARCHAR) column in your database table, you might need to configure an explicit
translator to perform pre- and post-processing in the course of protecting or accessing that
field/column.
Cryptographic Abstraction
The following Java package and its associated Java source code provide classes to create a
cryptographic abstraction layer, hiding the details of the calls to the different SecureData APIs
behind a single generic data protection API:
Package Contents
This package defines the following set of interfaces, abstract classes, and classes in .java
source files of the same name:
• CryptoFactory - This class provides methods for caching and re-using crypto
instances for a given set of format information for each of the Voltage SecureData data
protection APIs.
• SimpleApiTester - This class provides a mechanism for testing the Simple API
outside the context of the abstraction layer.
Package Usage
The protect and access APIs provided by this layer are defined in a general Crypto interface
and implemented in the LocalCrypto and RestCrypto classes (mostly implemented in their
shared abstract superclass BaseCrypto).
In the Hadoop Developer Templates, the MapReduce, Hive, and Sqoop template code calls the
getCrypto method of the class CryptoFactory, which returns either a new or recycled
instance of either the LocalCrypto or RestCrypto class, depending on either the global or
column-specific configuration settings that specify which Voltage SecureData data protection
API to use in each case.
In the NiFi Developer Template, the method that processes the NiFi processor’s input stream
calls the getCrypto method of the class CryptoFactory, which returns either a new or
recycled instance of either the LocalCrypto or RestCrypto class, depending on the API
type configured for that processor.
In the Kafka-Storm Developer Template, the Storm bolt template code calls the getCrypto
method of the class CryptoFactory, which returns either a new or recycled instance of the
LocalCrypto class by default, or RestCrypto class if rest is provided as an optional fourth
command line parameter to the script run-storm-topology.
Using this approach, the calls to the Voltage SecureData APIs are isolated in a specific section
of the code, and not called directly by the Hadoop job, Storm bolt, NiFi processor code, and so
on. In other words, instead of the relevant Developer Template Java classes calling the Simple API or
the REST API directly, they request a Crypto object from the CryptoFactory class, and then
use the returned Crypto instance to perform the data protection operations, without
knowledge of whether these operations are being performed locally by the Simple API or
remotely by the REST API.
The data protection abstraction layer, along with some configuration settings, hides which of
the two available SecureData APIs is actually performing the data protection operations on
behalf of the code that is calling the classes in the crypto package. It serves as a good
example of best practices template code that is ready for use in a production environment after
appropriate testing.
• For most of the Developer Templates, from the XML configuration files vsauth.xml
and vsconfig.xml. If you change how configuration is performed for your production
Developer Templates solution, you will need to make corresponding changes in the
crypto package.
• For the NiFi Developer Template, from the configuration file vsnifi.properties and
from the properties configured for the template’s sample processor. Likewise, if you
change how configuration is performed for your production NiFi workflow, you will need
to make corresponding changes to the crypto package.
This factory/interface approach follows the recommended practice of loose coupling and
programming to an interface, not to an implementation.
// Get Crypto instance for specified API type and format info.
Crypto crypto = CryptoFactory.getCrypto(apiType, formatInfo);
NOTE: Using a data protection abstraction layer is a recommended best practice, but not a
requirement when calling the SecureData APIs. The Developer Templates show an example
of using this approach, but you can integrate the SecureData APIs into your Hadoop jobs
and NiFi workflows in different ways.
Using Old Versions of Other Voltage SecureData Software
By default, the MapReduce Developer Template attempts to protect the data in the name
column in the sample data file plaintext.csv using the Simple API. The first 20 rows of this
sample data file use characters outside the ASCII range (such as the accent characters used in
many European languages) in the name column, as highlighted in the following names:
• Fabien Baillairgé
• Adélaïde Clérisseau
• Jean-Noël Emmanuelli
To protect more than the ASCII-range characters in these names, the XML configuration file
vsconfig.xml defines a cryptId named extended that uses the FPE2 format
AlphaExtendedTest1. In order to protect this field as expected, you must be using version
5.0 or later of the Simple API (because version 5.0 of the Simple API was the first version to
include support for FPE2 formats). If you are using an older version of the Simple API, there are
a few alternatives to make the MapReduce Developer Template run successfully (a configuration
sketch illustrating these alternatives follows the list):
• In the XML configuration file vsconfig.xml, you could change the cryptId extended
to use the REST API:
In order for this solution to work, you must be using version 6.0 or later of the Voltage
SecureData Server (because version 6.0 of the Voltage SecureData Server was the first
version to include REST API support for FPE2 formats).
• In the XML configuration file vsconfig.xml, change the name field (index 1) to use a
cryptId that does not specify a FPE2 format. For example, use the cryptId named alpha
instead:
Note that the extended characters (outside the ASCII range), such as those highlighted
above, will not be protected and will pass through to the ciphertext unchanged.
• In the XML configuration file vsconfig.xml, comment out the field specification for
the name field, leaving it unprotected:
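The following sketch illustrates these alternatives. The entries are placeholders patterned on the cryptId and field examples earlier in this chapter; the exact entries shipped in vsconfig.xml, including the designator used for the REST API in the api attribute, may differ:

<!-- Alternative 1: keep the FPE2 format but direct the cryptId to the REST API -->
<cryptId name="extended"
         format="AlphaExtendedTest1"
         api="REST API designator" />

<!-- Alternative 2: map the name field (index 1) to the cryptId alpha instead -->
<field index="1" cryptId="alpha" />

<!-- Alternative 3: comment out the field specification for the name field -->
<!-- <field index="1" cryptId="extended" /> -->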
In contrast to the associated Voltage SecureData software version assumed by the MapReduce
Developer Template, the Hive Developer Template, as shipped, does not assume that the
Simple API and the REST API support FPE2 extended characters. In the scripts
run-hive-join-query.hql and run-impala-join-query.sql, the following lines,
respectively, are commented out:
-- accessdata(s.name, 'extended') AS name_decrypted,
To access the name field using the cryptId extended, remove the comment designators (--)
from the beginning of these lines (and to avoid retrieving the ciphertext version as well, remove
the s.name, from the line above).
If you are going to use your own Voltage SecureData Server of version 6.0 or later, you will need
to define an appropriate format named AlphaExtendedTest1, as shown above, or an
equivalent FPE format that includes the required extended characters in its protection alphabet
with a corresponding change to the format.name field(s) above. For information about how to
define this format, see "AlphaExtendedTest1" (page A-3).
For more information about the extended character support provided by the REST API and the
Simple API, see the Voltage SecureData REST API Developer Guide and the Voltage
SecureData Simple API Developer Guide (version 5.0 or later), respectively.
Shared Sample Data for the Hadoop Developer Templates
The Hadoop Developer Templates share two sample data files, located in the sampledata
subdirectory. The sample data files are CSV files that contain randomly-generated data that
has been tested with the software. The scripts that implement the sample jobs use this sample
data to demonstrate that data can be protected and accessed in your environment. The sample
data files are described in the following sub-sections.
plaintext.csv
This sample data file includes 10,000 rows of plaintext data consisting of the following
columns:
5 - Zip code (mix of 5-digit numbers and 9-digit numbers with a delimiter)
8 - Date of birth (mix of digits with delimiters separating the year/month/day, in the
pattern YYYY-MM-DD)
For the MapReduce Developer Template, the configuration file vsconfig.xml includes
settings needed to protect and access the data in columns 1, 7, 8, 9, and 10 using the Simple
API for all cryptographic operations except for the credit card numbers in column 9, which are
processed by the REST API using an SST format.
For the Sqoop Developer Template, the configuration file vsconfig.xml includes settings
needed to protect and access the data in columns 7, 8, 9, and 10 using the Simple API for all
cryptographic operations except for the credit card numbers in column 9, which are processed
by the REST API using an SST format.
The data that is specified for protect and access operations by these settings in the
configuration file vsconfig.xml includes:
NOTE: The names in column 1 of the first 20 rows in the sample data file
plaintext.csv include characters outside the normal ASCII range.
• The email addresses in column 7 are protected with a format named Alphanumeric.
This is a Variable-Length String (VLS) format that uses FPE to encrypt PII.
• The date of birth values in column 8 are protected using a format named DATE-ISO-
8601. This is a date format that uses FPE to protect PII.
• The credit card numbers in column 9 are protected using a format named cc-sst-6-4.
This is a credit card format that uses Secure Stateless Tokenization™ (SST) protection
to tokenize Payment Card Industry (PCI) data. In this format, the first six digits and the
last four digits remain in the clear, and the middle digits are tokenized.
• The US Social Security numbers in column 10 are protected using a format named SSN.
This is a US Social Security Number format that uses FPE to protect PII.
creditscore.csv
This file includes 10,000 rows of plaintext data consisting of the following two columns:
• US Social Security number (with values identical to those in column 10 of the file
plaintext.csv)
Common Procedures for the Hadoop Developer Templates
This section provides instructions for performing procedures that are common to all of the
Hadoop Developer Templates, including common procedures for setting up HDFS as expected
by the templates and common procedures for working with Kerberos authentication.
If this directory exists, the command prompt returns a message showing the items found, if any.
If this directory does not exist, a message indicates that there is no such file or directory. In this
case you must create the directory and set the owner using commands similar to the following:
sudo -u hdfs hdfs dfs -mkdir /user/<user>
sudo -u hdfs hdfs dfs -chown <user>:<user> /user/<user>
If you see an error that the hdfs command is not found, you can add the hdfs script
location into the PATH variable, using a command similar to the following:
export PATH=$PATH:/opt/mapr/hadoop/hadoop-<version>/bin
Note that if this user directory does not exist, the sample scripts fail with the following Hadoop
security exception:
org.apache.hadoop.security.AccessControlException:
Permission denied: user=<user>, access=WRITE, inode="/user"...
Subsequent commands in this chapter refer to relative paths for configuration and input and
output files in HDFS, and must be run as the user account for the home directory specified in
this section.
For example, the configuration files are specified as relative paths in HDFS:
voltage/config/vsauth.xml
voltage/config/vsconfig.xml
In Hadoop, the full absolute paths to these files are resolved relative to your home directory in
HDFS, as follows:
/user/<user>/voltage/config/vsauth.xml
/user/<user>/voltage/config/vsconfig.xml
• Copies the updated configuration files vsauth.xml and vsconfig.xml to the HDFS
config directory created above.
NOTE: This script will copy these two default configuration files to their default
location in HDFS regardless of whether you are using either of the two alternative
methods for specifying configuration files for the Hadoop Developer Templates. For
more information about these alternative methods, see "Specifying the Location of
the XML Configuration Files" (page 3-47).
• Copies the sample data files plaintext.csv and creditscore.csv to HDFS. The
former sample data file is copied to the directory expected by the MapReduce protect
job (the HDFS mr-sample-data directory created above). The latter sample data file is
copied to the directory expected by the Hive table creation script (the HDFS
hive-sample-data directory created above).
• To support the creation of permanent Hive UDFs, copies the required JAR files
(voltage-hadoop.jar and vibesimplejava.jar) to the HDFS hiveudf directory
created above.
NOTE: If you are using the JAR-based alternative configuration file location
approach, as described in "Config-Locator Properties File Packaged as a JAR File"
(page 3-48), you can uncomment a line in this script to also copy the JAR file
vsconfig.jar to the HDFS hiveudf directory created above.
Because this script does nothing other than copy two default configuration files to their default
location in HDFS, it is not useful (as is) if you are using either of the two alternative methods for
specifying different locations for the Hadoop Developer Template configuration files. For more
information about these alternative methods, see "Specifying the Location of the XML
Configuration Files" (page 3-47).
Depending on your scenario, the following two modifications to this script may be useful:
• If you changed the MapReduce, Hive, and/or Sqoop Developer Templates so that they
use one or more component-specific configuration files, you can modify this script to
also copy the relevant component-specific configuration files to the appropriate
directory in HDFS and set their file access attributes, as sketched after the following note.
NOTE: In this regard, you could also follow the model used by the Spark Developer
Template, which provides the auxiliary script update-spark-config-files-in-
hdfs for updating its component-specific configuration file (vsspark-rdd.xml).
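As a hedged illustration of that kind of modification, the following lines copy a hypothetical
MapReduce-specific configuration file (the name vsconfig-mr.xml and its local location are
assumptions, not files shipped with the templates) and restrict its permissions in the same
spirit as the default files:
hdfs dfs -put -f vsconfig-mr.xml voltage/config/
hdfs dfs -chmod 640 voltage/config/vsconfig-mr.xml        # limit access to the owner and the file's group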
When you opt to use Kerberos authentication with the Hadoop Developer Templates, there are
some extra steps that you must take, including the use of several Kerberos-specific scripts
provided with the templates. This section provides instructions for these additional steps. It
begins with a short description of the prerequisites for using Kerberos authentication with the
Hadoop Developer Templates, followed by a detailed description of the Kerberos-specific
scripts, their parameters, and other operational details.
NOTE: Kerberos authentication is not supported for the NiFi Developer Template and the
Kafka-Storm Developer Template.
• A functional Kerberos Key Distribution Center (KDC), which the Hadoop cluster is
configured to use.
• Version 6.5 or higher of the Voltage SecureData Server, required for Kerberos server-
side authentication configuration and REST API support.
• Version 5.20 or higher of the Simple API, required for Kerberos authentication support
when requesting cryptographic keys for local protect and access operations.
IMPORTANT: In order to build the Kerberos service ticket on the Hadoop client, the nodes in
the Hadoop cluster must be running Java 8 (Update 151 or higher). Earlier versions of Java
are not able to build the required service tickets.
• vskinit - This script parallels the Kerberos kinit command and can be run as follows
from the directory <install_dir>/bin:
> ./vskinit <optional parameters>
Use the vskinit command to initialize and store a Kerberos delegation token for the
current user that can be used for authentication of subsequent interactions with the
Voltage SecureData Server. This delegation token is short-lived, and automatically
expires after 24 hours.
• vsklist - This script parallels the Kerberos klist command and can be run as follows
from the directory <install_dir>/bin:
> ./vsklist <optional parameters>
Use the vsklist command to list information about the current user's delegation
token, if they have one. This command will also warn you if the delegation token has
possibly expired, based on the timestamp of the token file.
• vskdestroy - This script parallels the Kerberos kdestroy command and can be run
as follows from the directory <install_dir>/bin:
> ./vskdestroy <optional parameters>
Use the vskdestroy command to destroy the current user's delegation token (delete
the delegation token file), if they have one. This explicit step is useful for immediately
preventing subsequent jobs from using the token to authenticate with the Voltage
SecureData Server before it expires on its own.
For more information about this optional step, see "Optional Destruction of the
Delegation Token" (page 3-84).
NOTE: Make sure that the users running the Hadoop jobs and these vsk* scripts have
sufficient permission to read and write files in the HDFS directory specified by the
Delegation Token HDFS Path configuration setting (page 3-26), including permission to
create the directory if it does not already exist. Otherwise, Hadoop will throw a “Permission
denied” exception when it attempts to write the user's delegation token file when running
the vskinit script.
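Putting these pieces together, a typical interactive sequence might look like the following
sketch; the Kerberos principal and the job script name are placeholders:
kinit alice@EXAMPLE.COM        # obtain (or renew) the Kerberos TGT
cd <install_dir>/bin
./vskinit                      # request and store a delegation token for the current user in HDFS
./vsklist                      # optional: confirm that the token exists and has not expired
./run-mr-protect-job           # run a template job that authenticates using the stored token
./vskdestroy                   # optional: remove the token as soon as it is no longer needed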
All of the Kerberos-specific scripts provided with the Hadoop Developer Templates share
the same set of optional parameters:
• --config <HDFS-path> Specify a custom location (the path and filename) for
the Hadoop Developer Template configuration file in
HDFS (normally vsconfig.xml). This can be an
absolute or relative (to the user's home directory)
HDFS path. You can use -c instead of --config.
• --auth <HDFS-path> Specify a custom location (the path and filename) for
the Hadoop Developer Template authentication file in
HDFS (normally vsauth.xml). This can be an
absolute or relative (to the user's home directory)
HDFS path. You can use -a instead of --auth.
NOTE: There are comments at the top of each of these Kerberos-specific scripts with
more information about these optional command-line arguments. You can also use the
--help argument interactively to remind yourself about the available optional
parameters.
By default, all of the Kerberos-specific scripts will look for the configuration files
vsconfig.xml and vsauth.xml in the following HDFS directory:
/user/<user>/voltage/config
The vskinit script requests a delegation token from the Voltage SecureData Server. The
hostname used to connect to the Voltage SecureData Server is read from the XML
configuration file vsconfig.xml from either the optional webService element or the
required secureDataServer element. Make sure this setting is specified correctly in
whichever vsconfig.xml file you use with the vskinit command (in the default
location or, as described below, in a custom location), even if you are not using the REST
API for cryptographic operations.
The only information that the Kerberos-specific scripts use from the XML configuration file
vsauth.xml is the value of the delegationTokenHdfsPath attribute of the kerberos
element, which determines the HDFS location in which to store the delegation token
downloaded from the Voltage SecureData Server. No other settings from this configuration
file are relevant in the context of running the Kerberos-specific scripts. In particular, note
that all authMethod attribute settings are ignored, so it does not have to be explicitly set to
Kerberos to run these scripts. The scripts will initialize, list, or destroy the user's delegation
token using Kerberos authentication regardless of the authMethod attribute settings
configured in the instance of this configuration file used by each Kerberos-specific script. In
other words, you could run the vskinit script against a given vsauth.xml file and generate a
perfectly valid delegation token that never gets used, because a Hadoop job reading that same
vsauth.xml file will not attempt Kerberos authentication unless one of its authMethod
attribute settings specifies it.
As mentioned above, these scripts provide optional parameters for specifying custom
locations for these configuration files (--config and --auth). Because different Hadoop
jobs may use different instances of these configuration files, these parameters are
particularly useful in that scenario, when you are calling one or more of these Kerberos-
specific scripts from within another script that runs a specific Hadoop job. For example,
within a script that uses a Hive Developer Template UDF (that uses Kerberos
authentication and specifies a custom location for both configuration files) to perform a
query, it could be useful to include the following invocation of the vskinit script in that
same script to specify the same custom locations (using absolute HDFS paths) for both the
general and authentication XML configuration files:
./vskinit \
--config /apps/hive/voltage/config/vsconfig.xml \
--auth /apps/hive/voltage/config/vsauth.xml
NOTE: Sharing configuration files from a common location, as shown above, for multiple
users can be useful when their queries are all being run as the system user hive.
Likewise, you may also want to specify a shared location for the delegation token files
stored in HDFS, as described in the previous section. To do so, in the shared version of
the XML configuration file vsauth.xml, set the delegationTokenHdfsPath attribute of the
kerberos element to an appropriate absolute path. For example:
delegationTokenHdfsPath="/apps/hive/voltage/config"
If you are running these Kerberos-specific scripts interactively and you want to set a
different default location for the configuration files, you can do so by using the -D generic
option for specifying a property value on the yarn command line within these scripts. For
example, you could edit the script vskinit and add the following (highlighted) parameters
to the yarn command line:
yarn jar "$script_dir"/voltage-hadoop.jar \
com.voltage.securedata.hadoop.auth.VSKInit \
-DVOLTAGE_CONFIG_FILE=<custom_vsconfig_location> \
-DVOLTAGE_AUTH_FILE=<custom_vsauth_location> \
"$@"
For more information about this method of specifying custom locations for Hadoop
Developer Template configuration files, see "-D Generic Option to Specify a Property Value"
(page 3-47).
CAUTION: The other method for specifying an alternate location for the configuration
files, which involves packaging the alternate location of the configuration files within a
properties file within a JAR file, is not supported when using Kerberos authentication.
This is because the -libjars option used for this approach does not work with yarn
commands that do not launch MapReduce jobs, such as the Hadoop Developer
Templates Kerberos-specific scripts.
All of the Kerberos-specific scripts provided with the Hadoop Developer Templates
(vskinit, vsklist, and vskdestroy) require the current user to have a valid (not
expired) Kerberos ticket granting ticket (TGT) to succeed. You cannot initialize, list, or
destroy a delegation token without the TGT on which it is based. If you do not have the
required TGT (or it has expired), you will get the standard Kerberos security exception, and
the script will fail. For example:
java.io.IOException: Failed on local exception: java.io.
IOException: javax.security.sasl.SaslException: GSS initiate
failed [Caused by GSSException: No valid credentials provided
(Mechanism level: Failed to find any Kerberos tgt)]
If this occurs, check your Kerberos TGT, and when necessary, renew it by running the
Kerberos kinit command.
Sometimes your Kerberos TGT may be old, but not yet expired. In this case, Hadoop may
attempt to renew it when you run one of the Kerberos-specific scripts. If the Kerberos TGT
renewal fails for some reason, Hadoop will log a WARN message such as the following:
WARN security.UserGroupInformation: Exception encountered while
running the renewal command for <user>@<realm>. (TGT end time:
<timestamp>, renewalFailures: org.apache.hadoop.metrics2.lib.
MutableGaugeInt@<memory_address>,renewalFailuresTotal: org.
apache.hadoop.metrics2.lib.MutableGaugeLong@<memory_address>)
Because the Kerberos TGT is still usable (not actually expired), the relevant delegation
token will still be initialized, listed, or destroyed, as requested. In other words, keep in mind
that the message shown above is just a warning, and does not necessarily indicate a failure
with respect to the delegation token processing. If the script ends with a successful INFO
message, then it succeeded.
NOTE: You may see this same Kerberos TGT renewal WARN message from Hadoop in
situations that have nothing to do with Hadoop Developer Template delegation token
processing. For example, you can see this same WARN message when you list files in
HDFS using the hdfs dfs command. You can avoid seeing this warning repeatedly by
running the Kerberos kinit command to get a fresh Kerberos TGT from your KDC.
NOTE: This required step (whether performed as part of the login procedure or done
explicitly on the command line) is always necessary when using Kerberos and is not specific
to the Hadoop Developer Templates integration.
> ./vskinit
This script, which parallels the Kerberos kinit command, is used to initialize and store a
Kerberos delegation token for the current user that can be used for authentication of
subsequent interactions with the Voltage SecureData Server. This delegation token is short-
lived, and automatically expires after 24 hours.
NOTE: The user’s previously acquired Kerberos ticket granting ticket is used to construct a
Kerberos service ticket for the Voltage SecureData Server hostname voltage-pp-
0000.<district-domain>, with the service principal name HTTP/voltage-pp-
0000.<district-domain>@<Kerberos-realm>. The service ticket is sent to the
Voltage SecureData Key Server in an HTTP request header and authenticated using its
configured keytab file. The returned delegation token is stored in an HDFS file with a name
of the following form:
<delegation.token.hdfs.path_config_value>/<hashed-and-encoded-username>.token
The filename (other than the .token extension) is constructed by hashing the username
using SHA-256 and then Base64-encoding it using a standard variant of that encoding
called “modified Base64 for filename” in which slashes are replaced with dashes.
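As a rough sketch of how that filename could be reproduced from a shell prompt (the handling
of Base64 padding and of any '+' characters is not described here, so treat the result as an
approximation rather than a specification):
user=alice                     # illustrative username
hash=$(printf '%s' "$user" | openssl dgst -sha256 -binary | base64 | tr '/' '-')
echo "expected token file: <delegation-token-hdfs-path>/${hash}.token"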
The permissions on the delegation token file are set to -rw-r----- so that it is limited to
read/write by the user and read by the file’s group. This setting limits access to the token,
while still supporting the case when HiveServer2 doAs impersonation is turned off. For more
information about the Beeline/HiveServer2 scenario in which impersonation (doAs) is
disabled, see "Kerberos Authentication When Beeline/HiveServer2 Impersonation is
Disabled" (page 3-84).
These jobs and queries will either run directly as the current user launching the job, or as a
system user (such as hive), which has been added to the delegation token file’s assigned
group. For more information about running jobs and queries as a different user in the
delegation token file’s assigned group, see "Kerberos Authentication When Beeline/
HiveServer2 Impersonation is Disabled" (page 3-84).
On the Voltage SecureData Server side, the provided delegation token is used to perform
authentication, after which authorization is performed using the specified identity with the
configured Key Server Authentication Methods and Web Service Identity Authorization Rules.
The authorization step typically uses LDAP group membership.
> ./vskdestroy
In the context of Kerberos authentication, this also involves some additional configuration
steps, to allow the system user hive to load the configuration files and to locate and read the
delegation token file for the end-user running the query. When Hive UDFs are going to use
Kerberos authentication for multiple users, you would typically use a single set of configuration
files for all users. This is because there is no need for user-specific XML configuration files due
to the fact that, in the XML configuration file vsauth.xml, all authMethod attribute values are
set to Kerberos and no end-user credentials are specified. Similarly, the delegation token file
may be stored in a common location, specified as an absolute path instead of a path that is
relative to the current user.
1. Optionally customize the HDFS directory where the delegation token files created for
the users running the UDFs are stored. Do so by updating the value of the
delegationTokenHdfsPath attribute of the kerberos element in the XML
configuration file vsauth.xml. For more information, see "Kerberos Delegation Token
HDFS Location" (page 3-3).
2. Optionally customize the location of the XML configuration files (vsconfig.xml and
vsauth.xml) to use a common path instead of a user-specific one. For more
information, see "Specifying Custom Configuration File Locations for the Kerberos-
Specific Scripts" (page 3-81).
3. Add the system user hive to the delegation token file’s assigned group. Because
HiveServer2 with impersonation (doAs) disabled runs the UDF as the system user
hive, that system user needs to be able to read the delegation token file created for the
user running the query. The Kerberos-specific script vskinit ends by locking down the
permissions on the delegation token file such that read/write access is allowed by the
current user, and read access by any user in the file’s group (-rw-r-----). Therefore,
in order for the system user hive to be able to read the delegation token file, that
system user must be added to the file’s group, as follows:
> usermod -aG <group-name> hive
IMPORTANT: Because the delegation token file permissions include read access by
the file's group (to support the case when HiveServer2 doAs impersonation is turned
off), it is very important to limit this file group’s membership to privileged users only.
If the group that has been assigned to the token file includes non-privileged
members, then those members will be able to read the user's sensitive delegation
token from the file, representing a significant security issue. It is therefore very
important that you understand and control the token file's group assignment
carefully, and limit access as appropriate.
After that is done, you can verify that the system user hive belongs to the file’s group
(in HDFS), by running the following HDFS command:
> hdfs groups hive
This command lists all the groups to which the system user hive belongs so that you
can verify that the groups assigned to the token files for the end-users who will run the
UDFs are included in this list of groups.
NOTE: In some environments, the end-user may belong to a default group that has
the same name as the user, with the delegation token file assigned to that default
user group. In this case, you can add the system user hive to that username group
directly, as follows:
> usermod -aG <username> hive
To verify that the system user hive now belongs to that username group in HDFS,
run the same hdfs groups hive HDFS command, as described above.
After you perform the optional (1 and 2) and required (3) steps above, you can run the Generic
UDFs through Beeline/HiveServer2, even with impersonation (doAs) disabled. The system user
hive will detect the user running the query and read that user's delegation token file in order to
perform Kerberos-based authentication with the Voltage SecureData Server.
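For example, such a query might be issued through Beeline as sketched below; the JDBC URL,
table name, and column name are placeholders, and the UDF name follows the lowercase
naming convention used by the template scripts (here, the generic access UDF created from the
class AccessDataGeneric):
beeline -u "jdbc:hive2://<hiveserver2-host>:10000/default;principal=hive/_HOST@<REALM>" \
  -e "SELECT accessdatageneric(ssn, 'ssn') FROM customer LIMIT 10;"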
Logging and Error Handling in the Hadoop Developer Templates
The Hadoop Developer Templates log informational and error messages using the Apache
Commons Logging library. These log messages are written to the Hadoop job log files, which
you can view by using the Hadoop job history web UI or by using the hadoop job -history
command line.
NOTE: In general, the Hadoop Developer Templates are generous with respect to logging,
logging successful operations as well as failures. For performance reasons, the former may
not be appropriate in benchmarking and production environments. Change the logging
behavior as required by modifying the source code and rebuilding.
One change of note from releases prior to 3.0 is in the Hive UDF, in which logging of
successful operations is now commented out.
The logs contain general informational and debugging messages which can be helpful when
troubleshooting issues. Make sure to examine these logs if you encounter any errors when
executing the Hadoop Developer Templates.
When you adapt the Hadoop Developer Templates code to your own purposes, you can use
this logging facility for troubleshooting and debugging.
Beyond this logging, the Hadoop Developer Template clients perform no error handling other
than rethrowing the exceptions caught from the APIs. Proper handling of returned errors is left
to developers creating production-level solutions based on the Developer Templates, according
to the particular needs of their Hadoop jobs.
If any errors are returned by the REST API when it is called from the templates, they are logged
in the Hadoop job logs, just as in the case of the Simple API. The Developer Templates do not
attempt to handle the REST API errors in any way, other than just logging them. See the
Voltage SecureData REST API Developer Guide for this API’s full list of error codes and
associated messages.
The only exception to this rule is for the special case when the REST API call fails because the
version of the Voltage SecureData Server in use does not support the REST API. This would
happen if you attempt to use the REST API with a version of the Voltage SecureData Server
older than 6.0. This specific error condition is detected, an error is thrown causing the job to fail,
and the following message is written in the Hadoop job logs:
Unable to connect to REST service on specified SecureData server.
Check that SecureData server is at version 6.0 or higher.
Handling Empty and Net-Empty Values
All of the Hadoop Developer Templates return empty and net-empty input data as is,
unchanged:
Input data is empty when the input string upon which the cryptographic operation is being
performed contains zero (0) characters.
Input data is net-empty when, due to the nature of the relevant alphabet(s), no characters in
the input are subject to protection or access. How net-empty input data is handled depends on
the chosen FPE format, as explained below.
NOTE: For VLS formats, the net-empty concept is not relevant unless the Ignore
Characters not in Alphabet option is enabled for the format in question.
When this option is not enabled, valid input can be empty or it can be non-empty, containing
only valid characters in the specified relevant alphabet. If the input contains any characters
not in the specified relevant alphabet, an error is generated.
First, the general concept of net-empty only applies to some types of formats:
Net-Empty Relevant Format Types          Net-Empty Irrelevant Format Types
Credit Card (CC)                         Specified-Format String (SFS)
US Social Security Number (SSN)          Date
Variable-Length String (VLS)             Number
NOTE: The concept of net-empty also applies to the built-in format AUTO.
CC and SSN formats have implicit alphabets, which differ between FPE and eFPE formats:
• FPE: 0-9 for both plaintext and ciphertext
• eFPE: 0-9 for plaintext, and 0-9 and A-Z (uppercase) for ciphertext
VLS formats always have explicitly defined alphabets: a single alphabet for both plaintext and
ciphertext for FPE formats and separate plaintext and ciphertext alphabets for eFPE formats.
Also, for CC and SSN formats for which some leading and/or trailing characters are preserved,
net-empty determination becomes more complicated. A plaintext CC or SSN string is
considered net-empty if:
• No digits remain to be protected, either because the string contains no digits at all or
because every digit it contains is preserved by the format.
• The plaintext string meets all other requirements of the specified format.
With respect to meeting the other requirements of the specified format, it is valid, for example,
for a CC string to have no digits at all, but having fewer than 12 digits is not considered a valid
CC. Likewise, SSN strings can also have no digits at all, but if they have any digits, they must
have exactly nine of them. And even when a valid number of digits are present, if the format
specifies the preservation of some number of leading and/or trailing digits, there may not be
any digits remaining to be protected, resulting in a positive net-empty determination.
For example, consider a CC format with preserve leading six, preserve trailing six, and Luhn
ignore. If a 12-digit plaintext is provided for protection, it is considered net-empty and returned
as the ciphertext, as is, because the six preserved leading digits and six preserved trailing digits
leave no digits to protect.
Second, for the relevant formats, the rules for determining whether an input data string is net-
empty depend on whether the format is an FPE format or an eFPE format, the latter being more
restrictive:
• FPE Formats: Regular FPE formats specify a single alphabet for both plaintext and
ciphertext. Given this, an input data string is considered to be net-empty if it contains
only characters not in the format alphabet. Consider the following examples:
• For a credit card format, with an implicit input alphabet of digits only, the input
data string “----” is net-empty.
• For a Variable-Length String format with the explicit alphabet “A-Za-z”, the
input data string “01234” is net-empty, assuming that the “Ignore Characters not
in Alphabet” option, the default, was chosen when the format was created.
• eFPE Formats: Embedded FPE (eFPE) formats specify one alphabet for plaintext and a
different (and necessarily larger) alphabet for ciphertext. Given this, an input data string,
whether plaintext to a protection operation or ciphertext to an access operation, is
considered to be net-empty if it contains only characters that are in neither the plaintext
alphabet nor the ciphertext alphabet. Consider the following examples:
• For an eFPE US social security number format, with an implicit plaintext alphabet
of digits only, and an implicit ciphertext alphabet of digits and capital letters, the
input data string “--” is net-empty.
• For an eFPE credit card number format, the input data string “-ABCD--” is not
net-empty because the letters A, B, C, and D, are in the implicit ciphertext
alphabet for eFPE credit card formats.
When you are using the Hadoop Developer Templates, keep in mind that any empty or
net-empty input data that you provide is returned unchanged, without any notification that this
type of data was processed.
Known Limitations of the Developer Templates
This section reviews aspects of the Developer Template code that were intentionally kept
simple, both for easier comprehension and to avoid overshadowing more central aspects of the
code. As you develop your production-quality solution using one or more of the SecureData
APIs, keep in mind that these areas require additional improvement.
4 MapReduce Integration
The Hadoop Developer Templates demonstrate how to integrate Voltage SecureData data
protection technology in the context of MapReduce. This demonstration includes the use of the
Simple API (version 4.0 and greater) and the REST API.
This integration relies on a set of Java classes for MapReduce, as well as the classes in the
Hadoop Developer Templates common infrastructure. This chapter provides a description of
the former as well as instructions on how to run the MapReduce template using the provided
sample data. For more information about the common infrastructure used by the Hadoop
Developer Templates, see Chapter 3, “Common Infrastructure”.
• Integration Architecture of the MapReduce Template (page 4-1) - This section explains
the Java classes that are specific to the MapReduce template as well as information
about how batch processing is achieved for the different SecureData APIs that this
template can use.
• Configuration Settings for the MapReduce Template (page 4-4) - This section reviews
the configuration settings that are relevant to the MapReduce template and provides an
example of how and why you would want to change those settings as you adapt the
template to your own use.
• Running the MapReduce Template (page 4-6) - This section provides instructions for
running the MapReduce template using the provided sample data.
The following Java package and its associated Java source code provide classes that
implement the MapReduce integration:
The MapReduce integration performs data protection processing on an input CSV file and
produces an output CSV file, both within HDFS. Specific columns in the input file are either
protected or accessed, depending on which of the following two classes you specify on the
yarn command line:
com.voltage.securedata.hadoop.mapreduce.Protector
com.voltage.securedata.hadoop.mapreduce.Accessor
The integration includes an abstract base class called BaseMapReduce, which implements the
core functionality for this integration. An important aspect of this integration is the batching of
input plaintext or ciphertext for data protection processing, which is especially important to
minimize network overhead when using the REST API. For more information about batching in
the MapReduce template, see "Batch Processing for MapReduce" (page 4-3).
NOTE: The MapReduce template code does not perform any additional reducer processing.
The data protection processing is isolated in the crypto package abstraction layer. For more
information about this common infrastructure package, see "Cryptographic Abstraction" (page
3-68).
In the MapReduce template, the configuration settings are loaded using the Hadoop
configuration class HDFSConfigLoader and the class CryptoFactory is initialized in the
overridden method setup of the MapReduce Mapper class. For more information, see
"Hadoop Configuration" (page 3-60) and "Cryptographic Abstraction" (page 3-68),
respectively.
At runtime, the overall process, as implemented in the indicated shell scripts, involves the
following general steps:
1. Copy the sample data and configuration settings into HDFS, providing data upon which
the MapReduce template code can perform data protection processing and defining
which columns to protect or access and how to protect or access them.
2. Run the yarn command to launch the MapReduce job for the specific class
(Protector or Accessor), providing the Developer Template libraries (JAR files) and
the paths to the job input and output locations in HDFS.
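A minimal sketch of what such a yarn invocation could look like is shown below; the exact
arguments are defined in the provided run-mr-protect-job and run-mr-access-job
scripts, so the JAR locations, the use of -libjars, and the HDFS directories here are
illustrative only:
yarn jar ../lib/voltage-hadoop.jar \
  com.voltage.securedata.hadoop.mapreduce.Protector \
  -libjars ../lib/vibesimplejava.jar \
  voltage/mr-sample-data voltage/protected-sample-data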
• initConfiguration(Configuration config)
This method initializes the configuration by loading the settings from the configuration
files vsauth.xml and vsconfig.xml in HDFS using the class HDFSConfigLoader to
initialize a static instance of the class HadoopConfigSettings and then using that
instance to initialize the class CryptoFactory.
• initCryptoList()
This method initializes the list of Crypto instances to use to perform the protect/access
operations, based on the settings read from the configuration files vsauth.xml and
vsconfig.xml.
This method is the overridden method map of the MapReduce Mapper class, which the
yarn command expects to find in the class specified on its command line. This is also
where the batching logic can be found.
The batching of plaintext and ciphertext for efficient processing by the REST API is
accomplished within the MapReduce template code in the overridden method map of the
abstract static class BaseMapper. The BaseMapper class is nested within the abstract class
BaseMapReduce, which extends the Hadoop MapReduce class Mapper. This logic performs
the following steps:
1. Reads a set of lines from a CSV file in HDFS. The NLinesInputFormat and
NLinesRecordReader classes in the util package are used to retrieve multiple lines
at a time rather than the default of one line at a time.
4. Invokes the Simple API or the REST API, as specified, to protect or access each batch of
column values. When using the Simple API, the looping over the batch of plaintext or
ciphertext values is done in the methods protectFormattedDataList or
accessFormattedDataList of the template class LocalCrypto. When using the
REST API, the list processing is performed remotely as part of a single REST call.
5. Loops through the input lines again, and for the relevant columns, replaces plaintext
with ciphertext (protection operations) or replaces ciphertext with plaintext (access
operations), and writes the output line, one at a time, as output of the map method.
Configuration Settings for the MapReduce Template
There are three classes of configuration settings used by the MapReduce Developer Template:
<fieldMappings>
<fields component="mr">
<field index = "1" cryptId = "extended"/>
<field index = "7" cryptId = "alpha"/>
<field index = "8" cryptId = "date"/>
<field index = "9" cryptId = "cc"/>
<field index = "10" cryptId = "ssn"/>
</fields>
.
.
.
</fieldMappings>
For more information about these settings, see "Common Configuration" (page 3-57).
Before you begin to modify the MapReduce Developer Template XML configuration files for
your own purposes, such as using your own Voltage SecureData Server or different data
formats, Micro Focus Data Security recommends that you first run the MapReduce Developer
Template samples as provided, giving you assurance that your Hadoop cluster is configured
correctly and functioning as expected.
You will need to update many of the values in the XML configuration files vsauth.xml and
vsconfig.xml in order to protect your own data using your own Voltage SecureData Server.
In order to add protection and access of the phone number field when running the MapReduce
Developer Template, you would need to edit the XML configuration file vsconfig.xml to add
new configuration settings to the MapReduce fields element, as follows (addition
highlighted):
<fieldMappings>
<fields component="mr">
<field index = "1" cryptId = "extended"/>
<field index = "6" cryptId = "alpha"/>
<field index = "7" cryptId = "alpha"/>
<field index = "8" cryptId = "date"/>
<field index = "9" cryptId = "cc"/>
<field index = "10" cryptId = "ssn"/>
</fields>
.
.
.
</fieldMappings>
• The index attribute of the new field element for the phone number is set to 6
because the phone numbers appear as the seventh CSV column (remember, zero-
based) in the sample data file plaintext.csv.
• The cryptId attribute of the new field element for the phone number is set to
alpha, which is the name of a cryptId that uses the built-in format Alphanumeric and
the Simple API (the default API). This format will produce ciphertext phone numbers
with plaintext digits replaced with either digits or letters (upper and lowercase). Non-
alphanumeric characters such as +, (, and ) will be preserved, as is.
NOTE: Remember that if you edit the versions of the configuration files vsauth.xml and/or
vsconfig.xml on the local file system, you must copy the updated versions to HDFS,
which is where the Hadoop Developer Templates will access them when they are run. For
instructions, see "Loading Updated Configuration Files into HDFS" (page 3-77).
Running the MapReduce Template
First, remember that you must perform the common preparatory steps, as described in
"Common Procedures for the Hadoop Developer Templates" on page 3-75.
To protect data in the plaintext.csv file, navigate to the bin directory and run the
following script:
./run-mr-protect-job
This script uses YARN to initiate a MapReduce job that protects the sample data, then writes
the protected output files to the following directory in HDFS (relative to your HDFS home
directory):
voltage/protected-sample-data
NOTE: The output is also copied to your local sampledata directory, for later use when
creating the Hive table.
To decrypt or de-tokenize the protected data that is now located in the directory voltage/
protected-sample-data, navigate to the bin directory and run the following script:
./run-mr-access-job
This invokes a MapReduce job that accesses the protected output from the previous command,
then writes the accessed output files to the following directory in HDFS (relative to your HDFS
home directory):
voltage/accessed-sample-data
NOTE: The output is also copied to your local sampledata directory, where you can use it
to validate that your original data is restored.
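To spot-check either result directly in HDFS, you can list and sample the output files; the
part-file naming shown here is the usual MapReduce convention and may vary on your cluster:
hdfs dfs -ls voltage/protected-sample-data
hdfs dfs -cat voltage/protected-sample-data/part-* | head -n 5
hdfs dfs -cat voltage/accessed-sample-data/part-* | head -n 5    # after running the access job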
5 Hive Integration
The Hadoop Developer Templates demonstrate how to integrate Voltage SecureData data
protection technology in the context of Hive. This demonstration includes the use of the Simple
API (version 4.0 and greater) and the REST API.
To make use of the data stored in Hadoop, users typically run data analytics applications on
computers outside of the Hadoop cluster. Those applications connect to the Hadoop cluster
using connections such as Open Database Connectivity (ODBC) or Java Database Connectivity
(JDBC) to execute Hive queries. These queries can work seamlessly with data that has been
encrypted or tokenized using the Voltage SecureData APIs. However, if an application needs
access to the plaintext, you can configure the Hadoop cluster to access the protected data. This
eliminates the need to install additional software on the computer running the analytics
application.
This integration relies on a set of Java classes that implement several Hive user-defined
functions (UDFs) for protecting and accessing data using Voltage SecureData APIs, as well as
the Java packages in the common infrastructure. This chapter provides a description of the
former as well as instructions on how to run the Hive template using the provided sample data.
For more information about the common infrastructure used by the Hadoop Developer
Templates, see Chapter 3, “Common Infrastructure”.
• History of Hive Support in the Hive Developer Template on page 5-2 - This section
provides information about how the Hive Developer Template has changed over the
previous several releases. These changes have occurred in response to changes in Hive
itself, as well as with the addition of other Hadoop components designed to improve
upon Hive’s performance, such as Impala and LLAP.
• Different Types of Hive UDFs on page 5-4 - This section explains the Java classes that
are specific to the Hive template as well as information about the batch processing
limitations inherent to the Hive template.
• Integration Architecture of the Hive Template on page 5-12 - This section explains the
Java classes that are specific to the Hive template as well as information about the batch
processing limitations inherent to the Hive template.
• Configuration Settings for the Hive Template on page 5-21 - This section reviews the
configuration settings that are relevant to the Hive template and provides an example
of how and why you would want to change those settings as you adapt the template to
your own use.
• Running the Hive Developer Template on page 5-23 - This section provides
instructions for running the Hive template using the provided sample data.
As the supported Hadoop distributions have issued new releases, the newer versions have
included newer versions of Hive, necessitating changes to the Hive Developer Template as new
versions of the Voltage SecureData for Hadoop Developer Templates have been released.
Further, newer versions of the Hadoop Developer Templates have added support for other
Hadoop components related to Hive, such as Impala and LLAP (Live Long and Process). This
section provides a brief history of those changes and summarizes how the Hive Developer
Template supports these various components.
Using the Hive command line, you could create and use protect and access UDFs that extend
the Hive class UDF as either temporary or permanent UDFs.
Brief instructions related to using the Hive command line remain in this version of the Hive
Developer Template documentation:
• Creating and Calling Hive UDFs from the Hive Prompt (page 5-14)
Because jobs executed by HiveServer2 run as the hive system user, its doAs impersonation
property became very important, necessitating the introduction of a more complex, but more
capable, version of the protect and access UDFs that extend a different Hive class:
GenericUDF. When the doAs property is set to false, this new type of UDF is required in
order for Voltage SecureData authentication and authorization to work properly.
• Using the Generic Hive UDFs When Impersonation is Disabled (page 5-40)
Version 4.0 of the Hive Developer Template also introduced support for Apache Impala as a
way to create and run high-performance UDFs to query the Hive metastore database. This
scenario has running-as-user considerations similar to HiveServer2, but with the impala user
instead of the hive system user, and the added complication that you cannot use generic UDFs
with Impala due to incompatible method signatures.
For more information about these components and the related considerations, see the
following sections in this documentation:
• Major Changes in Hive 3.0 and the Script Changes They Required (page 5-18)
NOTE: Support for running Hive queries in the context of LLAP did not require any code
changes to the relevant Hive Developer Template Java classes (BaseHiveGenericUDF,
ProtectDataGeneric, and AccessDataGeneric). Therefore, although LLAP testing
began with version 4.1, any version of the Hive Developer Templates with these classes,
beginning with version 3.2, should work as is with LLAP.
For more information about executing Hive queries in the context of LLAP, see the following
section in this documentation:
For more information about using the Hive UDFs for unstructured binary data, see the
following section in this documentation:
As described in “Java Classes for the Hive UDFs” (page 5-13), the Hive Developer Template
provides Java source code that defines eight different classes that you can use to define
permanent and temporary UDFs with the create function and create temporary
function SQL commands, respectively. In the scripts provided with the Hive Developer
Template, such as create-hive-udf.hql and create-hive-perm-udf.hql, these SQL
commands are used to create UDFs with the same names as their corresponding classes, but
using all lowercase letters. For example:
create function
protectdata as 'com.voltage.securedata.hadoop.hive.ProtectData'
using
jar 'hdfs:///user/<username>/voltage/hiveudf/vibesimplejava.jar',
jar 'hdfs:///user/<username>/voltage/hiveudf/voltage-hadoop.jar';
Where <username> is the name of the user who will run the Hive queries.
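For comparison, a temporary (session-scoped) UDF can be created from the same class. The
following sketch runs the statements from the Hive prompt via hive -e, and assumes a Hive
version that accepts HDFS URIs with ADD JAR; the table name is a placeholder:
hive -e "
  ADD JAR hdfs:///user/<username>/voltage/hiveudf/vibesimplejava.jar;
  ADD JAR hdfs:///user/<username>/voltage/hiveudf/voltage-hadoop.jar;
  CREATE TEMPORARY FUNCTION protectdata AS 'com.voltage.securedata.hadoop.hive.ProtectData';
  SELECT protectdata(ssn, 'ssn') FROM <your_table> LIMIT 5;
"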
Continuing with that convention here, assume the creation of all eight possible Hive UDFs,
paired along three axes: A) protect operations versus access operations, B) formatted data
versus binary data, and C) extends the Hive class UDF versus extends the Hive class
GenericUDF:
Two of these axes divide the UDFs in ways that are more obvious (protect versus access) or are
explained in full elsewhere (extends the Hive class UDF versus extends the Hive class
GenericUDF; see "Using the Generic Hive UDFs When Impersonation is Disabled" on page 5-
40).
The third axis of differentiation (formatted data versus binary data) is worthy of additional
explanation in the remainder of this section:
NOTE: SST formats are only supported for the REST API.
Using the class-name-in-lowercase naming convention, the Hive Developer Template provides
four UDFs for working with formatted data:
These two pairs of UDFs have slightly different function signatures, but both pairs take the
same first two required parameters:
1. The formatted data to be protected or accessed, often in the form of a database column
name.
2. The name of a cryptId element in the configuration file vsconfig.xml that contains
information governing the protect or access operation to be performed.
CryptIds used for formatted data cryptographic operations must specify a format other
than AES, which is reserved for binary data cryptographic operations; otherwise the
following runtime exception will occur:
The UDFs in the first pair, protectdata and accessdata, also take an optional third
parameter, the API type, which, if present, will override the API specified for the cryptId (either
globally or with a cryptId-specific setting). Valid choices are: simpleapi and rest.
All four of these UDFs return the resulting ciphertext (protect operations) or plaintext (access
operations) as their return value.
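As an illustration of these signatures, the following Beeline sketch protects one column using
the API configured for its cryptId and a second column with the optional third parameter
overriding that API; the JDBC URL, table, and column names are placeholders, while the
cryptId names ssn and cc match the sample configuration:
beeline -u "jdbc:hive2://<hiveserver2-host>:10000/default" -e "
  SELECT protectdata(ssn, 'ssn'),
         protectdata(cc_number, 'cc', 'rest')    -- third parameter overrides the cryptId's API
  FROM customer LIMIT 10;
"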
Using the class-name-in-lowercase naming convention, the Hive Developer Template provides
six UDFs for working with binary data:
These three pairs of UDFs have slightly different function signatures, but all three pairs take
the same first two required parameters:
1. The binary data to be protected or accessed, often in the form of a database column
name.
2. The name of a cryptId element in the configuration file vsconfig.xml that contains
information governing the AES protect or access operation to be performed. IBSE/AES
operations use the special format keyword AES as the value of the format attribute of
all cryptId elements defined for use with the binary Hive UDFs.
CryptIds used for binary data cryptographic operations must specify the format AES;
otherwise the following runtime exception will occur:
AES (binary data) operations cannot be performed for regular
FPE (non-'AES') format: CryptId [<cryptId-details>]
NOTE: The identity attribute of the cryptId element is only relevant for protect
operations, which include the specified identity as part of the full identity packaged
with AES ciphertext to comprise the full IBSE ciphertext payload. This means that the
identity specified in a cryptId used for an access operation is ignored. Instead, the
required identity is retrieved from the IBSE/AES ciphertext itself.
Also note that the following attributes of the cryptId element are not relevant for
“AES” cryptIds and will be ignored if present:
• translatorClass
• translatorInitData
The UDFs in the first pair (protectbinarydata and accessbinarydata) and the third pair
(protectbinarydataimpala and accessbinarydataimpala) also take an optional third
parameter, the API type, which, if present, will override the API specified for the cryptId (either
globally or with a cryptId-specific setting). Valid choices are: simpleapi and rest.
All six of these UDFs return the resulting ciphertext (protect operations) or plaintext (access
operations) as their return value.
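A hedged usage sketch for the binary UDFs follows; it assumes a cryptId named aes whose
format attribute is AES, plus placeholder table and column names. For STRING input, the
returned ciphertext is Base64-encoded, as described later in this section:
beeline -u "jdbc:hive2://<hiveserver2-host>:10000/default" -e "
  SELECT protectbinarydata(comments, 'aes') FROM support_tickets LIMIT 10;
  SELECT accessbinarydata(protected_comments, 'aes', 'simpleapi') FROM support_tickets LIMIT 10;
"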
If you have chosen to use the REST API when using the Binary Hive UDFs, be aware that, by
default, the Voltage SecureData Server limits the size of its REST payload to 25 MB, which
includes the JSON syntax required to format the REST request. This is done to prevent
performance degradation.
NOTE: The Simple API, the default API, is strongly recommended over the REST API
when performing binary protect and access operations of large data, such as images and
video. It is not going to be efficient to send this type of plaintext and ciphertext data
across the network for processing.
If you exceed the Voltage SecureData Server’s size limit for Web Service data, you will see
the following generic socket exception on the client and message in the debug.log file for
the Web Service:
If you run into this limit, there are two ways to address it:
1. Switch this binary Hive UDF to use the Simple API by setting the api attribute of the
relevant cryptId element to simpleapi (or remove the api attribute from that
cryptId when the default API is set to simpleapi). This is the recommended
solution.
2. Get your Voltage SecureData administrator to increase the Web Service data size
limit for the Voltage SecureData Server.
While some Voltage SecureData IBSE APIs allow a choice of AES modes, the binary Hive
UDFs always use the Cipher Block Chaining (CBC) mode when protecting plaintext.
Interoperability with respect to the AES mode used to encrypt plaintext is achieved for
these APIs by recording the AES mode in the full identity that is included in the IBSE
envelope that accompanies the AES ciphertext. This allows the binary Hive UDFs to
properly decrypt IBSE ciphertext, even when it was encrypted using a different AES mode,
such as EMES.
The binary Hive UDFs operate with both the BINARY data type and the STRING data type.
If the input data to a binary Hive UDF is determined to be of type BINARY, such as for
images and audio/video clips, then the raw bytes in that data will be protected or accessed.
The input bytes are passed, as is, to the specified API (the Simple API or the REST API) and
the resulting output bytes are returned as follows:
• For protect operations, the resulting IBSE/AES ciphertext bytes are returned directly,
without Base64 encoding.
• For access operations, the resulting recovered plaintext, decrypted using the identity
and AES mode from the full identity packaged with the IBSE ciphertext input data, is
returned.
NOTE: When running the binary Hive UDFs in the context of Impala, note that the
BINARY data type is not supported. For more information about this Impala limitation,
see "Data Type Limitation" (page 5-21).
If the input data to a binary Hive UDF is determined to be of type STRING, such as for free-
form comments and SMS messages, then the input bytes from that input string data will be
protected or accessed with extra steps, including Base64 encoding and decoding to assure
that the ciphertext can be stored in that same (or a different) STRING column or variable, as
follows:
• For protect operations, the bytes from the input plaintext string are retrieved as
UTF-8. The bytes in this UTF-8 string are passed to the specified API (the Simple
API or the REST API) and the resulting output AES ciphertext bytes, which contain
the full identity constructed from the identity provided in the specified cryptId (the
second parameter to Hive UDFs) and the AES mode used for encryption (CBC), are
then Base64-encoded into the final output, a Java String.
• For access operations, the input ciphertext string is Base64-decoded to retrieve the
IBSE/AES ciphertext bytes, which are then passed to the specified API (the Simple
API or the REST API) for decryption using the identity and AES mode from the
contained full identity. The resulting bytes are a UTF-8 encoded version of the
original plaintext string, which is then used to build the output Java String to be
returned.
Ciphertext Expansion
When you use Voltage SecureData IBSE/AES encryption, the ciphertext is always larger
than the plaintext, based on the length of the identity specified for cryptographic key
derivation. This is because the full identity, which includes the specified identity and other
fields, such as the AES mode used for encryption (CBC), is included in the IBSE envelope
included with the ciphertext itself.
For BINARY plaintext data, this accompanying information will cause the ciphertext to be
from between 140 to 185 bytes larger than the plaintext, a modest increase when the data
is an image or an audio/video clip.
For STRING plaintext data, the IBSE envelope overhead remains the same (140-185
additional bytes), but the Base64 encoding of the IBSE ciphertext adds very close to an
additional 33% to the size of the final ciphertext (every 3 bytes of data is expressed as 4
bytes when Base64-encoded, plus possibly one or two added dummy bytes to make the
total number of Base64-encoded output bytes divisible by 4). You must take this into
account when storing the result of a protect operation in a database column or SQL variable
in order to avoid errors or worse, truncation of the ciphertext, making recovery of the
original plaintext impossible.
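To make the arithmetic concrete, the following sketch estimates the stored size for a
hypothetical 300-byte STRING plaintext, using a mid-range envelope size of 160 bytes (the
actual overhead falls between roughly 140 and 185 bytes):
plain=300; envelope=160
ibse=$((plain + envelope))            # raw IBSE/AES ciphertext bytes
b64=$(( (ibse + 2) / 3 * 4 ))         # length after Base64 encoding (4 output characters per 3 input bytes)
echo "$plain plaintext bytes -> $ibse IBSE bytes -> $b64 Base64 characters"    # prints 300 -> 460 -> 616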
Interoperability with other Voltage SecureData APIs that support the protection of binary
data depends on a couple of important factors:
First, while different Voltage SecureData APIs might or might not allow a choice of AES
modes when protecting binary data (the binary Hive UDFs do not allow a choice, always
using the CBC mode), interoperability between these clients is achieved in this regard by
including enough information as part of the IBSE ciphertext to make that ciphertext self-
describing with respect to the information needed to decrypt it. This includes the AES mode
used to encrypt the AES ciphertext as well as the identity required to derive the relevant
AES cryptographic key.
Second, due to the nature of their data storage mechanisms and/or message format, some
Voltage SecureData APIs perform Base64 encoding of their IBSE/AES ciphertext. For
example:
• The REST API always Base64 encodes the IBSE/AES ciphertext it returns for a
protect operation because it needs to be sure that ciphertext byte values are
represented using characters appropriate for the JSON syntax used in the HTTP
response. This API even requires its binary plaintext to be protected to be Base64-
encoded for the same reason (regardless of whether it is really just string data): so
that it can be safely transported in the JSON syntax of an HTTP request.
• Database-based APIs, such as the Teradata UDFs and the Hive UDFs (being
described here), selectively Base64 encode the IBSE/AES ciphertext they return,
based on the data type of the data they are protecting and the fact that the IBSE/
AES ciphertext will be returned as the same data type. In particular, some byte
values in the IBSE/AES ciphertext might not be acceptable in string data types, an
issue solved by Base64 encoding the IBSE/AES ciphertext.
Likewise, the Hive UDFs Base64 encode IBSE/AES ciphertext originating from
STRING data so that it can safely be returned as STRING data. IBSE/AES ciphertext
originating from BINARY data is not Base64-encoded because any byte value will be
acceptable in the BINARY data being returned.
• The Simple API, on the other hand, expects its plaintext as an array of bytes when
performing an IBSE/AES protect operation (unsigned char* in C and byte[] in
Java and C#). When the data to be protected is inherently binary, such as an image
or video clip, it will already be an array of bytes. When the data to be protected is a
string, such as a free-form text field, you must make sure that the characters in the
string are represented using the UTF-8 encoding when they are interpreted as a
sequence of bytes (UTF-8 is the character encoding used by all Voltage SecureData
APIs when retrieving bytes from a string to encrypt using IBSE/AES).
The Simple API does not do any Base64 encoding or decoding as an automatic part
of protect and access operations, respectively (although it does provide separate
APIs for performing Base64 encoding and decoding). Therefore, when using the
Simple API to create IBSE/AES ciphertext that will be accessed by other Voltage
SecureData APIs, you must consider whether those other APIs will expect that
ciphertext to also be Base64-encoded, and if so, which component is responsible for
that separate step, and the mechanism by which that component will know whether
the Base64 encoding step should be performed (such as yes for string data but no
for binary data).
Kerberos authentication is not (yet) available for the underlying IBSE/AES APIs used by the binary Hive UDFs. If you configure Kerberos as the authentication method for a cryptId that specifies AES as its format, you will get an exception at runtime.
NOTE: Updating to a newer version of the Simple API that supports Kerberos
authentication for IBSE/AES cryptographic operations will automatically allow the binary
Hive UDFs to use Kerberos authentication for that API. In order to use the 4.2 version of
the binary Hive UDFs with a newer version of the REST API that supports Kerberos
authentication for IBSE/AES cryptographic operations, you will need to make a minor
source code modification to the Hive Developer Template source code and rebuild. For
more information about the required change, contact Micro Focus Data Security support.
The Java package com.voltage.securedata.hadoop.hive and its associated Java source code provide the classes that implement the Hive integration.
The Hive Developer Template demonstrates the integration of Voltage SecureData data
protection processing through the use of Hive UDFs. UDFs provide a ready-to-use integration
mechanism that is implemented by extending the Hive base classes UDF and GenericUDF,
providing an implementation of their method evaluate.
NOTE: The more complex “generic” versions allow dynamic retrieval of the user running a query by using the Hive SessionState API. This is useful in contexts where the query is executed as the hive system user, such as when using Beeline/HiveServer2 with the HiveServer2 setting doAs set to false.
• Formatted text data (FPE and SST) when run as a particular user and when run as the
Hive system user and HiveServer2 impersonation is enabled (HiveServer2 setting
doAs is set to true):
Classes:
com.voltage.securedata.hadoop.hive.ProtectData
com.voltage.securedata.hadoop.hive.AccessData
• Formatted text data (FPE and SST) when run as the Hive system user and HiveServer2 impersonation is disabled (HiveServer2 setting doAs is set to false):
Classes:
com.voltage.securedata.hadoop.hive.ProtectDataGeneric
com.voltage.securedata.hadoop.hive.AccessDataGeneric
• Binary data (IBSE/AES) when run as a particular user and when run as the Hive system
user and HiveServer2 impersonation is enabled (HiveServer2 setting doAs is set to
true):
Classes:
com.voltage.securedata.hadoop.hive.ProtectBinaryData
com.voltage.securedata.hadoop.hive.AccessBinaryData
• Binary data (IBSE/AES) when run as the Hive system user and HiveServer2
impersonation is disabled (HiveServer2 setting doAs is set to false):
Classes:
com.voltage.securedata.hadoop.hive.ProtectBinaryDataGeneric
com.voltage.securedata.hadoop.hive.AccessBinaryDataGeneric
Much of the important, shared logic is implemented within the helper class HiveUDFHelper.
Code in this class supports the client/server architecture of Beeline/HiveServer2 by performing
per-user caching of configuration settings with a short (10 second) refresh interval. This
ensures that:
• Users are using their own configuration settings, whether this is controlled by the HiveServer2 setting doAs being set to true or by the generic variants of the Hive UDFs being used (see "Using the Generic Hive UDFs When Impersonation is Disabled" on page 5-40).
• Any (per-user) configuration changes are picked up quickly without forcing a refresh
during a multi-UDF query. Per-user caching combined with the regular and generic
forms of the Hive UDFs allows full support for multi-user/multi-session cases under
Beeline/HiveServer2, with each user loading and using their own current configuration
settings.
Like all of the Hadoop Developer Templates, the Hive Developer Template uses a number of
the packages in the shared integration architecture, including the crypto package abstraction
layer (com.voltage.securedata.crypto) that isolates calls to the Voltage SecureData
APIs that perform the actual cryptographic operations (the Simple API and the REST API). For
more information, see "Shared Integration Architecture" (page 3-54) and "Cryptographic
Abstraction" (page 3-68).
1. At the Hive prompt, add the JAR files required by the UDFs:
hive> add jar ../simpleapi/vibesimplejava.jar;
hive> add jar voltage-hadoop.jar;
NOTE: The JAR file voltage-hadoop.jar is built as an uber JAR file that contains
vsrestclient.jar and its JSON and HTTP Client library dependencies.
2. Specify the ProtectData and AccessData classes as UDFs (shown at two prompts to
enhance readability):
hive> create temporary function accessdata as
> 'com.voltage.securedata.hadoop.hive.AccessData';
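The shipped script create-hive-udf.hql contains the full set of these statements. For example, the corresponding statement for the protect UDF, using the ProtectData class and the protectdata function name used in the queries later in this chapter, is:
hive> create temporary function protectdata as
> 'com.voltage.securedata.hadoop.hive.ProtectData';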
3. Run HQL queries against the data in the Hive tables. For example:
hive> SELECT id, name, accessdata(cc, 'cc')
> FROM voltage_sample WHERE id <= 5;
NOTE: In this example, the second parameter, 'cc', is a cryptId name, which acts as a way to
look up a specific group of settings in the configuration file vsconfig.xml. It can be the
same as the column name, but it does not necessarily have to be the same.
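For example, in the sample data the birth_date column is accessed with the cryptId date, so a query that uses a cryptId different from the column name follows the same form as the query above:
hive> SELECT id, name, accessdata(birth_date, 'date')
> FROM voltage_sample WHERE id <= 5;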
The script run-hive-join-query.hql performs all three of these steps, including a table JOIN query at the end.
NOTE: The Hive Developer Template, as shipped, does not assume that the Simple API and
the REST API support FPE2 extended characters, as suggested by the commented out call
to the UDF accessdata with the cryptId extended, above. To access the name field using
the cryptId extended, remove the comment designators (--) from the beginning of that
line (and to avoid retrieving the ciphertext version as well, remove the s.name, from the
line above).
When UDF calls use the REST API, each accessed data row in the query requires a network round-trip. This will have a significant performance impact if the query processes a large number of rows. Try to limit any such UDF calls to a relatively small number of processed rows, such as by using a WHERE clause or other such filter.
For UDF calls that use the Simple API, there is no such performance impact because each Simple API operation is performed locally as an individual call. However, because the Simple API is loaded through JNI, UDF calls on literal values are subject to class loader issues.
For example, the following UDF call on a literal value (not a column), using the Simple API, will
fail:
hive> SELECT id, name, email FROM voltage_sample
> WHERE email = protectdata('[email protected]',
> 'alpha');
The error message indicates a failure to initialize the Simple API, with the following JNI error:
java.lang.UnsatisfiedLinkError: com.voltage.toolkit.
vtksimplejavaJNI.LibraryContext_LIB_CTX_NO_STORAGE_get()J
If you run into this situation, where you want to call the UDF on a literal input value instead of a
column, you have two options:
Option 1:
Use the REST API for the UDF calls on literal values. Since these Web Services calls do
not use JNI, they do not have any class loader issues. Note that the protect and access
UDFs have an advanced feature that allows you to pass an optional third argument to
the call, explicitly specifying the API type: either rest or simpleapi. If provided, this
explicit API type overrides the default one specified for the cryptId in the configuration
file vsconfig.xml. If you pass in rest for the third parameter in the UDF call, the data
protection processing will be performed using the REST API.
NOTE: Since this is a single protect or access operation on a literal value, there is no
significant performance impact for using the REST API for this operation. Only one
value is being processed in this case, so multiple remote calls are not required.
For example:
hive> SELECT id, name, email FROM voltage_sample
> WHERE email =
> protectdata('[email protected]', 'alpha', 'rest');
Alternatively, you can define a separate cryptId whose default API in the configuration file vsconfig.xml is the REST API. For example:
hive> SELECT id, name, email FROM voltage_sample
> WHERE email =
> protectdata('[email protected]', 'alpha-rest');
You may need to do this if you have a custom translator (such as the class SimpleAPIDateTranslator) configured for the cryptId when using the Simple API, but that is not appropriate when using the REST API. A good solution is to define duplicate cryptIds for use with a given column, each specifying a different API, with the one using the Simple API also specifying the required translator class.
Option 2:
Copy the two JAR files used by the UDFs (vibesimplejava.jar and voltage-hadoop.jar) into the hive/lib directory on the relevant nodes, so that the Simple API’s native library is loaded by the correct (parent) class loader. Once these two JAR files are copied to these hive/lib directories, you will probably
need to restart the Hive service in order for it to find and use the JAR files you have
added. This is true when using Beeline and when using some newer versions of the Hive
prompt, which operate using HiveServer2. For some older versions of the Hive prompt,
starting up the Hive prompt may be sufficient. In any event, continue by following the
usual steps to create and call the relevant Voltage SecureData UDF. This time, the call to
the UDF for literal values will work, even when using the Simple API, because the native
library is loaded from the correct (parent) class loader.
Major Changes in Hive 3.0 and the Script Changes They Required
Some newer Hadoop distributions, such as HortonWorks HDP 3.0.0, have upgraded to Hive
3.0, which includes important architectural changes that affect the Hive Developer Template.
For example, in previous versions of Hive, running the Hive command line would launch a local
session and perform basic operations relative to the node upon which it was started. In previous
versions of the Voltage SecureData for Hadoop Developer Templates, the Hive Developer
Template relied on this fact in its use of resources on the local node’s file system (such as JAR
files and input data files).
In Hive 3.0, running the Hive command line is equivalent to running Beeline: sessions automatically connect to the remote HiveServer2 service. This means that any references to local files would resolve relative to the remote node that is running the HiveServer2 service, not to the current local node.
There are also changes to how tables are created and managed in Hive 3.0. While the two table
types, managed and external, are not new in Hive 3.0, the access control of managed tables has
been restricted such that only the Hive service can freely access and manipulate the data in
managed tables. In addition, the default file format for managed tables is now Optimized Row
Columnar (ORC).
These changes in Hive 3.0 affect the Hadoop Developer Templates in the following ways:
• The JAR files and input data files used by the Hive Developer Template can no longer
reside on the local file system of the node on which the Hive command line is launched.
The solution is to copy these files to well-known locations in HDFS (this same solution is
used elsewhere in the Hadoop Developer Templates).
• Unless they are granted full permissions to the HDFS table directory (/warehouse/
tablespace/managed/hive), no user other than Hive can create managed tables
using the CREATE TABLE command.
• The Hive Developer Template has always used CSV files as the input data file type. However, unless you change the setting hive.default.fileformat.managed from the default file format ORC to the file format TEXTFILE, the LOAD DATA command will not be able to load the Hive Developer Template’s CSV input data files (see the example following this list).
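For example, assuming your distribution allows this setting to be changed at the session level (otherwise, change it cluster-wide, for example through Ambari or hive-site.xml), you could switch the default managed-table format before creating and loading the tables:
hive> SET hive.default.fileformat.managed=TEXTFILE;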
In order to accommodate the possibility that you are using a Hadoop distribution that uses
Hive 3.0, with the types of configurations described above, the script create-hive-
table.hql in the Hive Developer Template includes alternative, commented out, table
creation commands that create external tables with explicit HDFS locations specified for the
table data. The specification of explicit HDFS locations allows the cleanup script to find the
table data files for explicit deletion, something not done automatically when an external table is
dropped. For example, the first of the three commented-out table creation commands in this
HQL script is as follows, with the relevant comment-designators and keyword changes
highlighted:
-- CREATE EXTERNAL TABLE
-- voltage_sample
-- (id INT, name STRING, street STRING,
-- city STRING, state STRING, postcode STRING,
-- phone STRING, email STRING, birth_date STRING,
-- cc STRING, ssn STRING)
-- ROW FORMAT DELIMITED FIELDS TERMINATED BY ","
-- LOCATION '/user/<username>/voltage/hive-sample-tables/volt...
If you are using Hive 3.0, as will be the case if you are using HDP 3.0.0, when you are editing
this script to replace instances of the <username> placeholder, as described in "Edit the Hive
Scripts to Replace <username> Placeholder" (page 5-24), you will want to comment out the
original three CREATE TABLE commands and uncomment the three replacement CREATE
EXTERNAL TABLE commands (and replace their instances of the <username> placeholder).
NOTE: When the configuration setting doAs is set to false and all Hive queries are run as the hive system user, you must make sure that the hive user has full permission to access:
• The JAR files and input data files required by the Hive Developer Template, and,
• The directory specified by the LOCATION directive shown above, to which the table
data files will be moved when the tables are created.
With respect to the latter permissions, you can omit the LOCATION directive altogether, resulting in the table data files being moved to the default external table location: /warehouse/tablespace/external/hive. However, if you do this, the cleanup script will fail to find the table data files for explicit deletion.
Apache Impala is designed to:
• Avoid start-up overhead by utilizing daemon processes that are always running,
• Distribute work so that the daemon processes work on data that exists in the local file
system, avoiding network overhead, and,
• Avoid MapReduce jobs, such as those used during most Hive queries.
Single instances of two additional types of Impala daemons, the Impala Statestore daemon and
the Impala Catalog Service daemon, help manage Impala queries.
The Hadoop Developer Templates include scripts for creating and running Impala UDFs. For information about running these scripts to perform Impala queries, see "Running Queries Using Apache Impala" (page 5-44).
Authentication Limitation
The biggest limitation when using Impala concerns fine-grained, role-based authentication. Without also using Apache Sentry, all Impala jobs are run as the user impala. And if Sentry is enabled for Impala, it must also be enabled for Hive, which then prevents the HiveServer2 setting doAs from being set to true (the default), thereby preventing queries from being run as the current user rather than the hive user (for more information about this setting, see "Enable Impersonation for Beeline and Remote Hive Queries" on page 5-26).
This Impala limitation means that if Sentry is not used, regardless of which user is running
the Impala shell or another Impala command, Impala does all read and write operations with
the privileges of the impala user, who must therefore have the appropriate read and read/
write access to the related resources, such as JAR files and output directories, respectively.
If you intend to use Impala in a production environment, and given the security
requirements typically required, you will need to decide about the right compromises with
respect to the use of Sentry for Impala authorization versus the use of Hive impersonation
using the HiveServer2 doAs setting.
Also note that the Generic Hive UDFs, as described in "Using the Generic Hive UDFs When
Impersonation is Disabled" (page 5-40), cannot be used with Impala due to incompatible
function signatures.
For information about running the scripts associated with Impala, see "Running Queries
Using Apache Impala" (page 5-44).
The BINARY data type is not supported by Impala for table columns, UDF arguments, and
UDF return values. Since binary data cannot be stored in an Impala table, the Hive binary
UDFs cannot be used to protect binary data when using Impala.
However, Impala does support the STRING data type, which is the other input (and output)
data type supported by the Voltage SecureData binary Hive UDFs. This offers the
possibility of several types of work-arounds, one of which is demonstrated by the Impala scripts that create and populate an Impala table with protected binary data (create-impala-table.sql), and that access that protected binary data (run-impala-binary-query.sql). These scripts work with Base64-encoded binary data, as a STRING.
NOTE: If you choose to implement this approach to protecting and accessing binary data
using Impala, you must bear in mind several Impala limitations that apply:
• Both the STRING data type and entire rows of data are limited to 2 GB. If your
binary plaintext to be protected is Base64-encoded to be stored in an Impala table
as a STRING, it will have already grown by approximately 33%. As described
elsewhere in this guide (see "Ciphertext Expansion" on page 5-10), protecting a
STRING using the Voltage SecureData binary Hive UDFs will add an additional 33%
for Base64 encoding the STRING AES ciphertext (and its IBSE envelope) to the
starting size of the plaintext. This double Base64 encoding adds its 33% overhead
to the original binary data twice by the time it is protected. Given that the
ciphertext result is returned as an Impala STRING, this results in an effective
maximum size of the original binary data that is closer to 1 GB than to 2 GB.
• If your data is going to be stored in a Parquet file, the limit is even lower, 1 GB,
and you will need to make the same type of calculations with respect to ciphertext
expansion.
There are two classes of configuration settings used by the Hive Developer Template: the authentication settings in the XML configuration file vsauth.xml and the data protection settings (including cryptIds) in the XML configuration file vsconfig.xml.
NOTE: The third class of configuration settings required by the Hive Developer Template in releases prior to version 4.1 is no longer necessary. This third class was
previously used to define aliases, which served much the same purpose as cryptIds in
version 4.1 and higher. In versions of the Hive Developer Template prior to version 4.1, the
second parameter to Hive UDFs was the name of an alias defined in the configuration file
vsconfig.properties (or vshive.properties). Beginning with version 4.1, the
second parameter to Hive UDFs is the name of a cryptId, specified in the XML configuration
file vsconfig.xml (or vshive.xml).
Before you begin to modify the Hive Developer Template XML configuration files for your own
purposes, such as using your own Voltage SecureData Server or different data formats, Micro
Focus Data Security recommends that you first run the Hive Developer Template as provided,
giving you assurance that your Hadoop cluster is configured correctly and functioning as
expected.
You will need to update many of the values in the XML configuration files vsauth.xml and
vsconfig.xml in order to protect your own data using your own Voltage SecureData Server.
• The emphasis in the paragraph above, before you start protecting Social Security
numbers, highlights the fact that you cannot protect the SSN plaintext using the cryptId
ssn and then expect to access the SSN ciphertext using the cryptId ssn-sst.
• The new cryptId has been given a different name: ssn-sst. While not strictly required (you could have modified the existing cryptId), this was done to make the type of operation clearer within any scripts you might use to call a Hive Developer Template UDF with this cryptId (ssn-sst) as its second parameter (see the example following this list).
• The format attribute of the new cryptId is set to ssn-tokens, which is an SSN
tokenization format defined on the public-facing Voltage SecureData Server
dataprotection maintained by Micro Focus Data Security.
• The api attribute of the new cryptId is set to rest, indicating that the REST API should
be used to perform the tokenization.
NOTE: In order to use the REST API, your Voltage SecureData Server must be
version 6.0 or higher.
• Finally, it is important to note that now that you have switched SSN protection from local
FPE to remote SST, the same performance warning as given in the comments for credit
card data now applies to SSNs as well.
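For example, because the new cryptId specifies the REST API, it can even be used to tokenize a literal value directly (the SSN shown here is illustrative, and the protectdata UDF must already have been created as described later in this chapter):
hive> SELECT protectdata('123-45-6789', 'ssn-sst');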
NOTE: Remember that if you edit the versions of the configuration files vsauth.xml or
vsconfig.xml on the local file system, you must copy the updated versions to HDFS,
which is where the Hadoop Developer Templates will access them when they are run. For
instructions, see "Loading Updated Configuration Files into HDFS" (page 3-77).
The bin directory includes scripts that you can run to create Hive tables and run Hive queries,
including scripts that use Apache Impala. Also note that with the addition of support for
Beeline/HiveServer2 in the Hive Developer Template, the Hive command line has been de-
emphasized.
This section provides instructions for setting up to run, and then running, the Hive UDFs
provided by the Hive Developer Template in a variety of ways. This includes as temporary
UDFs, as permanent UDFs, from the Hive command line, using Beeline and HiveServer2 both
with and without impersonation, and running the Hive UDFs from a remote computer using
ODBC and JDBC.
• Performing some common HDFS setup steps required by all of the Hadoop Developer
Templates. For more information, see “Common HDFS Procedures for the Hadoop
Developer Templates” (page 3-75).
• Running the MapReduce Developer Template to create a CSV file with protected values
that can be loaded into the Hive metastore database. For more information, see
“Running the MapReduce Template” (page 4-6).
NOTE: By adopting this approach, rather than including a pre-packaged Hive input
file with the Voltage SecureData for Hadoop Developer Templates installation, the
Hive template will work just as well with your own Voltage SecureData Server as with
the public-facing Voltage SecureData Server dataprotection, the default Voltage
SecureData Server specified in the configuration file vsconfig.xml.
After completing these steps to create the input data expected by the Hive Developer
Template, several other Hive-specific setup steps are required, as explained in the following
sub-sections:
• Copy the Required JAR Files to Your Data Nodes (page 5-25)
• Create the Hive Tables for the Hive Developer Template (page 5-25)
• Enable Impersonation for Beeline and Remote Hive Queries (page 5-26)
The Hive scripts in the bin directory that require editing to replace the <username> placeholder are:
• create-hive-table.hql
• create-hive-udf.hql
• create-hive-perm-udf.hql
• run-hive-join-query.hql
• run-hive-binary-query.hql
Edit these Hive scripts using a text editor of your choice and change all non-comment
instances of the <username> placeholder to the actual username.
The hive/lib directory is where the Hive service is installed on the data nodes in your cluster.
For example, for Hortonworks HDP 2.6.4, the Hive service is installed in the directory /usr/
hdp/2.6.4/hive/lib.
NOTE: As is the case for all of the Hadoop Developer Templates, the full Simple API package
(including the file libvibesimplejava.so and the trustStore directory) must be
installed on all nodes in your Hadoop cluster that will be running the UDF. In the case of
HiveServer2 and simple (non-join) queries, that would usually be the node(s) in the cluster
running the HiveServer2 service itself. But because the UDFs may launch MapReduce jobs in
some Hadoop distributions, it is important to install the Simple API package on all data
nodes in your Hadoop cluster, not just these HiveServer2 nodes.
And, as usual, the Simple API installation location must be the same on all data nodes, as
specified using the installPath attribute of the simpleAPI element in the configuration
file vsconfig.xml.
This creates several Hive tables that contain the protected data, including protected
binary data, used in subsequent queries using the Hive Developer Template.
NOTE: In some environments, particularly when access to the Hive tables is secured using
Sentry or Ranger, impersonation is disabled by setting the property doAs to false. In this
case, because the jobs will be run as the hive system user, the regular Hive UDFs provided
with the Hive Developer Template cannot determine the user running the query. The Hive
Developer Template addresses this situation by providing two additional UDF
implementations, ProtectDataGeneric and AccessDataGeneric. For more information,
see "Using the Generic Hive UDFs When Impersonation is Disabled" (page 5-40).
Depending on your Hadoop distribution, you can view and change the doAs setting for the
HiveServer2 service using Ambari or the Hive configuration XML file.
NOTE: The specific location of this setting depends on your Hadoop distribution. For
example, under HDP 2.6, it is in the Advanced hive-interactive-site section on the
Advanced sub-tab, with the setting name Run as end user instead of Hive user.
3. If you changed this value, click Save at the bottom of the page, and then restart the
HiveServer2 service.
1. Locate the file hive-site.xml. In some distributions, the file is in the following
location:
/etc/hive/conf/hive-site.xml
NOTE: Before running Hive queries locally from a node within your Hadoop cluster, you
must complete the setup steps (except the final one) described in “Setting Up to Run the
Hive Developer Template” (page 5-24). The final setup step, which concerns the
HiveServer2 setting doAs, is not relevant when running Hive queries locally from a node
within your Hadoop cluster. This is because Hive queries that run locally are run as the
interactive user and the Hive UDFs will automatically look for the configuration files in the directory /user/<interactive_user>/voltage/config.
1. If necessary, navigate to the bin directory and type hive to access the Hive prompt.
When this script completes, the first 10 rows of the sample data are shown on the
console.
1. If necessary, navigate to the bin directory and type hive to access the Hive prompt.
When this script completes, the first 10 rows of the sample data are shown on the
console.
1. If necessary, navigate to the bin directory and type hive to access the Hive prompt.
3. Run a simple HiveQL query interactively to access protected data fields in the table:
hive> SELECT id, name,
> accessdata(email, 'alpha'),
> accessdata(birth_date, 'date'),
> accessdata(cc, 'cc'),
> accessdata(ssn, 'ssn')
> FROM voltage_sample
> WHERE id <= 10;
When this script completes, the first 10 rows of the sample data are shown on the
console.
NOTE: If you run this command on a Cloudera CDH 5.2 or 5.3 distribution, the
following messages are generated by the Hive JOIN query:
These messages do not affect the sample query, and can be ignored.
NOTE: Before running Hive queries using the Beeline shell, you must complete the setup
steps described in “Setting Up to Run the Hive Developer Template” (page 5-24). This
includes the final setup step, which concerns the HiveServer2 setting doAs that controls
impersonation. This is required because HiveServer2 runs as the hive system user.
Start the Beeline shell and run queries by performing the following steps:
1. Switch the current login session to the user that you want to run the Beeline shell as (<user>), picking up that user’s home directory and environment variables:
su - <user>
3. Start the Beeline shell (shown on two lines for improved readability; a filled-in example follows the parameter descriptions below):
beeline -n <user> -p <password>
-u "jdbc:hive2://<host>:<port>/<database>;principal=<principal>"
Where:
• <user> and <password> are the username and password values, respectively, to connect to the specified database through HiveServer2.
• <host> is the hostname of the node in your cluster that is running the HiveServer2 service.
• <port> is the relevant port number on the HiveServer2 host. For example, 10000.
• <database> is the name of the relevant Hive database. For example default.
• <principal> is the Hive service principal name, including the @<REALM> part.
NOTE: This part of the command to start the Beeline shell is only needed if the Hadoop cluster or Hive service is Kerberized. Otherwise, it can be omitted.
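For example, with illustrative values for the user, host, and Kerberos principal (shown on two lines for readability):
beeline -n alice -p <password>
-u "jdbc:hive2://hs2node.example.com:10000/default;principal=hive/[email protected]"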
4. Create the temporary or permanent Voltage SecureData UDFs by running one of the UDF creation scripts:
!run create-hive-udf.hql
or
!run create-hive-perm-udf.hql
5. Run a simple query using the temporary or permanent accessdata UDF you just
created. For example:
SELECT id, accessdata(name, 'name'),
accessdata(cc, 'cc'),
accessdata(ssn, 'ssn')
FROM voltage_sample
WHERE id <= 10;
6. Run a join query using the temporary or permanent accessdata UDF you just created.
For example:
SELECT s.id, accessdata(s.name, 'name'),
accessdata(s.ssn, 'ssn'),
accessdata(s.cc, 'cc'),
cs.creditscore
FROM voltage_sample s
JOIN voltage_sample_creditscore cs
ON (s.ssn = cs.ssn)
WHERE s.id <= 10;
• Creating Permanent Hive UDFs Using the Hive Command Line (page 5-31)
NOTE: Before running Hive queries from a remote computer outside of the Hadoop cluster, you
must complete the setup steps described in “Setting Up to Run the Hive Developer
Template” (page 5-24). This includes the final setup step, which concerns the HiveServer2
setting doAs that controls impersonation. This is required because HiveServer2 runs as the
hive system user.
NOTE: The Cloudera CDH 5.1 distribution does not support permanent UDFs, which means
that remote queries are not supported.
After performing these procedures, the JAR file hive-client-test.jar and the Java
Properties file vshive.properties, along with script files for running a sample Hive query,
can be copied to any computer on which the Java Runtime Environment (JRE) is installed.
• The sub-directory src includes sample code to run a Hive query with a UDF call via
remote JDBC connection from a client outside of the Hadoop cluster.
• The sub-directory bin/lib includes the file README.txt that contains a list of the
Hadoop JAR files that you add to this directory, which are needed to run the scripts.
• The file bin/hiveclient is a script that you can run on a Linux machine to
perform a remote Hive query via JDBC
• The file bin/hiveclient.bat is a script that you can run on a Windows machine
to perform a remote Hive query via JDBC
Copy this entire directory from the installation computer to your build computer.
You must generate a Hive client JAR file using the files in the directory <install_dir>/
clientsamples/jdbc. To generate this file:
1. Ensure that both mvn and javac are in your path, and that JAVA_HOME is set to the
installation location of version 8 of the JDK.
Element: repo.id
This element provides the ID used to identify the remote repository in the parent POM file.
Element: repo.url
This element provides the full URL of the remote repository from which the
relevant dependency JAR files will be pulled. Standard repository URLs for
Hortonworks, Cloudera, MapR, and EMR, respectively, are as follows (where
appropriate, shown on two lines to improve readability):
• http://repo.hortonworks.com/content/groups/public
• https://repository.cloudera.com/artifactory/
cloudera-repos
• https://repository.mapr.com/nexus/content/groups/
mapr-public/
• https://<s3-endpoint>/<region-ID>-emr-artifacts/
<emr-release-label>/repos/maven/
For example:
https://s3.us-west-1.amazonaws.com/
us-west-1-emr-artifacts/emr-5.30.0/repos/maven
Element: log4j.version
This element provides the version of the Apache Log4j library that you want
to use.
Element: slf4j.api.version
This element provides the version of the slf4j API library that you want to
use.
Element: slf4j.log4j.version
This element provides the version of the slf4j log4j-12 library that you want
to use.
Element: http.client.version
This element provides the version of the Apache HttpClient library that you want to use.
Element: http.core.version
This element provides the version of the Apache HttpCore library that you want to use.
Element: commons.logging.version
This element provides the version of the Apache Commons Logging library
that you want to use.
Element: hadoop.common.version
This element provides the version of the Apache Hadoop Common library
that you want to use.
Element: hive.exec.version
This element provides the version of the Apache Hive Exec library that you want to use.
Element: hive.jdbc.version
This element provides the version of the Hive JDBC library that you want to
use.
Element: hive.service.version
This element provides the version of the Hive Service library that you want to
use.
You must, at minimum, update this configuration file to point to your Hive server and to run
as a specific username. You can also configure the following additional settings in this
configuration file:
JDBC Driver
jdbc.driver = org.apache.hive.jdbc.HiveDriver
JDBC URL
If the Hive server is listening on a port other than 10000, replace that value as well.
You can also change the database name to the actual name, rather than using the
value default.
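The shipped properties file contains a corresponding URL entry. A sketch of what it might look like, following the pattern of the jdbc.driver entry above (the property name and placeholder host shown here are illustrative), is:
jdbc.url = jdbc:hive2://<your-hiveserver2-host>:10000/default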
Username
Add the username under which you will run the query.
Password
If needed, add the password for the username under which you will run the query.
Query
This default value shows an example of running a query to access protected fields,
using the Voltage SecureData 'accessdata' UDF. You can customize this value to run
a different test query on the server.
Display Fields
This default value shows an example of how to display fields that were accessed
using the default query. You can customize this value if you want to display different
fields from a query.
• commons-logging-<version>.jar
• hadoop-common-<version>.jar
• hive-exec-<version>.jar
• hive-jdbc-<version>.jar
• hive-service-<version>.jar
• httpclient-<version>.jar
• httpcore-<version>.jar
• log4j-<version>.jar
• slf4j-api-<version>.jar
• slf4j-log4j12-<version>.jar
See the documentation for your specific Hadoop distribution for the location of
these files.
• If you are running the sample query from Linux, navigate to the directory bin
and run the script hiveclient. This script sets up the classpath and runs the
Java program.
> cd bin
> chmod +x hiveclient
> ./hiveclient
• If you are running the sample query from Windows, navigate to the folder
<install_dir>\clientsamples\jdbc\bin and double-click the file
hiveclient.bat to set up the classpath and run the Java program.
The first 10 lines of the query display, with the output from the first row similar to
the following:
Row: 0
s.id: [1]
s.name: [Fabien Baillairgé]
email_decrypted: [[email protected]]
bd_decrypted: [3/2/2007]
cc_decrypted: [5225629041834450]
ssn_decrypted: [675-03-4941]
cs.creditscore: [621]
The most secure way to access unprotected data from a remote computer is by running
queries against the Voltage SecureData UDFs. However, some Business Intelligence tools
do not allow the direct query of any Hive UDFs. In this case, you can create Hive views to
access the unprotected data.
CAUTION: The entire set of unprotected data in the cluster is available to any user who
has access to the view, which is a potential security risk. Be aware that the level of
security of the cluster is reduced if you create and run queries using Hive views.
To create the sample view in Hive and run a query from that view:
FROM voltage_sample s
JOIN voltage_sample_creditscore cs
ON (s.ssn = cs.ssn);
• Permanent Hive UDFs on your cluster. See "Creating Permanent Hive UDFs Using the
Hive Command Line" (page 5-31) for instructions.
• A working connection between the computer running Microsoft Excel and your Hadoop
cluster.
• The correct version of the Hive ODBC driver installed on the computer you are using to
run Microsoft Excel.
NOTE: You must use the 32-bit version of the driver if you are using a 32-bit version
of Excel, even if your computer is 64-bit.
You can obtain the correct ODBC driver from the provider of your Hadoop distribution:
• Cloudera
• Hortonworks
• MapR
This launches Microsoft Excel, with the Select Data Source dialog box open.
2. Click the Machine Data Source tab on the dialog box, and then choose the data source
name that corresponds to the driver that you downloaded for your distribution.
For example, if you downloaded the ODBC driver for Hortonworks, the data source
name is Sample Hortonworks Hive DSN.
3. Click OK.
This opens an ODBC driver connection dialog. Note that the exact title of the dialog box
varies, depending on which ODBC driver you are using.
4. In the Host field, enter the hostname for the server in your cluster that is running the
HiveServer2 service.
5. In the User Name field, type the user name under which you want to run the query.
6. In the Password field, type the password value associated with the specified user name.
If your cluster does not have a password configured for the user, do one of the following:
• Click inside the Mechanism field, and then scroll up to display User Name.
Choosing User Name in this field disables the Password field. It also clears the
User Name field, which means that you must re-enter the user name.
If the connection is valid, you see a message indicating success. If you do not see the
success message, you cannot proceed until you have established a valid connection.
8. Click OK. A message displays at the bottom of the Excel file indicating that it is waiting
for the query to be executed. This might take up to several minutes.
When the query completes, the Excel file is populated with plaintext values for email,
birthdate, credit card number, and social security number (ssn) for the first 10 rows of
the sample data stored in the Hadoop cluster, and accessed by the Hive UDF.
3. Click the Definition tab and edit the string in the Command text field.
4. Click OK.
A dialog indicates that the query is no longer identical to the one in the original .odc
file.
Running a permanent Hive Developer Template UDF query using Beeline when the
HiveServer2 setting doAs is set to false will generate the following error in the log file /var/
log/hive/hiveserver2.log (or an alternative location of HiveServer2 log files):
Caused by: com.voltage.securedata.config.ConfigException: Failed to load config/
auth properties from HDFS
at com.voltage.securedata.hadoop.config.HDFSConfigLoader.load(HDFSConfigLoader.java:189)
at com.voltage.securedata.hadoop.config.HDFSConfigLoader.load(HDFSConfigLoader.java:103)
at com.voltage.securedata.hadoop.hive.BaseHiveUDF.initConfig(BaseHiveUDF.java:56)
at com.voltage.securedata.hadoop.hive.BaseHiveUDF.getCrypto(BaseHiveUDF.java:82)
at com.voltage.securedata.hadoop.hive.BaseHiveUDF.evaluate(BaseHiveUDF.java:207)
... 33 more
Caused by: java.io.FileNotFoundException: File does not exist: /user/hive/voltage/config/
vsconfig.xml
This happens because the query is run as the hive system user, and the expected XML
configuration files are not found in the following directory in HDFS:
/user/hive/voltage/config
One way to solve this problem would be to copy the configuration files to the configuration
directory for the hive system user in HDFS: /user/hive/voltage/config. However, if
multiple users are allowed to create and run UDFs in the context of Beeline/HiveServer2, there
are security implications to these users having access to these shared configuration files,
particularly the XML configuration file vsauth.xml and the authentication credentials it
contains.
The Hive Developer Template solves this security issue by providing two additional Hive UDF implementations:
• com.voltage.securedata.hadoop.hive.ProtectDataGeneric
• com.voltage.securedata.hadoop.hive.AccessDataGeneric
These UDF classes inherit from a class hierarchy that ultimately extends the class GenericUDF
(by contrast, the regular Hive UDF classes, for the ProtectData and AccessData UDFs,
inherit from the class UDF).
The code implemented in the generic versions of the Hive UDFs is more complex, but does
provide access to the user running the query through the Hive SessionState API. This
ability, combined with LDAP Plus SharedSecret authentication, allows the generic UDFs to
make cryptographic key requests to the Key Server as the relevant user even when the
HiveServer2 setting doAs is set to false and the job itself runs as the hive system user.
1. Choose one of the schemes that allow you to put your Hive Developer Template
configuration files in a custom location (of your choosing). For more information, see
"Specifying the Location of the XML Configuration Files" (page 3-47).
2. In your chosen custom location, in the XML configuration file vsauth.xml, configure
authentication/authorization to use LDAP Plus Shared Secret, as described in "Other
Approaches to Providing Configuration Settings" (page 3-52).
3. Set the file permissions on the configuration files in your custom location in HDFS to
make them readable only by the hive system user (when the HiveServer2 setting doAs
is set to false, the hive system user is the user that must be able to read the
configuration files). This step is important so that access to the sensitive information,
such as the authentication credentials in the XML configuration file vsauth.xml, is
appropriately limited.
CAUTION: Because this approach requires that the hive system user be granted
read permission to these configuration files, any job running as that system user will
be able to read the sensitive information contained in them. In environments where
untrusted users can create and run new UDFs that end up running as the hive
system user, this approach may not be sufficiently secure.
4. Copy the following required two (or three) JAR files to a common (not user-specific) location in HDFS, for reference when creating the generic UDFs (for example: /apps/hive/voltage/hiveudf/):
• voltage-hadoop.jar
• vibesimplejava.jar
• vsconfig.jar (only if you are using the JAR-based alternative configuration file location approach)
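The shipped script create-hive-perm-udf.hql contains the statements that create the generic UDFs as permanent functions. A sketch of what one of those statements might look like, assuming the example HDFS location above (the exact statements in the shipped script may differ), is:
CREATE FUNCTION accessdatageneric AS
'com.voltage.securedata.hadoop.hive.AccessDataGeneric'
USING JAR 'hdfs:///apps/hive/voltage/hiveudf/voltage-hadoop.jar',
JAR 'hdfs:///apps/hive/voltage/hiveudf/vibesimplejava.jar';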
Make sure to update the paths to the JAR files in HDFS to the location where you
copied them, if different than shown above.
6. Call the newly created generic UDFs in the context of Hive queries, just as you would for
the regular protectdata and accessdata UDFs. For example:
SELECT id, name, accessdatageneric(cc, 'cc')
FROM voltage_sample
WHERE id <= 10;
The generic UDFs have the same parameters as the regular UDFs:
protectdatageneric(<value/column>, <cryptId>)
accessdatageneric(<value/column>, <cryptId>)
If you do not need both variants of the UDFs to run side-by-side, you could easily create the
generic UDFs with the regular names, protectdata and accessdata, and just use them in
almost all circumstances. While the generic UDFs are only strictly required in circumstances
where the HiveServer2 setting doAs must be set to false (such as when Sentry or Ranger are
being used), they can still be used in situations in which the doAs setting is set to true or is
irrelevant (the Hive command line being an example of the latter). In such cases, the generic
UDF implementation will authenticate the correct user (the current Hadoop user as reported by
the UserGroupInformation API) when performing LDAP Plus Shared Secret authentication.
Nevertheless, because their code is simpler and easier to understand, the regular UDFs remain
as the primary UDF sample implementations in the Hive Developer Template.
CAUTION: If you are using Apache Impala to run the Hive UDFs provided in the Developer
Templates, note that the generic UDFs described in this section do not work with Impala.
Impala does not support Hive UDFs that extend from class GenericUDF.
Running Hive Developer Template UDF queries in the context of LLAP faces the same
challenges as when running Hive Developer Template UDF queries using Beeline when the
HiveServer2 setting doAs is set to false (that is, with impersonation disabled). Because such
queries are run as the hive system user, using the regular Hive Developer Template UDFs
makes the choice of Voltage SecureData authentication method difficult. Shared Secret, which
does not do per-user authorization, is workable, but limited in that regard, providing only
identity pattern-matching authorization. The other available authentication methods (Username and Password, LDAP + Shared Secret, and Kerberos) will only work if you are willing to also give the hive user the appropriate privileges in LDAP or Kerberos, as applicable. However,
this does not effectively limit queries to only certain users because all users are running their
queries as the hive user.
The solution for running the Hive Developer Template in the context of LLAP is the same as for
the Beeline/HiveServer2 scenario described above: using the alternative, more sophisticated
UDFs that extend a base class that allows the UDFs to make cryptographic key requests to the
Key Server as the relevant user even when the LLAP daemons are running as the hive system
user. Creating and running the generic UDFs protectdatageneric and
accessdatageneric, as described in "Using the Generic Hive UDFs When Impersonation is
Disabled" (page 5-40), works in the context of LLAP without any additional changes required,
including with any of the available authentication methods.
Also, due to security concerns, Hive UDFs run in the context of LLAP are required to be created as permanent UDFs. If you attempt to run temporary UDFs in the context of LLAP, then depending on the LLAP mode (only, all, auto, map, or none), they will either fail or fall back to run on the external Hive queues.
To successfully execute the generic Hive UDFs in the context of LLAP, follow these steps:
2. Launch a Beeline session using the appropriate Beeline JDBC URL. For example (shown
on two lines to improve readability):
beeline -u "jdbc:hive2://<hive-server-interactive-host>:<port>/
;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-interactive"
NOTE: This URL is different from the one you use to launch Beeline for HiveServer2, including the host and port.
3. Create generic Hive Developer Template UDFs as permanent UDFs. For an example, see
the creation of the generic Hive UDFs protectdatageneric and
accessdatageneric in the script create-hive-perm-udf.hql.
4. Verify that the Hive LLAP execution mode is set as desired (generally either only or
all in order to force LLAP execution as much as possible).
5. Restart the Hive service. This is required to avoid errors associated with the generic
Hive UDFs you created in Step 3 above. Depending on the version of HDP you are
using, such errors can be associated with not finding the UDF(s) or not being allowed to
run the UDF(s).
6. Re-launch Beeline as described in Step 2 above. This is required because of the Hive
service restart.
7. Run a Hive query using the generic Hive UDFs created in Step 3 above.
NOTE: For information about additional steps, both optional and required, that you must
also perform if you are using Kerberos authentication in the context of LLAP, see "Kerberos
Authentication When Beeline/HiveServer2 Impersonation is Disabled" (page 3-84).
When Impala runs the Hive UDFs, it will compile the JAR files referenced by the UDFs into a single JAR file within the sub-directory /var/lib/impala/udfs and then use that combined JAR file to run the UDFs.
Running the MapReduce protect job involves both common infrastructure steps and a specific
MapReduce step:
1. Make sure a home directory exists in HDFS for the user that will be running the Impala
queries. This is usually the user impala, but if Apache Sentry is being used for
authorization, it could be a different user. For more information, see "Creating a Home
Directory in HDFS" (page 3-75).
2. Run the script copy-sample-data-to-hdfs to copy the files required for the
MapReduce protect job to HDFS using:
./copy-sample-data-to-hdfs
For more information, see "Loading Hadoop Developer Template Files into HDFS" (page
3-76).
For more information, see "Running the MapReduce Template" (page 4-6).
After the MapReduce protect job has created the output CSV file mr-protected-data.csv
in the local directory sampledata, you can run the first of the Impala-specific scripts:
copy-impala-data-to-hdfs. As the user impala (or potentially another user if Sentry is being used), run this Impala-specific script:
./copy-impala-data-to-hdfs
This script:
• If necessary, creates the HDFS directory voltage/config (within the user’s home
directory), and then copies the XML configuration files vsconfig.xml and
vsauth.xml from the local file system to that directory.
• If necessary, creates the HDFS directory voltage/hiveudf (within the user’s home
directory), and then copies the JAR files voltage-hadoop.jar and
vibesimplejava.jar from the local file system to that directory.
NOTE: If you are using the JAR-based alternative configuration file location
approach, as described in "Config-Locator Properties File Packaged as a JAR File"
(page 3-48), you can uncomment a line in this script to copy the JAR file
vsconfig.jar to the voltage/hiveudf directory as well.
• create-impala-table.sql
Before using this SQL script, you must edit it and replace all non-comment instances of
<username> with the name of the user you are using to run the Impala scripts. Normally, this is
the user impala, but it might be another user if you are using Sentry for authorization.
As the user whose name you edited into the table creation script, in the Impala shell, create the
Impala tables as follows:
[<daemon-node-name>:21000] > source create-impala-table.sql;
NOTE: When you run the impala-shell command on a computer that is not running the Impala daemon, the Impala shell starts without a connection to a daemon. To connect to a computer running the Impala daemon, use the Impala shell command connect (for example, connect <daemon-node-name>;); once connected, the prompt becomes:
[<daemon-node-name>:21000] >
This SQL script creates and loads tables that are similar to those created by the Hive HQL
script create-hive-table.hql, but because Hive and Impala share the Hive metastore
database, in order to avoid naming conflicts and problems when dropping the tables, it
appends _impala to all of the names of the tables it creates.
NOTE: When you create the Impala tables using the CSV files mr-protected-data.csv,
creditscore.csv, and encoded_binary.csv in the HDFS directory voltage/
impala-sample-data, those files are moved in the course of table creation to the Hive
data warehouse in HDFS. This means that if you want to re-create these tables for any
reason, you must run the script copy-impala-data-to-hdfs again in order for the SQL
script create-impala-table.sql to find its source CSV files where it expects to find
them.
• create-impala-perm-udf.sql
NOTE: Temporary Impala UDFs do not persist across an Impala service restart; permanent Impala UDFs do.
Before using one or the other of these SQL scripts, you must edit it and replace all non-
comment instances of <username> with the (same) name of the user account under which
you are running the Impala SQL scripts, either user impala or another user if you are using
Sentry for authorization.
As the user whose name you edited into the temporary or permanent UDF creation script(s), in the Impala shell, create the Impala UDFs as follows:
[<daemon-node-name>:21000] > source create-impala-temp-udf.sql;
or
[<daemon-node-name>:21000] > source create-impala-perm-udf.sql;
These scripts create the temporary or permanent UDFs using different syntax and both using
the same set of UDF names:
• protectdataimpala
• accessdataimpala
• protectbinarydataimpala
• accessbinarydataimpala
As with the Impala table names, this is done to avoid name conflicts with the Hive UDFs, which could otherwise cause issues when creating permanent UDFs and when dropping any of the UDFs (due to ownership differences).
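For reference, the Impala scripts use Impala's Java UDF syntax, in which the JAR location and the implementing class are specified explicitly. A sketch of the kind of statement involved, assuming the JAR location used earlier in this chapter (the exact signatures in the shipped scripts may differ), is:
CREATE FUNCTION accessdataimpala(STRING, STRING) RETURNS STRING
LOCATION '/user/impala/voltage/hiveudf/voltage-hadoop.jar'
SYMBOL='com.voltage.securedata.hadoop.hive.AccessData';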
NOTE: These SQL scripts use the same names for the temporary and permanent Impala
UDFs: protectdataimpala, accessdataimpala, protectbinarydataimpala, and
accessbinarydataimpala. To avoid name conflicts, create either temporary Impala UDFs
or permanent Impala UDFs, but not both.
• run-impala-join-query.sql
• run-impala-binary-query.sql
To run the JOIN query, as the user impala (or potentially another user if Sentry is being used),
in the Impala shell, run the SQL script run-impala-join-query.sql to use the temporary
or permanent Impala UDF for ciphertext access (depending on which type you created) named
accessdataimpala in a JOIN query on tables created by the script create-impala-
table.sql:
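The invocation follows the same pattern as the other Impala script invocations in this section:
[<daemon-node-name>:21000] > source run-impala-join-query.sql;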
When this script completes, the first 10 rows of the sample data are shown on the console.
NOTE: This Impala script, as shipped, does not assume that the Simple API and the REST
API support FPE2 extended characters, as suggested by the commented out call to the UDF
accessdataimpala with the cryptId extended:
To access the name field using the cryptId extended, remove the comment designators
(--) from the beginning of that line (and to avoid retrieving the ciphertext version as well,
remove the s.name, from the line above).
To run the binary data query, as the user impala (or potentially another user if Sentry is being
used), in the Impala shell, run the SQL script run-impala-binary-query.sql to use the
temporary or permanent Impala UDF for ciphertext access (depending on which type you
created) named accessbinarydataimpala in a simple query on the binary data tables
created by the script create-impala-table.sql:
[<daemon-node-name>:21000] > source run-impala-binary-query.sql;
When this script completes, the Base64-encoded binary data of the two .PNG images, and their
associated data, are shown on the console.
• drop-impala-udf.sql
As the user impala (or potentially another user if Sentry is being used), in the Impala shell, run
the SQL script drop-impala-udf.sql to drop both temporary and permanent Impala UDFs
with the names accessdataimpala, protectdataimpala, accessbinarydataimpala,
protectbinarydataimpala from the Hive metastore database, if any:
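The invocation follows the same pattern as the other Impala script invocations in this section:
[<daemon-node-name>:21000] > source drop-impala-udf.sql;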
6 Sqoop Integration
The Hadoop Developer Templates demonstrate how to integrate Voltage SecureData data
protection technology in the context of Sqoop. This demonstration includes the use of the
Simple API (version 4.0 and greater) and the REST API.
The Sqoop integration is more complex than the MapReduce and Hive integrations because Sqoop
does not provide an explicit mechanism through which its import process can be extended to
support a transformation phase. Nevertheless, for Sqoop 1.x at least, this integration was
accomplished by devising a way to wrap the object-relational mapping (ORM) class generated
using the codegen command, combined with the Java packages in the common infrastructure.
This chapter describes this integration and provides instructions on how to run the
Sqoop template using the provided sample data. For more information about the common
infrastructure used by the Hadoop Developer Templates, see Chapter 3, “Common
Infrastructure”.
• Integration Architecture of the Sqoop Template (page 6-2) - This section explains the
Java classes that are specific to the Sqoop template as well as information about how
batch processing is achieved for the different SecureData APIs that this template can
use.
• Configuration Settings for the Sqoop Template (page 6-4) - This section reviews the
configuration settings that are relevant to the Sqoop template and provides an example
of how and why you would want to change those settings as you adapt the template to
your own use.
• Running the Sqoop Template (page 6-6) - This section provides instructions for
running the Sqoop template using the provided sample data.
The following Java package and its associated Java source code provide the classes that
implement the Sqoop integration:
com.voltage.securedata.hadoop.sqoop
Unlike MapReduce and Hive, Sqoop does not provide an explicit mechanism through which to
integrate custom code. The base Sqoop import job runs from the command-line, using
command-line arguments, and imports data from an external relational database table into
HDFS directly, without a well-defined or documented way to transform or otherwise process
the data as it is being imported. Sqoop is an efficient Hadoop-based extract-load (EL) tool but
not a full-fledged extract-transform-load (ETL) tool. There is no transform phase that is
designed to be customized.
NOTE: Some Sqoop import options may not work at all with the Sqoop integration, while others
work only with the non-batched version of the integration. For more information, see “Advanced
Sqoop Import Options” (page 6-10).
However, Sqoop does provide a way to generate code for the object-relational mapping (ORM)
class that is used to perform the import, with this class being provided as an explicit command-
line input to the Sqoop import command. This opens a mechanism by which custom data
processing code can be integrated into the import data flow, effectively providing a custom
transform phase.
NOTE: The Sqoop integration architecture described in this section works in the context of
Sqoop 1.x. Sqoop 2 uses an entirely different server-side architecture, which does not
support code generation (using the codegen command). Therefore, the approach outlined
here does not apply to Sqoop 2.
Also note that some newer Hadoop distributions have changed such that they no longer use
the deprecated Cloudera package for their Sqoop-generated ORM classes. The Maven build
script (pom.xml) for the Hadoop Developer Templates dynamically adapts to the use of
either the older, deprecated Cloudera Sqoop package or the newer Apache Sqoop package.
For more information, see “Support for Newer Apache Sqoop 1.x Versions” (page 2-16).
The Developer Template for Sqoop accomplishes this by wrapping the generated ORM class in
the class SqoopRecordWrapper. This class uses the Java Reflection API to read data from,
and write data to, the wrapped ORM class. Using this approach, custom code is integrated into
this wrapper to protect the data after it is read from the JDBC ResultSet, and before it is
returned as an output string for Sqoop to write to HDFS.
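The following is a minimal Scala sketch of this wrapping-plus-reflection idea. The shipped classes are written in Java, and the class name, field handling, and setAccessible call here are simplifying assumptions, not the template's actual implementation:

class OrmFieldAccessSketch(wrapped: AnyRef) {
  // Read every String-typed field of the wrapped ORM object by name, so the
  // values can be handed to the data protection API.
  def columnValues(): Map[String, String] =
    wrapped.getClass.getDeclaredFields.collect {
      case f if f.getType == classOf[String] =>
        f.setAccessible(true)
        f.getName -> f.get(wrapped).asInstanceOf[String]
    }.toMap

  // Write a (possibly protected) value back into the wrapped ORM object.
  def setColumnValue(name: String, value: String): Unit = {
    val field = wrapped.getClass.getDeclaredField(name)
    field.setAccessible(true)
    field.set(wrapped, value)
  }
}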
The abstract base class SqoopRecordWrapper demonstrates the first phase of the batching
functionality for this integration, which is especially important to minimize network overhead
when using the REST API. For more information about batching in the Sqoop Developer
Template, see "Batch Processing for Sqoop" (page 6-3).
At runtime, the overall process, as implemented in the indicated shell scripts and assuming that
you have loaded data into your MySQL database table, involves the following general steps:
1. Run the sqoop codegen command to generate the ORM class for your table.
2. Run the sqoop import command to use the generated ORM class and Developer
Template libraries to protect data as it is being imported into HDFS.
• readFields(ResultSet __dbResults)
This method reads the data into the ORM class from the JDBC ResultSet, in batches.
• toString()
This method protects the batch of data by calling the configured Voltage SecureData
data protection API (either the Simple API or the REST API), and returns the processed
batch results as a string to Sqoop for writing to HDFS.
Whether a particular Sqoop import option works using this batching approach depends on
whether the option in question calls both the readFields and toString methods, allowing
both of their overridden SecureData counterparts to perform their roles in the cryptographic
processing. Some Sqoop import options, such as options to import into HCatalog, only call the
readFields method, making the batching approach unworkable. However, if your
cryptographic needs can be satisfied with the Simple API (without the REST API), a different,
non-batched approach is possible. For more information about how to determine whether the
Sqoop integration can work with a particular Sqoop import option, see “Advanced Sqoop Import
Options” (page 6-10).
The default Sqoop import data flow processes its data record-by-record. However, the Developer
Template for Sqoop has been written so that the data to be protected or accessed is batched
together for efficient processing, regardless of which type of API is used. When the Simple API
is being used, the batched list of plaintext or ciphertext is looped over in the Sqoop template
code itself, and when the REST API is used, the batched list of plaintext or ciphertext is
processed in a single REST list operation.
The batching of plaintext and ciphertext for efficient data protection processing by the REST
API within the Sqoop template code is not as straightforward as the batching mechanism for
the MapReduce template. This is because, unlike MapReduce and Hive, Sqoop does not
provide an explicit mechanism through which to integrate custom code. The base Sqoop
import job runs from the command-line, using command-line arguments, and imports data from
an external relational database table into HDFS directly, without a well-defined or documented
way to transform or otherwise process the data as it is being imported, whether batched or not.
The integration technique of wrapping the generated object-relational mapping (ORM) class in
the class SqoopRecordWrapper can also be used for batching plaintext or ciphertext together
for efficient data protection processing. In the default Sqoop import data flow, data is read from
the JDBC ResultSet into the fields of the ORM class via the method readFields, and then
the values from these fields are returned to Sqoop as a string via the method toString. This is
all done record-by-record, which is inefficient when using the REST API.
As with the MapReduce Developer Template, this problem is solved in the class
SqoopRecordWrapper by altering the logic of its overridden methods readFields and
toString. The method readFields batches up the records it reads from the JDBC
ResultSet. Then, when the method toString is called, the data protection processing is
performed on the entire batch and the data protection results are compiled into the result
string returned to Sqoop for writing to HDFS.
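The following is a minimal Scala sketch of this two-method batching contract. The shipped SqoopRecordWrapper and its subclasses are Java classes that extend the generated ORM type and call the real SecureData APIs; the class name, the way the ResultSet is advanced, and the placeholder protectBatch call below are assumptions for illustration only:

import java.sql.ResultSet
import scala.collection.mutable.ArrayBuffer

class BatchingRecordSketch(batchSize: Int = 2000) {
  private val batch = ArrayBuffer[Seq[String]]()

  // Pull up to batchSize rows from the ResultSet into one logical "record",
  // so that a whole batch is carried between readFields and toString.
  def readFields(results: ResultSet): Unit = {
    batch.clear()
    val columnCount = results.getMetaData.getColumnCount
    var rows = 0
    var more = true
    while (more) {
      batch += (1 to columnCount).map(i => results.getString(i))
      rows += 1
      more = rows < batchSize && results.next()
    }
  }

  // Protect the whole buffered batch in one call (a single REST list operation,
  // or a loop over the Simple API), then emit the delimited output lines.
  override def toString: String = {
    val processed = protectBatch(batch.toSeq)
    batch.clear()
    processed.map(_.mkString(",")).mkString("\n")
  }

  // Placeholder for the call into the configured SecureData data protection API.
  private def protectBatch(rows: Seq[Seq[String]]): Seq[Seq[String]] = rows
}

Consuming multiple rows per readFields call is one way to produce the batch-counting behavior described in the NOTE that follows; the shipped implementation may differ in detail.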
NOTE: The Sqoop integration batches the import operation for efficient REST API
processing. However, as a side effect of this batching, the Sqoop import command reports
the number of batches processed as the number of records retrieved.
For example, the Sqoop import job might display the following result message:
INFO mapreduce.ImportJobBase: Retrieved 8 records.
This message actually indicates that eight batches of records (not just eight records) were
processed/imported. Each batch contains up to 2000 records, which is the default batch size
configured in the Sqoop Developer Template code. Some batches might contain fewer
records if a partial batch was processed.
There are three classes of configuration settings used by the Sqoop template:
<fieldMappings>
.
.
.
<fields component="sqoop">
<field name = "email" cryptId = "alpha"/>
<field name = "birth_date" cryptId = "date"/>
<field name = "cc" cryptId = "cc"/>
<field name = "ssn" cryptId = "ssn"/>
</fields>
</fieldMappings>
Before you begin to modify the Sqoop Developer Template XML configuration files for your
own purposes, such as using your own Voltage SecureData Server or different data formats,
Micro Focus Data Security recommends that you first run the Sqoop Developer Template
samples as provided, giving you assurance that your Hadoop cluster is configured correctly and
functioning as expected.
You will need to update many of the values in the XML configuration files vsauth.xml and
vsconfig.xml in order to protect your own data using your own Voltage SecureData Server.
In order to add protection and access of the phone column when running the Sqoop template,
you would need to edit the XML configuration file vsconfig.xml to add new configuration
settings to the Sqoop fields element, as follows (addition highlighted):
<fieldMappings>
.
.
.
<fields component="sqoop">
<field name = "email" cryptId = "alpha"/>
<field name = "birth_date" cryptId = "date"/>
<field name = "cc" cryptId = "cc"/>
<field name = "ssn" cryptId = "ssn"/>
<field name = "phone" cryptId = "alpha"/>
</fields>
</fieldMappings>
• The name attribute of the new field element for the phone number is set to the
column name phone. This is the column name under which the phone numbers in the
sample data file plaintext.csv are imported into the MySQL database table during
the first phase of running the Sqoop Developer Template.
• The cryptId attribute of the new field element for the phone number is set to
alpha, which is the name of a cryptId that uses the built-in format Alphanumeric and
the Simple API (the default API), providing the best performance. This format
produces ciphertext phone numbers with the plaintext digits replaced with either digits
or letters (uppercase and lowercase). Non-alphanumeric characters such as +, (, and )
are preserved as-is.
NOTE: Remember that if you edit the versions of the configuration files vsauth.xml and/or
vsconfig.xml on the local file system, you must copy the updated versions to HDFS,
which is where the Hadoop Developer Templates will access them when they are run. For
instructions, see "Loading Updated Configuration Files into HDFS" (page 3-77).
The MySQL JDBC driver is already included in some Hadoop distributions (such as HDP). The
sample scripts assume that the sample data exists in a MySQL database. You must load the
sample data into MySQL, generate a JAR file, and then use Sqoop to import the data while also
protecting some of the fields.
NOTE: The Sqoop integration in the Hadoop Developer Templates is primarily meant for the
main Sqoop import use-case of loading data from a relational database table into HDFS.
Sqoop supports several advanced import options that provide different output options, not
all of which will work with the Sqoop integration. For more information, see “Advanced Sqoop
Import Options” (page 6-10).
NOTE: Subsequent commands in this section are issued at the mysql> prompt.
NOTE: Specifying CHARACTER SET UTF8 is now required due to the extended
characters in the name column.
5. Exit MySQL:
mysql> exit;
• DATABASE_NAME=<database name>
3. By default, the database fields imported into HDFS by Sqoop, some of which are
protected in the process, are delimited by the comma character (,). If you want to use a
different delimiter character in the HDFS import file(s), you must change the --table line
in the script codegen so that it also specifies an alternative delimiter, as follows:
--table $TABLE_NAME --fields-terminated-by "<delimiter>" \
4. Save the file, then run the script from the bin directory:
./codegen
5. At the prompt, enter the password for your MySQL database user <username> (the value
specified for the DATABASE_USERNAME variable).
If the script runs successfully, the generated ORM class is compiled and packaged into the JAR
file com.voltage.sqoop.DataRecord.jar.
If the script does not run successfully (if the variables are not updated correctly, for example),
you see messages such as the following:
./codegen: line 12: syntax error near unexpected token 'newline'
NOTE: The ORM source files (.java and .class) are created in a package directory
structure within the parent directory from which you are running the codegen script. This
means that the user account running the command must have permissions to create a new
sub-directory within the current working directory. If your user account does not have the
required permissions, the codegen script might fail with a permissions-related error.
If you see such an error message, grant the required permissions for the current working
directory and try again.
• DATABASE_NAME=<database name>
3. Save the file, then run the script from the bin directory:
./run-sqoop-import
4. At the prompt, enter the password for your MySQL database user <username> (the value
specified for the DATABASE_USERNAME variable).
If the script runs successfully, the data from the voltage_sample table in MySQL is imported
into the following directory in HDFS (relative to your home directory):
voltage/protected-sqoop-import
The fields specified in the configuration file vsconfig.xml are protected during the import.
If the script does not run successfully (if the variables are not updated correctly, for example),
you see messages such as the following:
./run-sqoop-import: line 11: syntax error near unexpected token `newline'
As described above, the batched version of the Sqoop integration works by overriding two
methods in the generated ORM class: readFields and toString. Together, the overridden
versions of these methods work in tandem to perform cryptographic processing:
• readFields reads the data into the ORM class from the JDBC ResultSet, in batches.
• toString protects the batch of data and returns the processed batch results as a
string to Sqoop for writing to HDFS.
Sqoop provides a number of import options that vary with respect to whether their processing
calls both of these methods, allowing the batched version of the Sqoop integration to work as
expected. For example:
• --fields-terminated-by
In the context of the codegen/import flow in the Sqoop integration, this import option
is ignored if it is specified in the import command itself. However, as explained in step 3
of “Generate an ORM JAR File” (page 6-7), this import option can be specified in the
codegen step of the flow to allow the use of a delimiter other than the default, the comma
character (,). This is required because the delimiter character is defined as a constant in
the generated ORM class and is used during the import operation, regardless of any
value specified for this option in the import command.
Because they call both of the overridden readFields and toString methods of the
generated ORM class, both of these import options, used to import directly from a
relational database table into a table in Hive, work using the batched version of the
Sqoop integration.
Because they only call the overridden readFields method of the generated ORM
class, these import options, used to import directly from a relational database table into
a table in HCatalog, do not work using the batched version of the Sqoop integration. If
you attempt to use the batched version of the Sqoop integration with either of these
HCatalog import options, you will see that the number of output records is less than the
number of input records (because the toString method is not called to process and
clear the batches) and that the specified output fields have not been protected (also
performed by the uncalled toString method).
NOTE: The job logs will contain evidence of the method readFields being called
(messages of the form Reading fields from result set; web service
batch size:), but without corresponding evidence of the method toString being
called (no messages of the form Writing data to HDFS; batch size:).
As with the HCatalog import options described above, this import option, used to import
directly from a relational database table into Apache Parquet Files, does not work with
the batched version of the Sqoop integration for the same reason: because only the
overridden readFields method of the generated ORM class is called. Job log evidence
of this fact is the same: Reading messages without corresponding Writing messages.
Likewise, the workaround is the same, again assuming exclusive use of the Simple
API for cryptographic processing: use the non-batched version of the Sqoop integration,
as described in “Non-Batched Version of the Sqoop Integration” (page 6-12).
Other advanced Sqoop import options may work with the Sqoop integration, depending on
whether they:
• Call both of the (overridden) readFields and toString methods of the generated
ORM class, allowing the batched version of the Sqoop integration to be used.
• Only call the (overridden) readFields method of the generated ORM class, allowing
the alternative non-batched version of the Sqoop integration to be used (with the
Simple API only due to the network overhead associated with non-batched use of the
REST API).
If a particular Sqoop import option does not conform to either of these scenarios, it cannot be
used with the Sqoop integration. For more information, see “Determine Whether the Sqoop
Integration Supports a Sqoop Import Option” (page 6-13).
The Sqoop integration provides the following alternative ORM wrapper class:
NonBatchedSqoopImportProtector
Unlike the main SqoopRecordWrapper base class, which performs the cryptographic
processing in stages using a combination of both the readFields and toString methods for
optimized batch processing, the ORM wrapper class NonBatchedSqoopImportProtector
performs all cryptographic processing within the readFields method.
The main disadvantage to this alternative approach is that all cryptographic operations are
performed on individual records, one at a time, with no batching. This is problematic in the case
of remote cryptographic processing using the REST API, which will not be able to handle large
data sets efficiently. However, if you are only performing local cryptographic processing using
the Simple API, this alternative approach may allow you to perform the SecureData integration
successfully, even when using an advanced Sqoop import option that was failing when using
the batched (SqoopRecordWrapper) version of the Sqoop integration.
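The following is a minimal Scala sketch of the non-batched idea. The shipped NonBatchedSqoopImportProtector is a Java class; the class name, field handling, and placeholder protectValue call below are assumptions for illustration only:

import java.sql.ResultSet

class NonBatchedRecordSketch {
  private var current: Seq[String] = Seq.empty

  // Protect the single current row immediately, field by field, as it is read.
  def readFields(results: ResultSet): Unit = {
    val columnCount = results.getMetaData.getColumnCount
    current = (1 to columnCount).map(i => protectValue(results.getString(i)))
  }

  // Nothing is left for toString to do beyond emitting the delimited row.
  override def toString: String = current.mkString(",")

  // Placeholder for a per-value Simple API protect call.
  private def protectValue(value: String): String = value
}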
To use the non-batched version of the Sqoop integration, create a version of the run-sqoop-
import script that replaces the SqoopImportProtector class (which extends the
SqoopRecordWrapper base class) with the NonBatchedSqoopImportProtector class,
and which adds the specific advanced Sqoop import option that was otherwise failing (using
the option --hcatalog-database as an example here):
sqoop import \
-libjars com.voltage.sqoop.DataRecord.jar,../common-crypto/simpleapi/vibesimple...
--username $DATABASE_USERNAME \
-P \
--connect jdbc:mysql://$DATABASE_HOST/$DATABASE_NAME \
--table $TABLE_NAME \
--jar-file voltage-hadoop.jar \
--class-name com.voltage.securedata.hadoop.sqoop.NonBatchedSqoopImportProtector \
--target-dir voltage/protected-sqoop-import \
--hcatalog-database
Then run the modified script and check the results. Also, check the job logs for the following log
messages, indicating that cryptographic processing was performed in the method
readFields:
As mentioned above, the <crypto-api-type> in this message needs to specify the
Simple API. Using remote cryptographic processing, such as with the REST API, with the
non-batched version of the Sqoop integration results in an individual network call for each
protect operation, causing severe performance issues and potential job timeouts or other
failures when importing large data sets.
• The advanced Sqoop import option in question must recognize and use the
--class-name parameter to the sqoop import command, which is used to specify
the generated ORM class whose methods are called during the import.
• In the generated ORM class specified by the --class-name parameter to the sqoop
import command, the advanced Sqoop import option in question must either A) call
both of the methods readFields and toString (batched version of the Sqoop
integration), or B) call only the method readFields (non-batched version of the
Sqoop integration).
The batched version of the Sqoop integration relies on both of the methods
readFields and toString being called, the overridden versions of which work in
tandem to provide batched cryptographic processing. This version is appropriate for
use with either the Simple API or the REST API.
The non-batched version of the Sqoop integration (see "Non-Batched Version of the
Sqoop Integration" on page 6-12) relies only on the method readFields being called,
the overridden version of which provides non-batched cryptographic processing. This
version is appropriate for use only with the Simple API due to the high network
overhead associated with non-batched use of the REST API.
The steps to determine compatibility of a particular advanced Sqoop import option with one or
the other of the versions of the Sqoop integration are as follows:
1. Try the batched version of the Sqoop integration on a small sample dataset (a few
thousand records, as provided in the sample input data) with the advanced Sqoop
import option of interest, such as by creating a variant of the run-sqoop-import
script with the relevant option added. After the import job completes, examine the
resulting output to make sure that all of the input records were processed and that the
specified fields in the records were protected as expected. In particular, check for the
following types of errors:
• The output is not formatted properly, such as if the specified delimiter character
was not used. This may indicate a failure to apply the specified Sqoop import
option in conjunction with the generated ORM class.
• The specified fields are not protected in the output (they are still plaintext). This
indicates a failure to call both of the methods readFields and toString in the
generated ORM class correctly.
• The number of records in the output is less than the number of records in the
input. This indicates a failure to call the method toString in the generated
ORM class to process and clear the batches.
If none of these errors are found, the advanced Sqoop import option of interest appears
to work correctly with the batched version of the Sqoop integration. Even more
confidence can be gained by looking in the job logs for the messages described in
step 3 below.
If any of these errors are found, proceed to step 2 to determine whether the advanced
Sqoop import option of interest is compatible with the use of a generated ORM class.
2. If step 1 above fails (one or more of the types of errors described is found), the next
step is to try the option again, still specifying a generated ORM class but one without
any SecureData integration.
IMPORTANT: This sqoop import command specifies and uses a generated ORM
class but does not include the Hadoop Developer Templates JAR file nor the Simple
API JAR file, completely bypassing the SecureData integration.
After the import job completes, check the resulting output again, to make sure that all of
the input records were imported directly, as is, to the output location and that the
specified advanced Sqoop import option of interest was applied. If any errors are found,
the advanced Sqoop import option of interest may just not work with any generated
ORM class specification, integrated with SecureData or not.
If no errors are found, the advanced Sqoop import option would appear to work
correctly with a generated ORM class. Proceed to step 3 to troubleshoot the
cryptographic processing associated with the Sqoop integration.
NOTE: Keep in mind that a failure to apply the specified advanced Sqoop import
option in this simpler sqoop import command may also indicate that the option
should be provided earlier, during the codegen step, as is the case for the
--fields-terminated-by option described above.
3. If step 1 failed, indicating that the batched version of the Sqoop integration does not
work for the advanced Sqoop import option of interest, but step 2 succeeded, indicating
that the advanced Sqoop import option of interest works, in general, with a generated
ORM class, the next step is troubleshooting the SecureData aspect of the integration.
Specifically, you must determine which of the overridden methods readFields and
toString are called, if any.
To determine this, run the sqoop import command again (as in step 1 above), and
check the logs for the import job (for example, in the Hadoop JobHistory user interface).
Specifically, look for the following messages logged by the SecureData
SqoopRecordWrapper class, indicating that the corresponding method was invoked:
messages of the form Reading fields from result set; web service batch size:
(logged by the method readFields) and messages of the form Writing data to HDFS;
batch size: (logged by the method toString).
If either of these log messages is missing, then that indicates that Sqoop did not call the
corresponding overridden method in the generated ORM class when performing the
import operation with the specified advanced Sqoop import option, preventing the
SecureData cryptographic integration from working properly.
If neither of the required methods was called, the Sqoop integration does not support
the advanced Sqoop import option in question.
If the readFields method was called but the toString method was not called, and if
your scenario can work with the Simple API only (that is, no functionality specific to the
REST API, such as SST, is required), the advanced Sqoop import option of interest
should be able to use the non-batched version of the Sqoop integration, as described in
“Non-Batched Version of the Sqoop Integration” (page 6-12).
7 Spark Integration
The Spark Developer Template demonstrates how to integrate Voltage SecureData data
protection technology in the context of Apache Spark, using the Scala programming language.
The Spark Developer Template also demonstrates the use of PySpark, the Python API to
Apache Spark. This demonstration includes the use of the Simple API (version 4.0 and greater)
and the REST API to perform cryptographic operations.
The Spark Developer Template demonstrates the use of several different older and newer
Spark and PySpark APIs and their corresponding data structures, including the following
variants:
• Spark SQL - Using Spark UDFs to perform cryptographic operations by making a Spark
and PySpark SQL query on a Spark DataFrame data set.
The Spark Developer Template uses the Spark driver/processor model. The data to be
protected or accessed is originally in the form of one or more columns in an input CSV file, and
the protected or accessed result ends up in an output CSV file.
At a high level, these Spark Developer Template drivers perform the following steps:
• Perform the Spark creation phase by reading the input CSV file into the appropriate
Spark data structure (RDD, Dataset, and/or DataFrame objects, and/or a temporary
SQL table or view).
• Call the processor, either directly or by using a UDF, to protect or access the relevant
columns in the Spark object for that variant.
The Spark Developer Template drivers and processors, written in Scala, work in conjunction
with the Java packages in the common infrastructure. This chapter provides a description of
these Spark components as well as instructions on how to run the Spark Developer
Template using the provided sample data.
For more information about the common infrastructure used by all of the Developer Templates,
including the Spark Developer Template, see Chapter 3, “Common Infrastructure”.
NOTE: The Spark Developer Template sample job uses one of the sample data files already
supplied with the existing Hadoop Developer Templates: plaintext.csv (located in the
directory <install_dir>/sampledata). The Spark job script
run-spark-prepare-job copies this file to a Spark-job-specific location in HDFS. For
more information about this sample data file, see "Sample Data for the Spark Developer
Template" (page 7-14).
• Integration Architecture of the Spark Developer Template (page 7-3) - This section
explains the Scala classes that implement the driver and processor components of the
Spark Developer Template. It also describes the three Python modules that serve as
driver components for three PySpark variants of the Spark Developer Template.
• Configuration Settings for the Spark Developer Template (page 7-12) - This section
reviews the configuration settings that are relevant to the Spark Developer Template,
including the component-specific approach demonstrated by the XML configuration file
vsspark-rdd.xml.
• Sample Data for the Spark Developer Template (page 7-14) - This section provides a
description of the sample data provided for the Spark Developer Template.
• Running the Spark Developer Template (page 7-14) - This section provides instructions
for running the Scala-based variants of the Spark Developer Template, as provided, as
well as how to make the source code changes necessary to use the simple processor
instead of the default batch processor. It also includes instructions for running the three
PySpark variants of the Spark Developer Template.
• Limitations of the Spark Developer Template (page 7-26) - This section provides
information about the type of improvements you will need to make in order to create a
production-grade Spark solution that integrates calls to the Voltage SecureData APIs.
The Spark Developer Template uses the Spark driver/processor model to protect incoming
plaintext or to access incoming ciphertext from an input CSV file, using Format-
Preserving Encryption (FPE) or Secure Stateless Tokenization™ (SST). The driver and
processor classes are written in the Scala programming language. Python drivers are also
included for the three PySpark variants of the Spark Developer Template.
The Spark Developer Template uses the same two XML configuration files that are used by the
other Hadoop Developer Templates: vsauth.xml and vsconfig.xml. Depending on the
variant, it may also use another Spark-specific XML configuration file named vsspark-
rdd.xml to contain the Spark-specific field configuration settings that associate a CSV column
number with a cryptId specified in the XML configuration file vsconfig.xml. These settings
could also be specified in that latter file (vsconfig.xml) in a fields element with its
component attribute set to spark.
NOTE: The Spark Developer Template demonstrates a feature of the common configuration
classes that allows the separation of configuration values that are specific to a particular
Hadoop technology into a separate configuration file for that technology. The other Hadoop
Developer Templates may take advantage of this possible isolation of configuration value
processing in a future release.
The Spark Developer Template makes use of the common infrastructure provided with Voltage
SecureData for Hadoop to retrieve global file-based configuration information and to provide
data translation, a cryptographic abstraction layer, and a REST client.
The Spark Developer Template provides three Scala packages for the five variants of the Spark
Developer Template:
• RDD Variant:
For information about this variant, described together with the Dataset variant, see
"RDD and Dataset Variants" (page 7-4).
• Dataset Variant:
For information about this variant, described together with the RDD variant, see "RDD
and Dataset Variants" (page 7-4).
• DataFrame, Spark SQL, and HiveUDF Variants:
For information about these variants, see "UDF-Based Variants" (page 7-8).
The Spark Developer Template also provides a PySpark (Python) package for interfacing with
the Scala UDF variants DataFrame and Spark SQL (provided in the package
com.voltage.securedata.spark.udf) and the Java Hive UDF (provided in the package
com.voltage.securedata.hadoop.hive).
For information about these variants, see "UDF-Based Variants" (page 7-8).
The following two sub-sections describe the Spark Developer Template variants based on
their shared functionality:
The RDD variant defines the Spark driver object SDSparkDriver. This object defines the
function main, which serves as the Spark job entry point. This object:
2. Reads the data in the input CSV file into a Spark RDD.
3. Calls to the Spark processor’s function processRDD (defined for both the batch
processing class SDSparkBatchProcessor and the simplified single-crypto-operation
processing class SDSparkProcessor). This function returns another, transformed
RDD with the cryptographic results.
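The following is a minimal Scala sketch of this RDD-based flow. It is not the shipped SDSparkDriver: the output directory and the placeholder protectRow function are assumptions standing in for the template's configuration loading and processor call:

import org.apache.spark.sql.SparkSession

object RddFlowSketch extends Serializable {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rdd-flow-sketch").getOrCreate()

    // Read the sample CSV file from HDFS into an RDD of lines.
    val inputRdd = spark.sparkContext.textFile("voltage/spark-plaintext-sample-data/plaintext.csv")

    // The processor returns a new, transformed RDD with the configured columns
    // protected or accessed; here the placeholder simply echoes each row.
    val resultRdd = inputRdd.map(line => protectRow(line.split(",", -1)).mkString(","))

    resultRdd.saveAsTextFile("voltage/sketch-rdd-output") // hypothetical output directory
    spark.stop()
  }

  // Placeholder for the per-row (or per-batch) cryptographic processing.
  private def protectRow(columns: Array[String]): Array[String] = columns
}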
The Dataset variant defines the Spark case class Person and the Spark driver object
SDSparkDatasetDriver. This driver object defines the function main, which serves as the
Spark job entry point. This object:
2. Defines an explicit schema (that matches the schema defined for the case class
Person) so that the data in the input CSV file can be read into a Spark DataFrame
object and then converted to a Spark Dataset object.
3. Calls to the Spark processor’s function processDS (defined for both the batch
processing class SDSparkDatasetBatchProcessor and the simplified single-crypto-
operation processing class SDSparkDatasetProcessor). This function returns
another, transformed Dataset with the cryptographic results.
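The following is a minimal Scala sketch of this Dataset-based setup. The Person fields shown are illustrative only and do not match the exact schema shipped with the template:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}

case class Person(name: String, email: String, ssn: String)

object DatasetFlowSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dataset-flow-sketch").getOrCreate()
    import spark.implicits._

    // An explicit schema that matches the case class, so the CSV can be read as
    // a DataFrame and then converted to a Dataset[Person].
    val schema = StructType(Seq(
      StructField("name", StringType),
      StructField("email", StringType),
      StructField("ssn", StringType)))

    val people = spark.read.schema(schema)
      .csv("voltage/spark-plaintext-sample-data/plaintext.csv")
      .as[Person]

    // The shipped driver would now hand the Dataset to the processor's processDS
    // function and write the transformed result; here we only display a few rows.
    people.show(10)
    spark.stop()
  }
}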
• The Batch Processors - These processors batch multiple cryptographic operations into
single direct calls to the shared Java Crypto interface methods
protectFormattedDataList and accessFormattedDataList. The real
advantage of these processors exists when the REST API is used for protecting or
accessing sensitive data, improving performance by avoiding network transactions for
individual cryptographic operations.
The Spark batch processor classes for these two variants are SDSparkBatchProcessor and
SDSparkDatasetBatchProcessor.
These Spark processor classes and their inner class SparkCrypto perform batched
cryptographic processing for their Spark jobs, potentially distributing the job across the
nodes of your Hadoop cluster. These classes gather the data to be protected (as
opposed to the data to be passed through, as is) into batches of a configurable size
(2000 by default), protecting or accessing all of the strings in each of the configured
columns as a single cryptographic operation before moving on to the next batch of
strings. Batch processing is especially important for efficiency when using the
Web Services API (the REST API), but is inherently more complicated. Successful
cryptographic processing of the input RDD or Dataset results in a different, transformed
RDD or Dataset being returned to the Spark driver. (A minimal sketch of this batching
pattern appears after these processor descriptions.)
NOTE: The Spark Developer Template source code, as shipped, calls these Spark
processors. To call the simple Spark processors, SDSparkProcessor or
SDSparkDatasetProcessor, you will need to modify the corresponding driver
source code, commenting out the code that calls the batch Spark processor and
uncommenting the code that calls the simple Spark processor.
The Spark simple processor classes for these two variants are SDSparkProcessor and
SDSparkDatasetProcessor.
These Spark processor classes and their inner class SparkCrypto perform single-
crypto-op processing for the Spark job, potentially distributing the job across the nodes
of your Hadoop cluster. These classes iterate through the input RDD or Dataset,
cryptographically processing the relevant strings one at a time and adding the results
to the output RDD or Dataset. Input strings not configured for cryptographic
processing are echoed, as is, to the output RDD or Dataset. It is important to note that
no batch processing is performed by these Spark processors: each individual string
to be protected or accessed is processed by itself, even when using the
Web Services API (the REST API). This can be very inefficient and is not appropriate
for production environments, but the code is much simpler and easier to follow.
NOTE: To use these processors, minor changes to the corresponding driver source
code are required to adjust which processor is called in each case: comment out the
code that calls the batch Spark processor and uncomment the code that calls the
simple Spark processor in the corresponding driver.
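The following Scala sketch illustrates the general partition-level batching pattern described above, using the default batch size of 2000. The shipped batch processor classes rely on the template's configuration and crypto classes, so their internals differ from this simplified form; protectBatch is a placeholder for a list-based protect call such as a single REST list operation:

import org.apache.spark.rdd.RDD

object BatchingPatternSketch extends Serializable {
  val BatchSize = 2000

  // Placeholder for a list-based protect or access call.
  private def protectBatch(values: Seq[String]): Seq[String] = values

  // Protect one column (by index) of an RDD of split CSV rows, issuing one
  // cryptographic call per batch of up to 2000 values rather than one per value.
  def protectColumn(rows: RDD[Array[String]], index: Int): RDD[Array[String]] =
    rows.mapPartitions { partition =>
      partition.grouped(BatchSize).flatMap { batch =>
        val protectedValues = protectBatch(batch.map(_(index)))
        batch.zip(protectedValues).map { case (row, value) =>
          row.updated(index, value)
        }
      }
    }
}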
The index-based approach to protecting and accessing columns in these variants of the Spark
Developer Template is based on the creation of an index-based crypto map using the XML
configuration file vsspark-rdd.xml, which associates a column index with a cryptId, the latter
of which provides the Voltage SecureData protection format, protection API, authentication
information, and so on.
There is an important difference between the RDD and Dataset variants of the Spark
Developer Template, relevant to both the batch and the simple processors and their use of an
index-based approach to specifying which columns to protect and access: the Dataset
variant must take extra steps to use the current index and reflection to retrieve column values
as strings to pass to the performCrypto method of the relevant SparkCrypto object, and to
convert a processed row back to a case class Person record in the resulting Dataset object:
• Convert the processed row back to a case class Person record in the resulting Dataset object:
convertedRows(i) = convertToPerson(savedRows(i))
NOTE: These code examples are from the Dataset variant batch processor in the file
SDSparkDatasetBatchProcessor.scala.
UDF-Based Variants
This section discusses the driver and processor functionality for the DataFrame, Spark SQL,
and HiveUDF variants of the Spark Developer Template, which share enough functionality to
warrant common description.
The Spark Developer Template also provides PySpark versions of all three UDF variants. The
DataFrame and Spark SQL versions of these variants work by having a PySpark driver
module call the shared Scala processor class through a UDF. The HiveUDF version of these
variants works by having a PySpark driver module call the Java Hive classes directly.
• The Scala version of the DataFrame variant calls the
protectColumn and accessColumn functions of the SDSparkUDFProcessor Scala
class by first calling UDF functions of the same name (protectColumn and
accessColumn) in the SDSparkDataFrameDriver Scala object. It then creates the
output DataFrame object by calling the protect or access UDF using the withColumn
function of the DataFrame object for each column to be protected or accessed,
respectively. (A minimal sketch of this UDF pattern appears after these variant descriptions.)
The PySpark version of the DataFrame variant has Python functions that are also
named protectColumn and accessColumn. These Python functions use PySpark to
call down into the Scala code in the SDSparkDataFrameDriver and
SDSparkUDFProcessor classes, as follows:
Level 1:
These Python functions use the PySpark library functions _to_seq and
_to_java_column when calling the Scala functions protectColumn and
accessColumn in the SDSparkDataFrameDriver Scala object (see level 2
below). This is required to convert the input to a Java sequence as required by
level 2 processing. Upon return, the PySpark library function Column is used to
cast the results back to a form Python can interpret.
It then creates the output DataFrame object by calling these functions within
the withColumn function of the DataFrame object for each column to be
protected or accessed, respectively.
Level 2:
These Scala functions create and return UDFs that call the functions of the
same names in the SDSparkUDFProcessor Scala class to perform the
cryptographic processing (see level 3 below). They also need to use currying
to combine the cryptId and col parameters at the Python level into the
single parameter cryptId at this level.
NOTE: For the Scala version of the DataFrame variant, function main in
the SDSparkDataFrameDriver object calls these functions within the
withColumn function of the DataFrame object for each column to be
protected or accessed, respectively.
Level 3:
The Scala functions at this level use the SparkCrypto class to access
the shared CryptoFactory Java code to perform the actual
cryptographic operations.
Both of these variants use the where function of the DataFrame object to limit the
processing to the first 10 rows of data in the input DataFrame object, producing only
10 rows of data in the output DataFrame object, resulting in only 10 lines being written
to the output CSV file at the end of driver processing.
• Both versions of the Spark SQL variant create a temporary database view of the data in
the DataFrame object in preparation for the upcoming SQL query, one in the Scala
driver and one in the PySpark driver. They both register the protectColumn and
accessColumn functions of the SDSparkUDFProcessor class as UDFs of the same
name, but somewhat differently from the two different Spark drivers:
• The Scala version of the Spark SQL variant registers the UDFs in the sibling
functions registerProtectUDF and registerAccessUDF, both of which are
called by the function main.
• The PySpark version of the Spark SQL variant calls these Scala UDF registration
functions registerProtectUDF and registerAccessUDF in the
SDSparkSQLDriver object by using a JVM version of the SQL Context.
Then, using the sql function of the SQLContext object contained within the
SparkSession object, they both run a SQL query that calls the protect or access UDF
for each column to be protected or accessed, respectively. And as with all of the other
UDF variants, the query is limited to the first 10 rows of data in the temporary database
view, producing only 10 rows of data in the output DataFrame object, resulting in only
10 lines being written to the output CSV file at the end of driver processing.
NOTE: Due to how the SQL SELECT statements are used for this variant, only the
protected or accessed data appears in the output DataFrame object and output CSV
file. As shipped, these fields are the final four fields in each row of data: the email
address, the birth date, the credit card number, and the Social Security number.
• Both the Scala version and the PySpark version of the HiveUDF variant create a temporary
database view of the data in the DataFrame object in preparation for the upcoming
SQL query. Unlike the other two types of UDF variants, they both create temporary
protect and access UDFs from the Hive classes ProtectData and AccessData, also
included in the Hadoop Developer Templates.
Then, using the sql function of the SQLContext object contained within the
SparkSession object, they both run a SQL query that calls the protect or access Hive
UDF for each column to be protected or accessed, respectively. And as with all of the
other UDF variants, the query is limited to the first 10 rows of data in the temporary
database table, producing only 10 rows of data in the output DataFrame object,
resulting in only 10 lines being written to the output CSV file at the end of driver
processing.
NOTE: As with both versions of the Spark SQL variant, due to how the SQL SELECT
statements are used for this variant, only the protected or accessed data appears in
the output DataFrame object and output CSV file. As shipped, these fields are the
final four fields in each row of data: the email address, the birth date, the credit card
number, and the Social Security number.
NOTE: As with the Hive Developer Template, the UDF variants of the Spark Developer
Template are inherently limited to protecting and accessing sensitive data one value at a
time (in other words, no batch processing is possible). Especially when subject to the
network overhead of the Voltage SecureData Web Services API (the REST
API) used for tokenization, processing large sets of data using UDFs can be prohibitively
slow. That is why the UDF variants limit their processing to the first 10 rows of the 10K rows
in the input CSV file, as described above.
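The following Scala sketch illustrates the general UDF pattern shared by the DataFrame and Spark SQL variants. It is not the shipped driver or processor code: protectValue is a placeholder for the SDSparkUDFProcessor call, the in-memory rows stand in for the sample CSV data, and the cryptId name alpha follows the configuration examples in this guide:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object UdfPatternSketch extends Serializable {
  // Placeholder for the cryptographic call made by SDSparkUDFProcessor.
  private def protectValue(cryptId: String)(value: String): String = value

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("udf-pattern-sketch").getOrCreate()
    import spark.implicits._

    // A tiny in-memory stand-in for the sample data.
    val df = Seq(("Pat", "pat@example.com"), ("Lee", "lee@example.com")).toDF("name", "email")

    // DataFrame variant: currying fixes the cryptId so the resulting UDF takes
    // only the column to process.
    val protectAlpha = udf(protectValue("alpha") _)
    val protectedDf = df.withColumn("email", protectAlpha(col("email"))).limit(10)

    // Spark SQL variant: register the function and call it from a SQL query.
    spark.udf.register("protectColumn", (value: String, cryptId: String) => protectValue(cryptId)(value))
    df.createOrReplaceTempView("people")
    val sqlResult = spark.sql("SELECT protectColumn(email, 'alpha') AS email FROM people LIMIT 10")

    protectedDf.show()
    sqlResult.show()
    spark.stop()
  }
}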
For both versions of all three UDF variants, the parameters to the UDFs are as follows:
• The name of the column in the DataFrame object, as established when its schema was
defined, which is also mapped to the temporary database view or table for the
Spark SQL and HiveUDF variants, respectively. This will provide the UDF with the
plaintext or ciphertext value to be protected or accessed, respectively.
• The name of a cryptId specified in the XML configuration file vsconfig.xml, which in
turn specifies the information required to protect or access the values in the specified
column. This information include a data protection format, the Voltage SecureDataAPI,
the identity and authentication information for cryptographic key derivation, and so on.
This cryptId approach for protecting and accessing columns in these variants of the
Spark Developer Template is based on one loaded-on-demand entry for each unique
combination of cryptographic processing choices (format, identity, choice of API,
translator class, if any, and so on), but not necessarily for each column of data to be
processed (two columns can use the same cryptId and thus same crypto map entry).
NOTE: Note that for the DataFrame variants, at the level of the
SDSparkDataFrameDriver, these two parameters are effectively combined into one
parameter through the use of currying. This was necessary in order to have the cryptId
be treated as a string instead of as the name of a second database column.
The return value of each UDF call is the protected or accessed value, collected into the
resulting DataFrame object. This is either with the unprocessed column values in the case of
the DataFrame variant, or without the unprocessed column values in the case of the Spark
SQL and HiveUDF variants. All of the UDF drivers end by writing the contents of the
DataFrame object to the output CSV file.
The HiveUDF variant of the Spark Developer Template does not use the
SDSparkUDFProcessor class, instead relying directly on the protect and access UDFs
provided by the Hive Developer Template, which serve as virtual processors for this variant of
the Spark Developer Template:
com.voltage.securedata.hadoop.hive.ProtectData
com.voltage.securedata.hadoop.hive.AccessData
NOTE: The wrapper object used to serialize the Hive UDFs also uses lazy initialization,
allowing the Hive UDFs to work in the context of a Spark session.
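The following Scala sketch shows the shape of the HiveUDF-variant flow. It assumes the Hive Developer Template UDF JAR is on the Spark classpath; the temporary function name protectdata and the (value, cryptId) call shape follow the UDF parameter conventions described above but are otherwise assumptions, and the in-memory view stands in for the sample data:

import org.apache.spark.sql.SparkSession

object HiveUdfFlowSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive-udf-flow-sketch")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._

    // A tiny in-memory stand-in for the sample data, exposed as a temporary view.
    Seq(("Pat", "pat@example.com"), ("Lee", "lee@example.com"))
      .toDF("name", "email")
      .createOrReplaceTempView("people")

    // Register the Hive protect UDF from the Hive Developer Template as a
    // temporary function, then call it from a Spark SQL query limited to 10 rows.
    spark.sql(
      "CREATE TEMPORARY FUNCTION protectdata AS 'com.voltage.securedata.hadoop.hive.ProtectData'")
    spark.sql("SELECT protectdata(email, 'alpha') AS email FROM people LIMIT 10").show()

    spark.stop()
  }
}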
For the RDD and Dataset variants of the Spark Developer Template, note that the batch
processors log more information than the single-crypto-operation processors. The former
processor logs information about each batch as it is processed, while the latter processor just
logs information about which type of cryptographic operation (protect or access) is being
performed for the entire job (log entries for each individual cryptographic operation would
produce too much log output for most purposes).
For all versions of the UDF variants of the Spark Developer Template, both within the class
SDSparkUDFProcessor and within the Hive UDFs, logging is initialized but no logging calls are
included by default, due to the volume of log output that would be generated for single-value
protection and access operations. You can add temporary logging calls as needed for your own
debugging purposes.
The Spark Developer Template uses a total of three configuration files. Two of these
configuration files are the same as the ones used by the other Hadoop Developer Templates.
The third configuration file is specific to the RDD and Dataset variants of the Spark Developer
Template and contains Spark-specific configuration settings about the fields to be protected or
accessed. These settings are similar to the component-specific settings for the MapReduce
Developer Template in the configuration file vsconfig.xml. These variants of the Spark
Developer Template take advantage of common configuration processing functionality that
allows their Spark-specific settings to be isolated into a separate configuration file for index-
based column protection settings.
• vsconfig.xml - The Spark Developer Template uses the version of this XML
configuration file used by the other Hadoop Developer Templates, located in the
following directory:
<install_dir>/config
NOTE: This configuration file contains field configuration settings for the MapReduce
and Sqoop templates, but the parallel settings for the RDD and Dataset variants of
the Spark Developer Template are defined in the XML configuration file vsspark-
rdd.xml, described below.
For more information about these settings, see "Configuration Settings" (page 3-5),
"vsconfig.xml" (page 3-33), "Common Configuration" (page 3-57) and "Hadoop
Configuration" (page 3-60).
• vsauth.xml - The Spark Developer Template uses the version of this configuration
file used by the other Hadoop Developer Templates, located in the following directory:
<install_dir>/config
For more information about these settings, see "Configuration Settings" (page 3-5),
"vsauth.xml" (page 3-38), "Common Configuration" (page 3-57), and the comments in
the XML configuration file itself.
• vsspark-rdd.xml - The settings in this XML configuration file provide the index-
based field settings that control which fields (columns) in the Spark job’s input CSV file
(and RDD or Dataset) are subject to cryptographic processing. For example:
<fields>
<field index = "7" cryptId = "alpha"/>
<field index = "8" cryptId = "date"/>
<field index = "9" cryptId = "cc"/>
<field index = "10" cryptId = "ssn"/>
</fields>
The RDD and Dataset variants of the Spark Developer Template use this type of XML
configuration file. These are the same type of index-based field configuration settings
that are used by the MapReduce Developer Template in the XML configuration file
vsconfig.xml.
Before you begin to modify the Spark Developer Template XML configuration files for your own
purposes, such as using your own Voltage SecureData Server or different data formats, Micro
Focus Data Security recommends that you first run the Spark Developer Template samples as
provided, giving you assurance that your Hadoop cluster is configured correctly and
functioning as expected.
You will need to update many of the values in the XML configuration files vsauth.xml,
vsconfig.xml, and vsspark-rdd.xml in order to protect your own data using your own
Voltage SecureData Server.
CAUTION: This sample approach may not be appropriate in your production Spark
integrations.
The same alternative approaches to configuration that are suggested for the MapReduce
Developer Template can be considered for your production Spark integrations. For more
information, see "Other Approaches to Providing Configuration Settings" (page 3-52).
The Spark Developer Template sample jobs use the same sample data as the MapReduce
Developer Template, located in the following file:
<install-dir>/sampledata/plaintext.csv
This file contains 10,000 rows of dummy sample data consisting of names, addresses, email
addresses, birth dates, credit card numbers, and Social Security numbers. The XML
configuration file vsspark-rdd.xml, by default, specifies the protection of the final four fields
of each line using cryptIds named alpha, date, cc, and ssn, respectively.
The script run-spark-prepare-job copies the sample input CSV file from the local file
system to a Spark sample data directory in HDFS (voltage/spark-plaintext-sample-
data) within the user’s home directory, which is where the Spark protect scripts, such as run-
spark-protect-rdd-job, by default, expect to find it.
Run-Time Prerequisites
To run the Spark sample job and to write the results to HDFS, the following services need to be
configured and running on the Hadoop cluster:
• YARN or Mesos (if you are using the YARN resource manager or Mesos cluster)
Also, make sure that the dependency versions specified in the file build.sbt are correct for
the version of Spark you are using.
If you make a change to the value of most of the following variables in one script, you will need
to make a corresponding change in the relevant upstream or downstream script. The variables
are:
• plaintextDir - In the prepare job, the HDFS directory (relative to the user’s home
directory) into which the original plaintext CSV file is copied.
• inputDir - In the protect and access jobs, the HDFS directory (relative to the user’s
home directory) from which the source original plaintext CSV file and protected CSV file,
respectively, are processed.
• outputDir - In the protect and access jobs, the HDFS directory (relative to the user’s
home directory) into which the resulting protected CSV file and accessed CSV file,
respectively, are placed. The output files (part-*) from the different Spark partitions
are also placed in this directory.
• plaintextFilename - In the prepare and protect jobs, the name of the original
plaintext CSV file that is the HDFS copy destination of the former job and that provides
the input to the latter job.
• protectedFilename - In the protect and access jobs, the name of the protected CSV
file in HDFS that receives the output of the former job and that provides the input to the
latter job, respectively. Also the name of the protected CSV file that will be saved locally
in the following local directory:
<install-dir>/spark/sampledata
• accessedFilename - In the access job, the name of the accessed CSV file in HDFS
that receives the output of that job. Also the name of the accessed CSV file that will be
saved locally in the following local directory:
<install-dir>/spark/sampledata
1. Build the target Spark JAR file (see “Building the Spark Developer Template” on page 2-
17).
2. Optionally edit the input and output path and filename variables in the prepare, protect,
or access scripts. For more information, see "Changing the Input and Output Locations
and Filenames on HDFS" (page 7-15).
b. Run the Spark prepare sample job by running the following script:
./run-spark-prepare-job
While, in general, you can run the protect job for any of the variants of the Spark Developer
Template, including the PySpark versions, independently of any other variant, you do need to run
a particular protect job before you run an access job because the output of the former (protect)
job serves as the input to the latter (access) job. However, as shipped, these pairings of protect
and access jobs may not be intuitive. The pairings are as follows:
Table 7-1 Protect Script Prerequisites for Each Access Script
Before you run this access script: You must run this protect script:
run-spark-access-rdd-job run-spark-protect-rdd-job
run-spark-access-dataset-job run-spark-protect-dataset-job
run-spark-access-dataframe-job run-spark-protect-rdd-job
run-spark-access-sql-job run-spark-protect-rdd-job
run-spark-access-hive-job run-spark-protect-rdd-job
run-pyspark-access-dataframe-job run-spark-protect-rdd-job
run-pyspark-access-sql-job run-spark-protect-rdd-job
run-pyspark-access-hive-job run-spark-protect-rdd-job
The steps to run one or more of the job scripts for the variants of the Spark Developer
Template, including the PySpark versions, are as follows:
1. Change the file permissions of the XML configuration file vsauth.xml so that only the
current user can read it (this XML configuration file contains sensitive authentication
credentials that must be kept private):
chmod 0600 <install_dir>/config/vsauth.xml
2. Change the current directory to the Spark scripts directory:
cd <install_dir>/spark/bin
3. Run one or more of the variants of the Spark protect sample jobs, including the PySpark
protect sample jobs, as follows:
./run-spark-protect-rdd-job
./run-spark-protect-dataset-job
./run-spark-protect-dataframe-job
./run-spark-protect-sql-job
./run-spark-protect-hive-job
./run-pyspark-protect-dataframe-job
./run-pyspark-protect-sql-job
./run-pyspark-protect-hive-job
NOTE: The UDF variants (DataFrame, Spark SQL, and HiveUDF), as shipped, produce
only a small set of protection results, and the Spark SQL and HiveUDF results include
only the data from the protected columns (unprocessed columns do not appear in the
output). These characteristics apply to the PySpark versions as well.
4. Run one or more of the variants of the Spark access sample jobs, as follows, paying
attention to each access job’s protect job prerequisite as shown in Table 7-1 on page 7-
17:
./run-spark-access-rdd-job
./run-spark-access-dataset-job
./run-spark-access-dataframe-job
./run-spark-access-sql-job
./run-spark-access-hive-job
./run-pyspark-access-dataframe-job
./run-pyspark-access-sql-job
./run-pyspark-access-hive-job
NOTE: The UDF variants (DataFrame, Spark SQL, and HiveUDF), as shipped, produce
only a small set of access results, and the Spark SQL and HiveUDF results include
only the data from the accessed columns (unprocessed columns do not appear in the
output). These characteristics apply to the PySpark versions as well.
For example, to run the Spark job using YARN with a deploy mode of cluster when using HDP,
the following additional arguments must be added to the spark-submit command:
spark-submit --master yarn --deploy-mode cluster --jars ${...
Depending on your Hadoop distribution, you may need to use some combination of the
--master and --deploy-mode arguments to run the Spark sample jobs in a non-default
mode.
For more information for your distribution, use the --help argument to the spark-submit
command.
NOTE: All of these scripts expect the current directory in the local file system to be set to
<install_dir>/spark/bin when they are run.
run-spark-prepare-job
This script:
• Copies the original plaintext CSV file from the local file system to the expected location
in HDFS.
• Copies the three configuration files used by the Spark Developer Template from the
local file system to their default location in HDFS.
• Copies all of the necessary Hadoop and common infrastructure JAR files to the directory
<install_dir>/spark/lib on the local file system.
Invocation:
./run-spark-prepare-job
This script must be run from the directory <install_dir>/spark/bin and it has no
parameters.
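For example, after the script completes, you can verify the locally staged JAR files (an illustrative command):
ls <install_dir>/spark/lib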
run-spark-protect-rdd-job
run-spark-protect-dataset-job
run-spark-protect-dataframe-job
run-spark-protect-sql-job
run-spark-protect-hive-job
These scripts run the Spark protect sample jobs for the Scala versions of each of the five
variants of the Spark Developer Template. They protect the sample data in the shared sample
data file plaintext.csv and consolidate the output from the various Spark partitions running
the protect job into a single output file in the following directories on both the local file system
and on HDFS:
Local File System: <install_dir>/spark/sampledata
HDFS: <users_home_dir>/voltage/spark-protected-sample-data/<variant>
NOTE: This directory will contain each partition’s results in files with names
beginning with part-, as well as the consolidated results in a file named as
described below.
As shipped, each of the variants produces an output file of protected sample data using a
unique filename for its variant in the directories described above:
• RDD: protectedRDD.csv
• Dataset: protectedDataset.csv
• DataFrame: protectedDataFrame.csv
• HiveUDF: protectedHiveData.csv
Invocations:
./run-spark-protect-rdd-job
./run-spark-protect-dataset-job
./run-spark-protect-dataframe-job
./run-spark-protect-sql-job
./run-spark-protect-hive-job
These scripts must be run from the directory <install_dir>/spark/bin and they have no
parameters.
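For example, after the RDD protect job completes, you can compare a few plaintext records with the corresponding protected records on the local file system (illustrative commands, using the default paths and filenames described above):
head -3 <install_dir>/sampledata/plaintext.csv
head -3 <install_dir>/spark/sampledata/protectedRDD.csv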
Note that each Spark protect script includes code that detects the Hadoop distribution in use
and then attempts to ensure that the correct version of Spark (Spark2) is being invoked for that
distribution. This can be an issue when multiple versions of Spark are installed at the same time.
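If you need to confirm which Spark installation a given launcher resolves to on your cluster, the standard version flag can help (illustrative commands; spark2-submit exists only on distributions that install Spark2 under a separate command name):
which spark-submit
spark-submit --version
spark2-submit --version    # only on distributions with a separate Spark2 launcher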
run-spark-access-rdd-job
run-spark-access-dataset-job
run-spark-access-dataframe-job
run-spark-access-sql-job
run-spark-access-hive-job
These scripts run the Spark access sample jobs for the Scala versions of each of the five
variants of the Spark Developer Template. They access the sample data created by a particular
prerequisite Spark protect sample job, as described in Table 7-1 on page 7-17, and consolidate
the output from the various Spark partitions running the access job into a single output file in
the following directories on both the local file system and on HDFS:
Local File System: <install_dir>/spark/sampledata
HDFS: <users_home_dir>/voltage/spark-accessed-sample-data/<variant>
NOTE: This directory will contain each partition’s results in files with names
beginning with part-, as well as the consolidated results in a file named as
described below.
As shipped, each of the variants produces an output file of accessed sample data using a
unique filename for its variant in the directories described above:
• RDD: accessedRDD.csv
• Dataset: accessedDataset.csv
• DataFrame: accessedDataFrame.csv
• HiveUDF: accessedHiveData.csv
Invocations (after running the appropriate protect sample job for each access sample job, as
described in Table 7-1 on page 7-17):
./run-spark-access-rdd-job
./run-spark-access-dataset-job
./run-spark-access-dataframe-job
./run-spark-access-sql-job
./run-spark-access-hive-job
These scripts must be run from the directory <install_dir>/spark/bin and they have no
parameters.
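As a quick sanity check (illustrative; if the round trip succeeded, the accessed values should match the original plaintext, although field ordering or quoting may differ):
diff <install_dir>/sampledata/plaintext.csv <install_dir>/spark/sampledata/accessedRDD.csv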
Note that each Spark access script includes code that detects the Hadoop distribution in use
and then attempts to ensure that the correct version of Spark (Spark2) is being invoked for that
distribution. This can be an issue when multiple versions of Spark are installed at the same time.
run-pyspark-protect-dataframe-job
run-pyspark-protect-sql-job
run-pyspark-protect-hive-job
These scripts run the Spark protect sample jobs for the Python versions of each of the three
UDF variants of the Spark Developer Template. They protect the sample data in the shared
sample data file plaintext.csv and consolidate the output from the various Spark partitions
running the protect job into a single output file in the following directories on both the local file
system and on HDFS:
Local File System: <install_dir>/spark/sampledata
HDFS: <users_home_dir>/voltage/pyspark-protected-sample-data/<variant>
NOTE: This directory will contain each partition’s results in files with names
beginning with part-, as well as the consolidated results in a file named as
described below.
As shipped, each of the variants produces an output file of protected sample data using a
unique filename for its variant in the directories described above:
• DataFrame: protectedPySparkDataFrame.csv
• HiveUDF: protectedPySparkHiveData.csv
Invocations:
./run-pyspark-protect-dataframe-job
./run-pyspark-protect-sql-job
./run-pyspark-protect-hive-job
These scripts must be run from the directory <install_dir>/spark/bin and they have no
parameters.
Note that each Spark protect script includes code that detects the Hadoop distribution in use
and then attempts to ensure that the correct version of Spark (Spark2) is being invoked for that
distribution. This can be an issue when multiple versions of Spark are installed at the same time.
run-pyspark-access-dataframe-job
run-pyspark-access-sql-job
run-pyspark-access-hive-job
These scripts run the Spark access sample jobs for the Python versions of each of the three
UDF variants of the Spark Developer Template. They access the sample data created by a
particular prerequisite Spark protect sample job, as described in Table 7-1 on page 7-17, and
consolidate the output from the various Spark partitions running the access job into a single
output file in the following directories on both the local file system and on HDFS:
Local File System: <install_dir>/spark/sampledata
HDFS: <users_home_dir>/voltage/pyspark-accessed-sample-data/<variant>
NOTE: This directory will contain each partition’s results in files with names
beginning with part-, as well as the consolidated results in a file named as
described below.
As shipped, each of the variants produces an output file of accessed sample data using a
unique filename for its variant in the directories described above:
• DataFrame: accessedPySparkDataFrame.csv
• HiveUDF: accessedPySparkHiveData.csv
Invocations (after running the appropriate protect sample job for each access sample job, as
described in Table 7-1 on page 7-17):
./run-pyspark-access-dataframe-job
./run-pyspark-access-sql-job
./run-pyspark-access-hive-job
These scripts must be run from the directory <install_dir>/spark/bin and they have no
parameters.
Note that each Spark access script includes code that detects the Hadoop distribution in use
and then attempts to ensure that the correct version of Spark (Spark2) is being invoked for that
distribution. This can be an issue when multiple versions of Spark are installed at the same time.
update-spark-config-files-in-hdfs
This script updates the Spark-specific XML configuration file for the Spark Developer Template
(vsspark-rdd.xml) from the local file system to its default location in HDFS.
Use this script in conjunction with the generic Hadoop Developer Templates script
update-config-files-in-hdfs in the directory <install_dir>/bin to update all three
configuration files used by the Spark Developer Template. For more information about this
generic script, see "Loading Updated Configuration Files into HDFS" (page 3-77).
Invocation:
./update-spark-config-files-in-hdfs
This script must be run from the directory <install_dir>/spark/bin and it has no
parameters.
TrustStores Used by the Spark Developer Template
The Spark Developer Template uses the same approach to trustStore management as the
other Hadoop Developer Templates. This involves the use of the class
TruststoreInitializer to duplicate the trusted root certificates in the OpenSSL trustStore
in the JVM trustStore, eliminating the need to directly manage such certificates in both of these
trustStores. For more information, see "Multiple Developer Template TrustStores - Background
and Usage" (page 3-56).
Using the Spark Web Server
The Spark web UI can be accessed at a URL of the form:
http://<Spark_UI_Server>:<Port>
Where <Spark_UI_Server> and <Port> are dependent on the Hadoop distribution you
are using.
For example, in the Ambari user interface, for version 2.6.1 of the HDP (Hortonworks),
information about the Spark UI Server can be found by clicking on Spark2, then Quick Links,
then Spark UI Server.
The History Server in the Spark UI lists the Spark jobs, with the most recent jobs listed first.
Information about each job can be viewed by clicking its App ID link.
To view environment information about a particular job, click the App ID link and then the
Environment tab.
To view job logs, click the App ID link, go to the Executors tab, and choose stderr. When
running under YARN, the job logs are displayed within the YARN ResourceManager UI.
Limitations of the Spark Developer Template
The Spark Developer Template demonstrates a simple approach to integrating the Voltage
SecureData APIs into a Spark job in order to protect and access sensitive data. There are
several limitations to this approach that are worth mentioning:
• The Spark Developer Template reads its original plaintext sample data from a CSV file. If
your plaintext data is in a different format, you will need to customize the source code
for one or more of the Scala and Python Spark drivers in the following files to convert
the plaintext from your source format to the format (RDD or DataFrame) expected by
the Spark Developer Template processor(s).
• SDSparkDriver.scala
• SDSparkDatasetDriver.scala
• SDSparkDataFrameDriver.scala
• SDSparkSQLDriver.scala
• SDSparkHiveDriver.scala
• sd_pyspark_dataframe_driver.py
• sd_pyspark_sql_driver.py
• sd_pyspark_hive_driver.py
Likewise, if your scenario requires the output to be in a form other than a CSV file, the
final step in these Spark drivers will need to be changed accordingly.
8 NiFi Integration
The NiFi Developer Template demonstrates how to integrate Voltage SecureData data
protection technology in the context of NiFi. This demonstration includes the use of the Simple
API (version 4.0 and greater) and the REST API.
NOTE: Both the Voltage SecureData for Hadoop Developer Templates and NiFi use the term
template in their own way. For the former, template is used to convey the fact that the Java
packages and the associated configuration approach provided for the MapReduce, Hive,
Sqoop, Spark, and NiFi integrations are meant to provide guidance and a starting point for
your own similar, but naturally more robust integrations.
Within NiFi, a template is a save-able and reuse-able set of connected, potentially pre-
configured processors.
In this document, NiFi Developer Template refers to the Voltage SecureData NiFi integration
and NiFi template, when used at all, has the stand-alone NiFi meaning.
NiFi provides an obvious integration opportunity in the form of its individual processors. NiFi
processors provide discrete processing steps in a flow of data. The SecureDataProcessor
NiFi processor provided with the NiFi Developer Template serves as an example of a NiFi
processor that can be configured to either protect or access the data flowing through it, and
works in conjunction with the Java packages in the common infrastructure. This chapter
provides a description of the former as well as instructions on how to run the NiFi Developer
Template in two different ways using the provided sample data. For more information about
the common infrastructure used by the Developer Templates, see Chapter 3, “Common
Infrastructure”.
NOTE: The NiFi Developer Template comes with its own sample data. It has been simplified
even further for demonstration purposes, with just a single type of data, such as credit card
numbers, Social Security numbers, or email addresses, provided in each input file, one data
value per line. For more information about these sample data files, see "Sample Data for the
NiFi Developer Template" (page 8-17).
Another important aspect of understanding the NiFi Developer Template is to understand the
configuration settings it uses. Unlike the other three Datastream Developer Templates, which
read all of their configuration settings from their two XML configuration files (vsconfig.xml
and vsauth.xml), the NiFi Developer Template reads its (global) configuration settings from a
Java Properties configuration file (vsnifi.properties) with each SecureDataProcessor
getting its individual settings through the NiFi user interface.
Much of the documentation related to the global configuration settings relevant to the NiFi
Developer Template is provided in Chapter 3 in the section "Configuration Settings" (page 3-5).
This section provides information about the common infrastructure Java classes used to read
and create in-memory copies of the settings, as well as a description of the individual settings.
This chapter will review these global configuration settings in the context of the NiFi Developer
Template as well as provide information about configuring a SecureDataProcessor using
the NiFi user interface.
• Quick Start Using the Provided NiFi Workflow (page 8-2) - This section provides
instructions for deploying and running the NiFi Developer Template using the provided
pre-configured workflow with the provided sample data and using the public-facing
Voltage SecureData Server dataprotection hosted by Micro Focus Data Security.
NOTE: For instructions about building the NiFi Developer Template, see "Building the
Datastream Developer Templates" (page 2-19).
• Integration Architecture of the NiFi Developer Template (page 8-8) - This section
explains the Java classes that are specific to the NiFi Developer Template including the
classes that implement the SecureDataProcessor and the classes for retrieving the
global configuration settings from the configuration file vsnifi.properties.
• Configuration Settings for the NiFi Developer Template (page 8-10) - This section
reviews the global configuration settings that are relevant to the NiFi Developer
Template and explains the properties set for individual instances of a
SecureDataProcessor using the NiFi user interface.
• Sample Data for the NiFi Developer Template (page 8-17) - This section provides a
description of the simplified sample data provided for the NiFi Developer Template.
• Limitations and Simplifications of the NiFi Developer Template (page 8-20) - This
section provides information about the type of improvements you will need to make in
order to create a production-grade NiFi processor that integrates calls to the Voltage
SecureData APIs.
Quick Start Using the Provided NiFi Workflow
After you have installed the Simple API (see the Voltage SecureData Simple API Installation
Guide) and installed and built the NiFi Developer Template (see "Installing the Developer
Templates" on page 2-7 and "Building the Datastream Developer Templates" on page 2-19),
follow these steps to deploy and run the provided NiFi workflow and exercise the
SecureDataProcessor using the provided sample data and the public-facing Voltage
SecureData Server dataprotection hosted by Micro Focus Data Security:
2. Deploy the Configuration File for the Voltage SecureData NiFi Processor
After making any necessary changes to the Java Properties configuration file
vsnifi.properties, such as to the Simple API install path setting
simpleapi.install.path, copy it to the following directory on your NiFi server’s file
system:
<nifi-install-location>/conf
Start (or restart) your NiFi server, using the script in the following directory:
<nifi-install-location>/bin
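For example (run from the directory containing your edited vsnifi.properties; nifi.sh is the standard Apache NiFi control script and is assumed to be present in the bin directory):
cp vsnifi.properties <nifi-install-location>/conf/
<nifi-install-location>/bin/nifi.sh restart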
When you are working on the NiFi server, the NiFi user interface is usually launched in a
compatible browser from the following URL:
http://<host>:8080/nifi
From the NiFi user interface, begin by clicking the Upload Template button in the
Operate Palette:
The Upload Template dialog box opens. Click the magnifying glass Browse button,
browse to and choose the example template SecureDataExample.xml from the
directory <install-dir>/stream/nifi/template, and then upload it.
From the NiFi user interface, drag the Template tool icon from the Components
Toolbar into the NiFi workspace.
The Add Template dialog box opens. Confirm that the SecureDataExample NiFi
template is displayed in the Choose Template dropdown list and then click Add.
NOTE: If you have imported more than one NiFi template, you may need to choose
the SecureDataExample NiFi template in the Choose Template dropdown list
before clicking Add.
For each of the following processors, right-click in the processor and choose Configure
to open the Configure Processor dialog box. In the Properties tab of this dialog box,
enter an appropriate directory path of your choosing as the value of the indicated
property, and then click Apply:
NOTE: You will need to create three empty directories on your NiFi server’s local file
system to serve as the input, output, and error directories.
If you are using the public-facing Voltage SecureData Server dataprotection and the default
processor settings with the provided sample data in the file creditcard.txt, you will only
need to add a value for the SharedSecret property (use the value voltage123). For more
information about this step, see "Configuring the Properties of the NiFi
SecureDataProcessor" (page 8-12).
Press ctrl-A to select all of the processors in the SecureDataExample workflow and
then click the Start Component button in the Operate Palette.
The red square Stopped icon in front of each processor’s name changes to a green
triangle Started icon.
You are now ready to exercise the SecureDataExample workflow, as explained in the
following section.
If you are using the public-facing Voltage SecureData Server dataprotection, hosted by
Micro Focus Data Security, and default settings for the SecureDataProcessor, with the
addition of the SharedSecret property value voltage123, start by putting a copy of the
sample data file creditcard.txt into the input directory. If the protection operation
succeeds, you will soon find an output file named creditcard.txt in the output directory.
This output file will contain the ciphertext values corresponding to the plaintext values in the
input file.
Continue exercising the workflow by experimenting with protecting the other sample data files.
Remember that you will need to reconfigure the SecureDataProcessor appropriately (the
processor must be stopped to reconfigure it, and then restarted):
Finally, you can exercise the data access aspect of the workflow by using the successful output
files you have generated as input files for access operations. You will need to re-configure the
SecureDataProcessor to perform access operations and continue to exercise care that you
have configured the other properties of the SecureDataProcessor to match the input file
you intend to process.
By default, these settings are configured to use the public-facing Voltage SecureData Server
dataprotection, hosted by Micro Focus Data Security for demonstration purposes. If you are
using this Voltage SecureData Server during your initial experimentation with the NiFi
Developer Template, no changes are necessary (except, perhaps, to the install location of the
Simple API). However, you will need to copy this configuration file to the directory
<nifi-install-location>/conf on your NiFi server’s file system.
NOTE: After you begin using your own Voltage SecureData Server, you will need to change
the configuration settings in this file and copy it (again) to the directory
<nifi-install-location>/conf on your NiFi server’s file system.
As you adapt the NiFi Developer Template code for your own purposes, you are, of course, free
to change how this configuration information is provided to your SecureData processors for use
when initializing the Voltage SecureData APIs you intend to use.
For more information about the parameters that the SecureDataProcessor expects to find
in the configuration file vsnifi.properties (which do not extend the common
configuration properties used by all of the Developer Templates), see "Common Configuration"
(page 3-57).
Integration Architecture of the NiFi Developer Template
NiFi provides an obvious integration mechanism in the form of its extensible processor
architecture. The NiFi Developer Template pursues this approach by providing a SecureData
processor (named SecureDataProcessor) that can protect incoming plaintext or access
incoming ciphertext using Format-Preserving Encryption (FPE) or Secure Stateless
Tokenization™ (SST). The processor can be configured with the standard FPE and SST
parameters: format, identity, and authentication credentials in one of two forms. It can also be
configured to perform the protect or access operation using one of two different SecureData
data protection APIs: the Simple API (version 4.0 and greater) or the REST API (the latter API
can be used for SST processing while the former cannot).
The SecureData processor makes use of the common infrastructure provided with Voltage
SecureData for Hadoop for retrieving global file-based configuration information, providing
data translation and a cryptographic abstraction layer, as well as the REST client. The
SecureDataProcessor provides the following Java packages:
See "Processor Classes for the NiFi Developer Template" (page 8-8).
See "Configuration Classes for the NiFi Developer Template" (page 8-9).
The NiFi integration uses a NiFi processor to perform data protection processing on one or
more incoming plaintexts or ciphertexts. As shipped, it assumes that its input stream contains
one item of data to be protected or accessed “per line”, with no other data on the line, in a
format corresponding to the configured FPE or SST format. In other words, no parsing within a
line is required, and each plaintext or ciphertext is separated from the next one by the
appropriate line termination character(s), either a carriage-return or a carriage-return/line-feed
pair.
Failures are handled by throwing a runtime exception. This exception is caught in the
method onTrigger of the SecureDataProcessor class, which logs the error and
routes the entire flow file to the FAILURE relationship.
The SecureDataProcessor uses the classes in this package to read global configuration
settings from the Java Properties configuration file vsnifi.properties. As shipped, the
SecureDataProcessor does not require any global configuration settings beyond those that
are read and established in-memory by the package com.voltage.securedata.config, as
described in "Common Configuration" (page 3-57).
NOTE: The remaining configuration settings for the NiFi Developer Template are specific to
a given instance of the SecureDataProcessor: the SecureData API to use, the operation
type, and the FPE/SST parameters. This configuration information is integrated with the
processor itself as part of the NiFi processor definition and configured using the NiFi user
interface. For more information about this configuration information, see "Configuring the
Properties of the NiFi SecureDataProcessor" (page 8-12).
The NiFi Developer Template provides a class in this package that modestly extends the
parallel class in the common configuration package com.voltage.securedata.config.
The NiFi Developer Template uses this NiFi-specific class to read and parse the global
configuration values that it requires.
This class wraps the call to the NiFiConfigPopulator class in code that reads the
configuration file from the local file system. This approach allows for the addition of
other types of configuration loaders that load configuration data from different input
sources; reading from the local file system is just one example approach demonstrated
in the NiFi Developer Template.
The NiFi Developer Template configuration implementation provides an example of the types
of configuration information that your custom SecureData NiFi processor will need access to in
order to be able to use one or both of the two Voltage SecureData APIs demonstrated here in
NiFi workflows. It is also an example of one possible approach to making this configuration
information available to your custom NiFi processor.
With respect to error processing, the DataStreamCallback class treats all of the input in a
given input stream (a single file in the provided template workflow) collectively: if any protect or
access operations in a given stream fail, the entire stream fails, resulting in the following actions:
• The input stream is written unchanged to the failure output stream: plaintext remains
plaintext and ciphertext remains ciphertext.
• A custom error attribute on the flow file is set to the text of the error message, allowing
a downstream processor to take action based on the nature of the error.
Finer-grained error processing may be appropriate for your environment. For example, you may
want to re-write the intentionally simplified NiFi Developer Template code such that only
individual cryptographic operations that fail get routed to the failure output stream, regardless
of how the input data is grouped in the input files.
Configuration Settings for the NiFi Developer Template
There are two classes of configuration settings used by the NiFi Developer Template:
simpleapi.policy.url = https://voltage-pp-0000.dataprotection...
simpleapi.install.path = /opt/voltage/simpleapi
rest.hostname = voltage-pp-0000.dataprotection.voltage.com
#product.name =
#product.version =
return.protected.value.on.access.auth.failure = false
Note that the NiFi Developer Template does not require any custom global
configuration settings beyond what can be processed using the common configuration
infrastructure provided for the Voltage SecureData for Hadoop Developer Templates.
The one distinct difference is that the Java Properties configuration file
vsnifi.properties for NiFi is read from the local file system, not from HDFS as for
the Hadoop Developer Templates. For more information about these settings, see
"Common Configuration" (page 3-57) and "Common Configuration" (page 3-57).
Anytime you make changes to this configuration file, you must always copy it to the
following directory in the NiFi server’s file system and start or restart your NiFi server:
<nifi-install-location>/conf
NOTE: You must update most of the values in the Java Properties configuration file
vsnifi.properties in order to protect data using your own Voltage SecureData
Server. However, if possible, before doing so, Micro Focus Data Security recommends
that you first run the provided sample data through the ready-to-use NiFi workflow, to
confirm that your NiFi workflow is configured correctly and functioning as expected. For
more information, see "Adding the SecureDataProcessor
to a Blank Workflow" (page 8-18).
• The remainder of the configuration settings that the NiFi Developer Template requires
to perform protect and access operations (the equivalent of the other types of settings
in the Hadoop XML configuration files vsconfig.xml and vsauth.xml) are handled
differently in the NiFi Developer Template. Because NiFi provides an extensible
mechanism for defining custom properties for a processor, including an extensible user
interface for setting the custom properties, the NiFi Developer Template makes use of
that mechanism.
These built-in processor properties are discussed in the following section, Configuring
the Properties of the NiFi SecureDataProcessor.
NOTE: By design, the Simple API cannot perform SST operations due to its local
processing.
Regardless of which API the SecureDataProcessor is configured to use, you must always
provide the standard set of cryptographic parameters:
• Format - The name of the centrally configured format to which the plaintext to be
protected and the ciphertext to be accessed will conform.
The NiFi Developer Template allows the optional use of NiFi expression processing for
the Format parameter. For more information, see "Format" (page 8-13).
• Identity - The common name portion of the identity that, for FPE, will be used in the
derivation of the cryptographic key that will be used to protect the incoming plaintext or
access the incoming ciphertext. For SST, the identity must match the identity configured
for the specified SST format.
This set of configuration values can be set in the Properties tab of the Configure Processor
dialog box. To open this dialog box, in the NiFi user interface, right-click in an idle
SecureDataProcessor and choose Configure.
The remainder of this section provides information about each of these configuration values, in
the order they appear in the Properties tab of the Configure Processor dialog box.
Operation
Use the Operation property to specify whether this SecureDataProcessor will perform a
protect or access operation. Click in the Value column for this property, choose either
PROTECT or ACCESS in the Value column drop-down box, and then click Ok.
Format
Use the Format property to specify the format of the data to be processed by this
SecureDataProcessor. Enter the name of a data protection format, either the name of one of
the built-in formats such as cc or auto, or the name of a centrally configured format. Click in
the Value column for this property, type the name of the format, and then click Ok.
The NiFi Developer Template allows the optional use of NiFi expression processing for the
Format parameter. For example, if you wanted to specify the data protection format at the
beginning of each input filename (format name as the portion of the filename before the first
underscore character), you could specify the format as:
${filename:substringBefore('_')}
If you use this expression as the value of the Format parameter, then you would need to
rename the sample data files appropriately, to begin with their format name and an underscore.
For example, rename creditcard.txt to cc_creditcard.txt.
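For example (illustrative commands, run in the directory containing the NiFi sample data files):
mv creditcard.txt cc_creditcard.txt
mv ssn.txt ssn_ssn.txt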
NOTE: For relevant information about the API used to enable expression language
support being deprecated, see API Used to Enable Expression Language Support
has been Deprecated at the end of this list of modifications.
• Modifying the method getFormatInfo to accept the FlowFile object as one of its
parameters, enabling flow file attributes such as filename to be accessible when
attempting to evaluate expressions.
As of Apache NiFi version 1.7.0 (June 2018), the following NiFi core API has been marked
as deprecated:
public PropertyDescriptor.Builder
expressionLanguageSupported(boolean supported)
The current version of the SecureDataProcessor code uses this API to turn on
expression language support for this property (Format).
This deprecated API has been replaced by the following API in newer versions of NiFi:
public PropertyDescriptor.Builder
expressionLanguageSupported(ExpressionLanguageScope
expressionLanguageScope)
In most cases, the use of the deprecated API should not cause any issues in your NiFi
environments, other than possibly displaying deprecation warnings when building the
SecureDataProcessor code. However, it is possible that future releases of NiFi may
completely remove the deprecated API. The possible behaviors, depending on NiFi
versions, are as follows:
• Versions 1.7.0 and higher (all known versions at this time): Deprecation
warning. For example:
The method expressionLanguageSupported(boolean) from the
type PropertyDescriptor.Builder is deprecated.
If you get build errors on possible future versions of NiFi where the currently used,
deprecated API has been completely removed, or if you want to correct any deprecation
warnings as of NiFi 1.7.0, you can edit the Java source file SecureDataProcessor.java
and replace the following line of code:
expressionLanguageSupported(true) // this flag enables expres...
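A likely form of the replacement call, assuming the Format property should evaluate expressions against flow file attributes such as filename (ExpressionLanguageScope.FLOWFILE_ATTRIBUTES is the standard NiFi scope for that), is:
expressionLanguageSupported(ExpressionLanguageScope.FLOWFILE_ATTRIBUTES) // scope-based form introduced in NiFi 1.7.0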
If you do, you will also need to import that new ExpressionLanguageScope class, at the
top of that Java source file, as follows:
import org.apache.nifi.expression.ExpressionLanguageScope;
After making these code changes, rebuild the processor using Maven. You should not get
any build warnings or errors related to this API. For more information about building the
NiFi Developer Template, see "Building the Datastream Developer Templates" (page 2-19).
Identity
Use the Identity property to specify the common name portion of the identity to be used for
this cryptographic operation. Click in the Value column for this property, type the identity, and
then click Ok.
NOTE: When using the Voltage SecureData Server dataprotection, maintained by Micro
Focus Data Security for testing purposes, you can use the demonstration identity
[email protected]. When using your own Voltage SecureData Server, make sure that you use
an identity that matches a configured authorization rule.
Auth Method
Use the Auth Method property to specify the authentication method to be used when
connecting to the Voltage SecureData Server, such as to derive the cryptographic key needed
to complete the operation. Click in the Value column for this property, choose either
SharedSecret or UserPassword in the Value column drop-down box, and then click Ok.
SharedSecret
When you have chosen SharedSecret as the value of the Auth Method property, use the
SharedSecret property to specify the shared secret credential to be used when connecting to
the Voltage SecureData Server. Click in the Value column for this property, type the shared
secret, and then click Ok.
NOTE: In order to help keep the shared secret private, this property is configured such that
the text Sensitive value set will be displayed in the user interface instead of the shared secret
itself.
Username
When you have chosen UserPassword as the value of the Auth Method property, use the
Username property to specify the username credential to be used when connecting to the
Voltage SecureData Server. Click in the Value column for this property, type the username, and
then click Ok.
There is no default value for this property or the associated Password property.
Password
When you have chosen UserPassword as the value of the Auth Method property, use the
Password property to specify the password credential to be used when connecting to the Key
Server. Click in the Value column for this property, type the password, and then click Ok.
NOTE: In order to help keep the password private, this property is configured such that the
text Sensitive value set will be displayed in the user interface instead of the password itself.
There is no default value for this property or the associated Username property.
API Type
Use the API Type property to specify which of the two available Voltage SecureData APIs will
be used to perform the cryptographic operation. Click in the Value column for this property,
choose either simpleapi or rest in the Value column drop-down box, and then click Ok.
• If any of the required processor properties are not set, NiFi will report an error such as
the following:
'Format' is invalid because Format is required
NOTE: NiFi performs this type of custom validation only if more basic required
property validation is successful. In other words, if no value is set for a required
property such as Format or Identity, an error message related to that issue is
displayed. After such basic validation succeeds, custom validation, such as that used
for the authentication credentials, is performed and reported.
Of course, you can always modify the definition of the SecureDataProcessor such
that custom properties are allowed.
You must correct all configuration errors, as described above, before you can start the
SecureDataProcessor. After all configuration errors have been corrected, and the required
relationships have been configured (as described in the following section, SecureDataProcessor
Relationship Configuration), the yellow warning icon for the SecureDataProcessor will
change to a red square. This indicates that the processor is currently stopped and is ready to
be started.
If either relationship is not connected or auto-terminated, NiFi displays a warning icon with an
error message such as the following, and does not allow the processor to be started:
'Relationship failure' is invalid because Relationship 'failure' is
not connected to any component and is not auto-terminated
Sample Data for the NiFi Developer Template
• creditcard.txt - Protect the credit card data in this file using the built-in FPE format
cc.
You can also use the SST format cc-sst-6-4 for this input data, if you also change the
API Type property to rest, indicating the use of the REST API for the SST operations.
• ssn.txt - Protect the social security number data in this file using the built-in FPE
format ssn.
• email.txt - Protect the email address data in this file using the pre-configured FPE
format Alphanumeric.
• date.txt - Protect the date data in this file using the FPE format DATE-ISO-8601,
pre-configured on the dataprotection Voltage SecureData Server hosted by Micro
Focus Data Security.
• name.txt - Protect the name data in this file using the FPE format
AlphaExtendedTest1, pre-configured on the dataprotection Voltage SecureData
Server hosted by Micro Focus Data Security. Note that this file contains characters
beyond the normal ASCII range, which will be protected using a Variable-Length String
(VLS) format configured to support extended character sets using FPE2. Also note that
you must use the REST API (which itself requires version 6.0 or greater of the Voltage
SecureData Server) or version 5.0 or greater of the Simple API to use this type of format.
If you are using your own Voltage SecureData Server to process the sample data in either of the
files date.txt, or name.txt, or to tokenize the data in the file creditcard.txt, you will
need to create the corresponding format(s) in your Voltage SecureData Server, as described in
Appendix A, “Voltage SecureData Server Configuration”.
Adding the SecureDataProcessor to a Blank Workflow
After you have built and deployed the SecureDataProcessor (see "Building the Datastream
Developer Templates" on page 2-19 and steps 1, 2, and 3 in "Exercising the
SecureDataExample Workflow" on page 8-6), you can use it in a workflow. To do so, follow
these steps:
TIP: Search for “SecureData” in the Add Processor dialog box to avoid scrolling to
near the bottom of the list of processors.
NOTE: If you have not done so already, you will also need to provide configuration
settings for your Voltage SecureData Server in the configuration file
vsnifi.properties, as described in "Common Configuration" (page 3-57).
3. Add and configure a processor, such as TailFile or GetFile, to read input data from
a specific input file or directory.
4. Connect the SUCCESS relationship from your chosen input processor to the
SecureDataProcessor you added and configured above.
5. Add and configure a processor, such as PutFile, to write output data to a specific
directory.
NOTE: Remember that defined downstream relationships for all processors must be
auto-terminated or connected to another processor.
The following screenshot shows the SecureDataProcessor configured to receive input from
an upstream GetFile processor and pass successful results to a downstream PutFile
processor.
Start the processors in your new NiFi workflow and then exercise it in the same way as
described for the pre-configured workflow in "Exercising the SecureDataExample Workflow"
(page 8-6).
Limitations and Simplifications of the NiFi Developer Template
Note the following known limitations and simplifications in the NiFi Developer Template
(intentional so as to keep the SecureDataProcessor code focused on its core functionality
of performing cryptographic operations):
9 StreamSets Integration
The StreamSets Developer Template demonstrates how to integrate Voltage SecureData data
protection technology in the context of StreamSets. This demonstration includes the use of the
Simple API (version 4.0 and greater) and the REST API.
StreamSets provides an obvious integration opportunity in the form of its individual processors.
StreamSets processors provide discrete processing steps in a flow of data. The SDProcessor
StreamSets processor provided with the StreamSets Developer Template serves as an example
of a StreamSets processor that can be configured to either protect or access the data flowing
through it, and works in conjunction with the Java packages in the common infrastructure. This
chapter provides a description of the former as well as instructions on how to run the
StreamSets Developer Template in two different ways using the provided sample data. For
more information about the common infrastructure used by the Developer Templates, see
Chapter 3, “Common Infrastructure”.
NOTE: The StreamSets Developer Template comes with its own sample data. It has been
simplified even further for demonstration purposes, with just a single type of data, such as
credit card numbers, Social Security numbers, or email addresses, provided in each input file,
one data value per line. For more information about these sample data files, see "Sample
Data for the StreamSets Developer Template" (page 9-11).
Much of the documentation related to the global configuration settings relevant to the
StreamSets Developer Template is provided in Chapter 3 in the sections "Common
Configuration" (page 3-57) and "Configuration Settings" (page 3-5). These sections provide
information about the common infrastructure Java classes used to read and create in-memory
copies of the settings, as well as a description of the individual settings. This chapter will review
these global configuration settings in the context of the StreamSets Developer Template as
well as provide information about configuring a SDProcessor using the StreamSets user
interface.
• Quick Start Using the Provided StreamSets Pipelines (page 9-2) - This section provides
instructions for building, deploying, and running the StreamSets Developer Template
using the provided pre-configured pipeline(s) with the provided sample data and using
the public-facing Voltage SecureData Server dataprotection hosted by Micro Focus
Data Security.
• Configuration Settings for the StreamSets Developer Template (page 9-10) - This
section reviews the global configuration settings that are relevant to the StreamSets
Developer Template and explains the properties set for individual instances of an
SDProcessor using the StreamSets user interface.
• Sample Data for the StreamSets Developer Template (page 9-11) - This section
provides a description of the simplified sample data provided for the StreamSets
Developer Template.
• Adding the Voltage SDProcessor to a Blank Pipeline (page 9-12) - This section
provides instructions for adding the SDProcessor to a blank pipeline and connecting it
to an appropriate upstream origin or processor and downstream processor or
destination.
• Limitations of the StreamSets Developer Template (page 9-21) - This section provides
information about the type of improvements you will need to make in order to create a
production-grade StreamSets processor that integrates calls to the Voltage SecureData
APIs.
Quick Start Using the Provided StreamSets Pipelines
After you have installed the Simple API (see the Voltage SecureData Simple API Installation
Guide) and installed and built the StreamSets Developer Template (see "Installing the
Developer Templates" on page 2-7 and "Building the Datastream Developer Templates" on
page 2-19), follow these steps to deploy and run the provided StreamSets pipeline(s) and
exercise the SDProcessor using the provided sample data and the public-facing Voltage
SecureData Server dataprotection hosted by Micro Focus Data Security:
Then, follow these steps to deploy and prepare to run one or the other of the provided
StreamSets pipelines using the provided sample data and the public-facing Voltage SecureData
Server dataprotection hosted by Micro Focus Data Security:
Source directory:
<install_dir>/stream/streamsets_processor/target
This step deploys the Voltage SecureData StreamSets processor you built by following
the instructions in "Building the Datastream Developer Templates" (page 2-19).
Copy the entire Voltage SecureData StreamSets processor’s default configuration file
directory (containing the XML configuration files vsconfig.xml and vsauth.xml)
from the following source location to the following target location on your StreamSets
host file system:
Care should be taken to protect the XML configuration files, and especially the
authentication credentials in vsauth.xml, in this target location. The group and
user access control to this directory should be restricted to the sdc user only (under
which the StreamSets service is executed).
Create the following expected input and output directories used by the SDProtect and
SDAccess pipelines on the StreamSets host:
/tmp/voltage/datain
/tmp/voltage/dataout
Make sure that the user that runs the StreamSets service, sdc, has read/write
permission for both of these directories.
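A minimal sketch of creating the directories and granting ownership to the sdc user (the group name sdc is an assumption; adjust it for your installation):
sudo mkdir -p /tmp/voltage/datain /tmp/voltage/dataout
sudo chown -R sdc:sdc /tmp/voltage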
When you are working on the StreamSets host, the StreamSets user interface is usually
launched in a compatible browser from the following URL:
http://<host>:18630
Pipelines: <install_dir>/stream/streamsets_processor/sample/pipelines
Follow these steps to exercise the SDProtect pipeline, provided with the StreamSets Developer
Template:
1. Copy the plaintext sample file plaintext.csv from the following source directory to
the target directory created in Step 4 in "Quick Start Using the Provided StreamSets
Pipelines" (page 9-2) above:
Source directory:
<install_dir>/stream/streamsets_processor/sample/data
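For example (an illustrative command, assuming the default input directory created earlier):
cp <install_dir>/stream/streamsets_processor/sample/data/plaintext.csv /tmp/voltage/datain/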
2. In the StreamSets user interface, in the Create New Pipeline dropdown menu, choose
Import Pipeline. The Import Pipeline dialog box appears:
3. Give the pipeline to be imported the title SDProtect and a description and then browse
for and choose the saved pipeline SDProtect.json in the following directory:
<install_dir>/stream/streamsets_processor/sample/pipelines
The imported pipeline will be shown with InputDir as the origin, connected to Voltage
SDProcessor (configured for a protect operation) as the processor, in turn connected to
OutputDir as the destination.
4. Optionally, validate that the SDProtect sample pipeline is ready to run by clicking the
Preview button in the StreamSets user interface:
In the Preview Configuration dialog box, accept the defaults and click Run Preview.
The first ten records from the file plaintext.csv in the datain directory will be
processed and displayed (but not written to the output directory dataout). When
ready, dismiss the preview.
For more information, see "Sample Data for the StreamSets Developer Template" (page
9-11).
5. Run the SDProtect sample pipeline by clicking the Start button in the StreamSets user
interface:
The SDProtect pipeline will begin running and process the records in the file
plaintext.csv in the datain directory.
6. Examine the results of the SDProtect sample pipeline by looking for the output file in
the output directory, dataout, created in Step 4 in "Quick Start Using the Provided
StreamSets Pipelines" (page 9-2) above.
While the pipeline is running, the output will be in a file named _tmp_ciphertext_0.
After the pipeline is stopped, this file will be renamed to ciphertext_<UniqueID>,
where <UniqueID> is a long unique identifier. The pipeline will probably take a few
minutes to process the 10,000 records in the input file.
The output file will contain the same columns as the input file plaintext.csv, with the
first two columns of each record (the credit card number and the Social Security number)
protected, that is, with the plaintext values replaced by computed ciphertext values, as
specified by the configuration of the processor component (Voltage SDProcessor) of
the SDProtect sample pipeline.
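For example, you can inspect the protected output once the pipeline has processed the file (illustrative commands; the exact output filename varies as described above):
ls /tmp/voltage/dataout
head -3 /tmp/voltage/dataout/ciphertext_*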
Exercise the SDAccess pipeline in the same way, taking the following extra steps and making
the following relevant changes:
• Import the saved pipeline from the file SDAccess.json (instead of SDProtect.json)
and give it the title SDAccess (instead of SDProtect).
By default, these settings are configured to use the public-facing Voltage SecureData Server
dataprotection, hosted by Micro Focus Data Security for demonstration purposes. If you are
using this Voltage SecureData Server during your initial experimentation with the StreamSets
Developer Template, no changes are necessary (except, perhaps, to the install location of the
Simple API). However, you will need to copy these XML configuration files to the directory
expected by the sample pipelines on the StreamSets host (/etc/sdc/vsconfig).
NOTE: After you begin using your own Voltage SecureData Server, you will need to change
the configuration settings in these XML configuration files and copy them (again) to the
directory /etc/sdc/vsconfig or whatever directory your StreamSets processor is
configured to use.
As you adapt the StreamSets Developer Template code for your own purposes, you are, of
course, free to change how this configuration information is provided to your StreamSets
processor(s) when initializing the Voltage SecureData APIs you intend to use.
For more information about the parameters that the Voltage SecureData processor for
StreamSets expects to find in the XML configuration files vsconfig.xml and vsauth.xml
(which do not extend the common configuration properties used by all of the Developer
Templates), see "Common Configuration" (page 3-57).
Integration Architecture of the StreamSets Developer Template
StreamSets provides an obvious integration mechanism in the form of its extensible processor
architecture. The StreamSets Developer Template uses this approach by providing a set of
classes that implement a Voltage SecureData processor for use with StreamSets. When
deployed, the Voltage SecureData processor becomes available for use in the StreamSets user
interface.
When the StreamSets Developer Template is built, a Maven plug-in is used to package the
required JAR files (and potentially other types of resources) into a .tar.gz archive file for
deployment. The archive file sdprocessor-1.0-SNAPSHOT.tar.gz will typically contain the
following JAR files:
httpclient-4.5.10.jar
httpcore-4.4.12.jar
sdprocessor-1.0-SNAPSHOT.jar
simpleapi-1.0.jar
vscryptofactory-1.0.jar
vsrestclient-1.0.jar
vs-stream-common-1.0.jar
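If you want to confirm that a built archive contains the JAR files listed above before deploying it, a standard tar listing will show its contents:
tar -tzf sdprocessor-1.0-SNAPSHOT.tar.gz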
In the StreamSets user interface, the SDProcessor processor includes a SecureData Settings
tab with settings for the operation type (Protect or Access), the directory in which the standard
Developer Templates XML configuration files, vsconfig.xml and vsauth.xml, are located,
and a list of one or more pairs of mappings from fields to be processed to the cryptId containing
the cryptographic parameters that control that processing. The cryptographic parameters
include the data protection format, which can specify either Format-Preserving Encryption
(FPE) or Secure Stateless Tokenization™ (SST).
NOTE: When the cryptId specifies an SST format, it must also specify the REST API (the
Simple API does not support SST operations and, as of version 5.0 of the Hadoop Developer
Templates, the SOAP API is not supported).
The Voltage SecureData processor classes provided with the StreamSets Developer Template
make use of common infrastructure provided with the Developer Templates for retrieving
global file-based configuration information, providing data translation and a cryptographic
abstraction layer, as well as a REST client. The StreamSets Developer Template provides the
following Java package:
The Voltage SecureData processor classes in this package implement the user interface for the
processor as well as the processing that integrates the Voltage SecureData APIs using the
cryptographic abstraction shared by the Developer Templates and the configuration settings in
the XML configuration files vsconfig.xml and vsauth.xml.
The Voltage SecureData processor classes for StreamSets use the classes in this package to
read global configuration settings from the configuration files vsauth.xml and
vsconfig.xml.
As shipped, the Voltage SecureData processor classes for StreamSets do not require any
configuration settings other than the common global configuration settings that are read and
established in-memory by the package com.voltage.securedata.config, as described in
"Common Configuration" (page 3-57). However, whereas the Voltage SecureData for Hadoop
Developer Templates read these configuration files from HDFS, the DataStream Developer
Templates (including the StreamSets Developer Template) read these configuration files from
a local directory.
For more information about this shared Java package, see "Shared Code for the DataStream
Developer Templates" (page 3-63).
/var/log/sdc/sdc.log
The logs contain general informational and debugging messages which will be helpful when
troubleshooting. Consult the log files if you encounter any errors when running the StreamSets
Developer Template.
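For example, to scan the end of the log for recent errors from the command line (using the log path shown above):
tail -n 500 /var/log/sdc/sdc.log | grep -iE "error|exception"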
The StreamSets Developer Template uses the standard XML configuration files defined by, and
used throughout, the Developer Templates: vsconfig.xml and vsauth.xml.
There are two types of configuration settings used by the StreamSets Developer Template:
NOTE: Unlike some of the other Developer Templates, the StreamSets Developer Template
does not use the fieldMappings element in the XML configuration file vsconfig.xml.
Instead, fields in incoming records are mapped directly to the cryptIds defined in this same
configuration file, using the SecureData Settings tab in the StreamSets user interface for
the Voltage SecureData processor.
Also, unlike some of the other Developer Templates, the StreamSets Developer Template never
uses a template-specific configuration file, such as vsstreamsets.xml. A custom
product and version for the clientId element can only be provided in the XML
configuration file vsconfig.xml (when not provided, the default client identifier product is
set to SecureData-Streamset-Processor and no default version is set).
The StreamSets Developer Template does not require any custom configuration settings
beyond what can be processed using the common configuration infrastructure provided for
common use throughout the Developer Templates. For more information about these settings,
see "Common Configuration" (page 3-57), "Configuration Settings" (page 3-5), "XML
Configuration Files" (page 3-32), and the comments in the configuration files themselves.
Before you begin to modify the StreamSets Developer Template’s XML configuration files for
your own purposes, such as using your own Voltage SecureData Server or different data
formats, Micro Focus Data Security recommends that you first run the StreamSets Developer
Template sample pipeline(s) as provided, giving you assurance that your StreamSets
installation is configured correctly and functioning as expected.
You will need to update many of the values in the XML configuration files vsauth.xml and
vsconfig.xml in order to protect your own data using your own Voltage SecureData Server.
Operation Type
The Operation Type setting allows you to choose between protect and access operations for
this instance of the Voltage SecureData processor for StreamSets. By default, the processor’s
operation type is set to PROTECT. You can use the dropdown control to change it to ACCESS
when appropriate.
Config Location
The Config Location setting allows you to specify the directory in which the Voltage
SecureData processor for StreamSets will look for the XML configuration files vsauth.xml and
vsconfig.xml. The default value for this setting is /etc/sdc/vsconfig, but you can
change it as required for your environment.
The StreamSets Developer Template includes two files of related sample data
(plaintext.csv and ciphertext.csv) that you can use with the included sample pipelines
(saved in the files SDProtect.json and SDAccess.json). Each of these sample data files
includes 10,000 records, each with seven comma-separated fields.
The first and second fields in each record represent credit card and Social Security numbers,
respectively, and it is precisely these two fields that are different between the two sample data
files. The file ciphertext.csv is the result of cryptographically processing the first and
second fields in each record in the file plaintext.csv using cryptIds named cc and ssn,
respectively, and according to the cryptographic parameters supplied in the XML configuration
files vsconfig.xml and vsauth.xml, included with the StreamSets Developer Template
(which uses public-facing Voltage SecureData Server dataprotection hosted by Micro
Focus Data Security).
This means that the output file produced by running the sample pipeline SDProtect with the
sample data in plaintext.csv will contain the same data as the provided sample data file
ciphertext.csv. Likewise, the output file produced by running the sample pipeline
SDAccess with the sample data in ciphertext.csv will contain the same data as the
provided sample data file plaintext.csv. For instructions about how to run these sample
pipelines, see "Quick Start Using the Provided StreamSets Pipelines" (page 9-2).
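As a quick sanity check before running the pipelines, you can confirm the shape of the sample data from the command line (run from the directory containing the sample data files):
wc -l plaintext.csv ciphertext.csv              # each file should report 10,000 lines (one record per line)
head -1 plaintext.csv | awk -F',' '{print NF}'  # should print 7 (comma-separated fields)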
This section describes the steps required to create a StreamSets pipeline from scratch that
includes the Voltage SDProcessor. It is divided into a number of logical sub-sections.
1. On the Pipelines page of the StreamSets user interface, click the Create New Pipeline
button:
2. In the New Pipeline dialog box, provide an appropriate Title and Description, and then
click Save.
3. In the StreamSets user interface, in the Configuration -> Error Records tab, in the
Error Records dropdown menu, choose Discard (Library: Basic):
2. On the command line, create the input directory /tmp/voltage/mydatain and set its
ownership to sdc for user and group:
mkdir /tmp/voltage/mydatain
chown sdc:sdc /tmp/voltage/mydatain
3. In the StreamSets user interface, confirm that the origin Directory 1 is selected, and
then in the Configuration -> Files tab, provide the full path to the input directory as the
Files Directory and *.csv as the File Name Pattern:
4. In the Configuration -> Data Format tab, in the Data Format dropdown menu, choose
Delimited:
5. In the Configuration -> Post Processing tab, optionally make a File Post Processing
choice other than None.
NOTE: The default, None, leaves input files in place, but records them as already
processed.
1. Using the Select Processor to connect dropdown menu, choose Voltage SDProcessor:
2. On the command line, copy an appropriate input file to the input directory created
above (/tmp/voltage/mydatain). For example, from the <install_dir>/
stream/streamsets_processor directory:
cp ./sample/data/plaintext.csv /tmp/voltage/mydatain
3. In the StreamSets user interface, confirm that the processor Voltage SDProcessor 1 is
selected, and then in the Configuration -> SecureData Settings tab, leave the
Operation Type and Config Location fields set to their default values (PROTECT and
/etc/sdc/vsconfig, respectively) and then set two fields to be protected by typing
the following values into the indicated fields:
Fields in records are specified using a zero-based field position preceded by the slash
character (/) and cryptIds are specified using their names. In this case, the first field in
each record (/0) is a credit card number, processed using the cryptId named cc, and the
second field in each record (/1) is a Social Security number, processed using the cryptId
named ssn. To display the second row for entering /1 and ssn, click the plus sign (+) to
the right of the current last row.
1. Using the Select Destination to connect dropdown menu, choose Local FS:
2. On the command line, create the output directory /tmp/voltage/mydataout and set
its ownership to sdc for user and group:
mkdir /tmp/voltage/mydataout
chown sdc:sdc /tmp/voltage/mydataout
3. In the StreamSets user interface, confirm that the destination Local FS 1 is selected, and
then in the Configuration -> Output Files tab, provide the full path to the output
directory as the Directory Template:
4. In the Configuration -> Data Format tab, in the Data Format dropdown menu, choose
Delimited:
In the Preview Configuration dialog box, accept the defaults and click Run Preview.
2. The first ten records from the file plaintext.csv in the mydatain directory will be
displayed as they will be output by the Directory 1 stage:
4. Click on the stage Local FS 1 to display the first ten records as input to the Local FS 1
stage:
2. Your new pipeline will begin running and display a variety of statistics about its
processing:
It will probably take several minutes to process the 10,000 records in the sample input
file.
Using the default settings for output file naming, your output file with protected credit
card and Social Security numbers will be named _tmp_sdc-<unique_id> while your
pipeline is running and then be renamed to sdc-<unique_id> after your pipeline is
stopped.
An issue related to StreamSets file security remains unresolved. Specifically, when the
SDProcessor processor either A) attempts to load the Simple API JNI library, which contains
the underlying cryptographic code written in the C programming language, or B) attempts to
load the XML configuration files from the directory /etc/sdc/vsconfig, an
AccessControlException exception can be thrown. For example:
It is important for your StreamSets administrator to set permissions such that the SDProcessor
processor can access these resources.
If necessary, you can work around this issue by disabling the StreamSets security manager by
setting the property SDC_SECURITY_MANAGER_ENABLED to false in one of the following two
files:
File: /opt/streamsets-datacollector/libexec/_sdc
Setting: SDC_SECURITY_MANAGER_ENABLED=false
File: /opt/streamsets-datacollector/libexec/sdc-env.sh
After making this change, restart your StreamSets service (for example, on CentOS: service
sdc restart) and confirm that you see the following warning in the StreamSets log file:
NOTE: If you see the following Security Manager message in this log file, the Security
Manager is not disabled:
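As a quick check, the following commands can confirm where the property is defined and whether Security Manager messages still appear in the log (paths as shown above):
grep -n "SDC_SECURITY_MANAGER_ENABLED" /opt/streamsets-datacollector/libexec/_sdc /opt/streamsets-datacollector/libexec/sdc-env.sh
grep -i "security manager" /var/log/sdc/sdc.log | tail -5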
10 Kafka Connect Integration
The Kafka Connect Developer Template demonstrates how to integrate Voltage SecureData
data protection technology in the context of the Apache Kafka Connect API. This is achieved by
providing a Voltage SecureData implementation of the Kafka Connect Transformation
interface. This custom transformation class works in conjunction with the Java packages in the
common infrastructure to call the underlying Voltage SecureData APIs, as follows:
• Using a custom implementation of the Transformation interface with the built-in source
connector class FileStreamSource, protect the specified data as it is being written from an
external data source to a Kafka topic.
• Using a custom implementation of the Transformation interface with the built-in sink
connector class FileStreamSink, access the specified data as it is being read from a
Kafka topic to an external data source.
Note that due to the design of Kafka Connect, processing in the Kafka Connect Developer
Template occurs one record at a time. Further, because the HoistField built-in transformation
is used to retrieve each record, the entire line in the sample data file is treated as a single
value. So, as delivered and for the sake of simplicity, each line in the sample data files consists of
a single value to be protected. For more information about these sample data files, see "Sample
Data for the Kafka Connect Developer Template" (page 10-11).
For more information about the common infrastructure used by all of the Developer Templates,
including the Kafka Connect Developer Template, see Chapter 3, “Common Infrastructure”.
• The Kafka Connect Developer Template includes several Java Properties configuration
files, located in the bin directory, that are defined for use by Kafka Connect scripts, and
specifically by the script for running Kafka Connect in stand-alone mode (connect-
standalone). These configuration files define characteristics such as the Connector to
use and the transformations it will perform, and the converters to be used to serialize
the data to a standard format before it is written to, or read from, the specified Kafka
topic.
• The custom Kafka Connect transformation classes for protecting and accessing data,
provided with the Kafka Connect Developer Template, rely on the global settings from
the two standard configuration files vsauth.xml and vsconfig.xml for information
required to interact with the underlying Voltage SecureData APIs. The documentation
related to the global configuration settings in these files that are relevant to the Kafka
Connect Developer Template is provided in several places in this document:
• The section "Shared Code for the DataStream Developer Templates" (page 3-
63) provides information about the Java classes used by the DataStream
Developer Templates, including the Kafka Connect Developer Template, to wrap
the common infrastructure Java classes described above.
This chapter will review these global configuration settings in the context of the Kafka Connect
Developer Template.
• Integration Architecture of the Kafka Connect Developer Template (page 10-2) - This
section explains the Java classes that are specific to the Kafka Connect Developer
Template including the classes that implement the Transformation interface.
• Configuration Settings for the Kafka Connect Developer Template (page 10-5) - This
section reviews the global configuration settings that are relevant to the Kafka Connect
Developer Template.
• Sample Data for the Kafka Connect Developer Template (page 10-11) - This section
provides a description of the simplified sample data provided for the Kafka Connect
Developer Template.
• Running the Kafka Connect Developer Template (page 10-12) - This section provides
instructions for running the Kafka Connect Developer Template as provided.
• Limitations of the Kafka Connect Developer Template (page 10-17) - This section
provides information about the type of improvements you will need to make in order to
create production-grade Kafka Connect transformations that integrate calls to the
Simple API and/or the REST API.
Kafka Connect provides an obvious integration mechanism in the form of its extensible
transformation architecture. The Kafka Connect Developer Template uses this approach by
providing Voltage SecureData protect and access transformation classes, implementing the
Kafka Connect interface Transformation. These transformation classes can then be included
in a list of transforms performed by a Kafka Connect source connector or sink connector. As
specified in the Kafka Connect protect and access Java Properties files, the custom protect and
access transformation classes protect plaintext being written into a Kafka topic using a
FileStreamSource connector and access ciphertext being read from a Kafka topic using a
FileStreamSink connector, respectively. They use Format-Preserving Encryption (FPE) or
Secure Stateless Tokenization™ (SST). These classes can be configured with the standard FPE
and SST parameters: format, identity, and authentication credentials using the standard XML
configuration files vsconfig.xml and vsauth.xml. They can also be configured to perform
the protect or access operations using one of two different SecureData data protection APIs:
the Simple API or the REST API (as of version 5.0 of the Hadoop Developer Templates, the
SOAP API is not supported). Note that only the REST API can be used for SST processing.
The Voltage SecureData protect and access transformation classes provided with the Kafka
Connect Developer Template make use of common infrastructure provided with the Developer
Templates for retrieving global file-based configuration information, providing data translation
and a cryptographic abstraction layer, as well as a REST client. The Kafka Connect Developer
Template provides the following Java package:
The Voltage SecureData protect and access classes in this package look up the fields specified
for the Kafka Connect transform to which they have been assigned as its type (a minimal
illustrative sketch of such a transformation follows the class descriptions below). These fields (a
single field in the provided sample workflow) are then looked up in the field mappings for the
kafka-connect component in the XML configuration file vsconfig.xml. These mappings
provide the name of a corresponding cryptId, which in turn provides the various cryptographic
settings (format, identity, authentication credentials, and so on) required to perform the protect
or access operation using the configured underlying Voltage SecureData API.
• BaseField - This abstract class implements the aspects of the Kafka Connect
Transformation interface that can be shared between the ProtectField and
AccessField classes described below. This includes the determination about whether
the data can be processed with or without a schema as well as the initialization of the
CryptoFactory object.
• ProtectField - This abstract class implements the aspects of the Kafka Connect
Transformation interface specific to the protect transformation, including the two
inner concrete classes, Key and Value, and their methods called by code in the base
class BaseField.
• AccessField - This abstract class implements the aspects of the Kafka Connect
Transformation interface specific to the access transformation, including the two
inner concrete classes, Key and Value, and their methods called by code in the base
class BaseField.
The Voltage SecureData protect and access transformation classes use the classes in this
package to read global configuration settings from the configuration files vsauth.xml and
vsconfig.xml.
As shipped, the Voltage SecureData transformation classes for Kafka Connect do not require
any configuration settings other than the common global configuration settings that are read
and established in-memory by the package com.voltage.securedata.config, as
described in "Common Configuration" (page 3-57). However, whereas the Voltage SecureData
for Hadoop Developer Templates read these configuration files from HDFS, the DataStream
Developer Templates (including the Kafka Connect Developer Template) read these
configuration files from a local directory.
For more information about this shared Java package, see "Shared Code for the DataStream
Developer Templates" (page 3-63).
Transformations are specified within Java properties files, the format of which is defined by
Kafka Connect, that are provided to and read by Kafka Connect scripts, such as the Kafka
Connect script for stand-alone execution: connect-standalone.sh. This script expects two
or more Java Properties files as command line parameters, the first one specifying the key/
value pairs required by the Kafka Connect worker(s). The one or more subsequent Java
Properties files specified on the command line each define a connector to be started. The
connector key-value pairs include a name for this connector instance, the associated Java class,
values expected by that Java class, and general Kafka Connect parameters such as the
maximum number of tasks and the relevant topic(s).
For more information about the Kafka Connect configuration information specified using Java
Properties files, see "Configuration Settings for the Kafka Connect Developer Template" (page
10-5).
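For reference, the underlying invocation that the template's wrapper script ultimately performs has this general form (the file names are those described in this chapter; the path to the Kafka bin directory depends on your installation):
<kafka_bin_dir>/connect-standalone.sh connect-standalone-worker.properties \
    connect-file-source-protect.properties connect-file-sink-access.properties connect-file-sink.properties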
NOTE: When using distributed mode, log messages are written using a REST server that
Kafka Connect launches at <rest-server-address>:<port>/logs.
The logs contain general informational and debugging messages which will be helpful when
troubleshooting. Consult the log files if you encounter any errors when running the Kafka
Connect Developer Template.
The Kafka Connect Developer Template uses Java Properties configuration files defined by
Kafka Connect as well as the standard XML configuration files defined by, and used throughout,
the Developer Templates. This section provides information about both of these types of
configuration files.
connect-standalone-worker.properties
The Java Properties file connect-standalone-worker.properties contains key/value
pairs required by the Kafka Connect worker that is started with Kafka Connect in stand-alone
mode. This Java Properties file is the first command line parameter to the Kafka Connect
Developer Template script run-kafka-connect-protect-transform and passed through
as the first command line parameter to the Kafka Connect script connect-standalone.sh.
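A minimal stand-alone worker file typically contains entries like the following. These are standard Kafka Connect settings shown for orientation; the values shipped with the template may differ:
bootstrap.servers=hostname1.domain.com:1234,hostname2.domain.com:1234
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=false
value.converter.schemas.enable=false
offset.storage.file.filename=/tmp/connect.offsets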
connect-file-source-protect.properties
The Java Properties file connect-file-source-protect.properties contains key/value
pairs required by the Kafka source connector that the Kafka Connect Developer Template uses
to protect data as it is written from a file in the local file system to a Kafka topic. It uses the
connector class FileStreamSource with a single task and an input file in the sample
directory named ssn.txt. It writes to the Kafka topic specified as the parameter to the script
create-kafka-topic. It specifies a sequence of three transforms to the data as it is written:
transforms=MakeMap,InsertSource,Protect
1. MakeMap - Using the arbitrary name MakeMap, this transform is implemented by the
built-in transformation HoistField, which will place each line from the specified input
file as a JSON value (HoistField$Value) using the field name ssn
(transforms.MakeMap.field=ssn). For example:
{"ssn" : "675-03-4941"}
NOTE: This transform is not required by the Kafka Connect Developer Template
sample. The first and third transforms, MakeMap and Protect, can operate correctly
with or without it being included.
3. Protect - Using the arbitrary name Protect, this transform is implemented by the
Kafka Connect Developer Template custom transformation ProtectField, which will
perform the specified protect operation(s) on the corresponding value(s). As delivered,
the Protect transform performs a single protect operation on the value in the JSON
field ssn, as shown above.
transforms.Protect.type=com.voltag...nnect.ProtectField$Value
transforms.Protect.fields=ssn
As implied by the plural form fields, the custom Protect transform is designed to
protect more than one field value per record when appropriate. Multiple fields can be
specified by providing a list of comma-separated field names (without whitespace).
NOTE: To keep the Kafka Connect Developer Template as simple as possible, the
FileStreamSource connector is used. This connector can only read a single file
at a time and each row in that file is turned into a string. The first specified
transformation, named MakeMap, uses the built-in transformation class
HoistField. This transformation makes a map, associating that string (as the
value) with the field name ssn. A more complex connector might be able to read
associated values from multiple files, or extract multiple values from a single row
in one file, in order to create a JSON record that contains multiple fields to be
protected by the time it reaches the Protect transform.
At the end of the Protect transform, the JSON record, prior to being written to the
Kafka topic, will look much the same, with the Social Security number protected using
the built-in FPE format ssn (as specified by the field-to-cryptId-to-format mapping in
the XML configuration file vsconfig.xml):
{"ssn" : "783-91-4941", "data_source" : "ssn.txt"}
In the provided sample workflow, as specified in "Steps to Run the Kafka Connect Developer
Template" (page 10-13), this Java Properties file is provided as the second command line
parameter to the Kafka Connect Developer Template script
run-kafka-connect-protect-transform and passed through as the second command
line parameter to the Kafka Connect script connect-standalone.sh.
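Putting the settings described above together, the source connector file has roughly the following shape. The connector name is illustrative, and the type entries for the Protect and InsertSource transforms (whose class names are shown only in abbreviated form in this chapter) are omitted:
name=voltage-file-source-protect
connector.class=FileStreamSource
tasks.max=1
file=../sampledata/ssn.txt
topic=ssn-protect-connect
transforms=MakeMap,InsertSource,Protect
transforms.MakeMap.type=org.apache.kafka.connect.transforms.HoistField$Value
transforms.MakeMap.field=ssn
transforms.Protect.fields=ssn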
connect-file-sink-access.properties
The Java Properties file connect-file-sink-access.properties contains key/value
pairs required by the Kafka sink connector that the Kafka Connect Developer Template uses to
access data as it is read from a Kafka topic to a file in the local file system. It uses the connector
class FileStreamSink with a single task and an output file named ssn-access-sink.txt.
It reads from the Kafka topic specified as the parameter to the script create-kafka-topic. It
specifies a single transform to the data as it is read:
• Access - Using the arbitrary name Access, this transform is implemented by the Kafka
Connect Developer Template custom transformation AccessField, which will perform
the specified access operation(s) on the corresponding value(s). As delivered, the
Access transform performs a single access operation on the value in the JSON field
ssn:
transforms=Access
transforms.Access.type=com.voltag...onnect.AccessField$Value
transforms.Access.fields=ssn
As with the custom Protect transform and as implied by the plural form fields, the
custom Access transform is designed to access more than one field value per record
when appropriate. Multiple fields can be specified by providing a list of comma-
separated field names (without whitespace).
NOTE: As explained in the note for the Protect transformation above, this sample
workflow, as provided, accesses just a single field as each record is read from the
specified Kafka topic.
At the end of the Access transform, the JSON record, prior to being written to the
specified file in the local file system, will look much the same, with the Social Security
number accessed using the built-in FPE format ssn (as specified by the field-to-cryptId-
to-format mapping in the XML configuration file vsconfig.xml). For example:
{"ssn" : "675-03-4941", "data_source" : "ssn.txt"}
In the provided sample workflow, as specified in "Steps to Run the Kafka Connect Developer
Template" (page 10-13), this Java Properties file is provided as the third command line
parameter to the Kafka Connect Developer Template script
run-kafka-connect-protect-transform and passed through as the third command line
parameter to the Kafka Connect script connect-standalone.sh.
connect-file-sink.properties
The Java Properties file connect-file-sink.properties contains key/value pairs
required by the Kafka sink connector that the Kafka Connect Developer Template uses to write
the ciphertext, as is, from a Kafka topic to a file in the local file system. It uses the connector
class FileStreamSink with a single task and an output file named
ssn-protect-sink.txt. It reads from the same Kafka topic specified as the parameter to
the script create-kafka-topic. It does not specify any transforms to the data as it is read,
resulting in the ciphertext JSON record being written to the specified file in the local file system.
For example:
{"ssn" : "783-91-4941", "data_source" : "ssn.txt"}
There are three types of configuration settings used by the Kafka Connect Developer
Template:
<fieldMappings>
<fields component="kafka-connect">
<field name = "email" cryptId = "alpha"/>
<field name = "birth_date" cryptId = "date"/>
<field name = "cc" cryptId = "cc"/>
<field name = "ssn" cryptId = "ssn"/>
</fields>
.
.
.
</fieldMappings>
<fields>
<field name = "email" cryptId = "alpha"/>
<field name = "birth_date" cryptId = "date"/>
<field name = "cc" cryptId = "cc"/>
<field name = "ssn" cryptId = "ssn"/>
</fields>
The Kafka Connect Developer Template does not require any custom configuration settings
beyond what can be processed using the common configuration infrastructure provided for
common use throughout the Developer Templates. For more information about these settings,
see "Common Configuration" (page 3-57), "Configuration Settings" (page 3-5), "XML
Configuration Files" (page 3-32), and the comments in the configuration files themselves.
Before you begin to modify the Kafka Connect Developer Template XML configuration files for
your own purposes, such as using your own Voltage SecureData Server or different data
formats, Micro Focus Data Security recommends that you first run the Kafka Connect
Developer Template sample workflow as provided, giving you assurance that your Kafka
installation is configured correctly and functioning as expected.
You will need to update many of the values in the XML configuration files vsauth.xml and
vsconfig.xml in order to protect your own data using your own Voltage SecureData Server.
The Kafka Connect protect and access connector Java Properties files each contain a
commented-out key/value pair that can be used to locate the XML configuration files in a
directory other than the default, by specifying an alternative relative directory path:
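A hypothetical example of such an entry, using the configPath key referenced in the note below (the relative path value shown here is purely illustrative):
#configPath=../config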
If you choose to uncomment and use this alternate configuration file location key/value pair,
remember that the relative directory path you specify will be relative to the current directory on the
local file system when you run Kafka Connect in stand-alone mode. Typically, this is the
following directory:
<install-dir>/stream/kafka_connect/bin
NOTE: In the event that you choose to run Kafka Connect in distributed mode, you must put
the XML configuration files in the same location on every computer on which your Kafka
Connect tasks will run. This is true regardless of whether you use the default configuration
file location or use the configPath key/value pairs to specify an alternate location.
• creditcard.txt - Protect the credit card data in this file using the SST format cc-sst-
6-4 for this input data. This works because in the XML configuration file
vsconfig.xml, the default API (defaultApi="simpleapi") is overridden for this
format to use the REST API, which is required for the SST operations:
<cryptId name="cc" format="cc-sst-6-4" api="rest" />
You could change the format setting above to cc, specifying the built-in FPE format for
credit card numbers and then remove the api attribute and value to revert to the use of
the Simple API for this cryptId.
• ssn.txt - Protect the social security number data in this file using the built-in FPE
format ssn.
• email.txt - Protect the email address data in this file using the FPE format
AlphaNumeric, pre-configured on the dataprotection Voltage SecureData Server
hosted by Micro Focus Data Security.
• date.txt - Protect the date data in this file using the FPE format DATE-ISO-8601,
pre-configured on the dataprotection Voltage SecureData Server hosted by Micro
Focus Data Security.
If you are using your own Voltage SecureData Server to process the sample data in the files
email.txt or date.txt or to tokenize the data in the file creditcard.txt, you will need to
create the corresponding format(s) in your Voltage SecureData Server, as described in
Appendix A, “Voltage SecureData Server Configuration”.
The Kafka Connect Developer Template includes scripts in the directory <install_dir>/
stream/kafka_connect/bin that you use to run the template’s sample workflow. These
scripts provide commands to create a Kafka topic and run Kafka Connect in stand-alone mode
to protect Social Security numbers as they are written to a Kafka topic and to access them as
they are read from that Kafka topic.
Run-Time Prerequisites
To run the Kafka Connect Developer Template, you will need the following services configured
and running:
• Kafka
• Zookeeper
kafkaBrokerList
This property specifies a list of Kafka Brokers in <host>:<port> format, separated by
commas. For example:
hostname1.domain.com:1234,hostname2.domain.com:1234
The administrator who configured your Kafka installation will be able to provide you with this
host and port information for your Kafka Broker hosts. In some cases, the port number may be
found in the listeners property in the Kafka properties file server.properties.
kafkaServerPropsFile
This property specifies the absolute file path to the Kafka Java Properties file
server.properties.
The administrator who configured your Kafka installation will be able to provide you with the
path to this Java Properties file. Note that in some cases this file may be outside the main Kafka
installation path. Some examples of possible locations for this file include:
• /etc/kafka/<version>/0/server.properties
• /opt/cloudera/parcels/KAFKA/etc/kafka/conf.dist/server.properties
If you are not able to find this file on your cluster, the following command may help you locate it:
locate "*kafka*server.properties"
kafkaBinDir
This property specifies the absolute path to the Kafka bin directory, which contains Kafka
scripts such as kafka-topics.sh and connect-standalone.sh. If the Kafka bin directory
is already included in the system PATH variable, do not provide a value for this property.
Otherwise, specify the full path to the directory kafka/bin within your Kafka installation
location, ending with a slash character (/). Example locations:
• /usr/hdp/<version>/kafka/bin/
• /opt/cloudera/parcels/KAFKA/lib/kafka/bin/
NOTE: When you specify a value for this property, it must end with the slash character (/).
If you are not able to find this path for your cluster, the following command may help you locate it:
locate "kafka-topics.sh"
1. Change directory (cd) to the following directory where the Kafka Connect JAR file was
built:
<install_dir>/stream/kafka_connect/target
2. Copy the Kafka Connect JAR file, vs-kafka-connect-1.0.jar, to the Kafka lib
directory for the Kafka distribution you are using:
• HDP: /usr/hdp/<hdp-version>/kafka/lib
• CDH: /opt/cloudera/parcels/kafka/lib
NOTE: The Kafka Connect Developer Template is only supported for HDP and CDH.
It is not supported for MapR and EMR.
3. Change directory (cd) to the following directory to run the scripts in the subsequent
steps:
<install_dir>/stream/kafka_connect/bin
4. Create a Kafka topic to be used by the Kafka Connect Developer Template scripts by
running the following script, choosing a topic name and specifying it as a parameter. For
example:
./create-kafka-topic ssn-protect-connect
The script create-kafka-topic will edit the relevant Kafka Connect Java Properties
files to contain this topic name so that when Kafka Connect is run in stand-alone mode,
the source and sink connectors will know what Kafka topic to write to and read from,
respectively.
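For example, after the command above the connector Java Properties files will contain the chosen topic name (see also "Script Summary" later in this chapter):
topic=ssn-protect-connect       (in connect-file-source-protect.properties)
topics=ssn-protect-connect      (in connect-file-sink-access.properties and connect-file-sink.properties)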
5. Run the following Kafka Connect Developer Template script, including the four Java
Properties files as parameters (shown on multiple lines to improve readability):
./run-kafka-connect-transform
connect-standalone-worker.properties
connect-file-source-protect.properties
connect-file-sink-access.properties
connect-file-sink.properties
This script will start Kafka Connect in stand-alone mode with a Kafka Connect worker
(using the Java Properties file specified as the first parameter) and three Kafka Connect
connectors (one source connector using the Java Properties file specified as the second
parameter and two sink connectors using the Java Properties files specified as the third
and fourth parameters).
NOTE: The first parameter must be the worker Java Properties file, as shown above.
The order of the remaining three connector Java Properties files is arbitrary; only one of
them is strictly necessary, and more than three could be specified.
The sink connectors above will each produce a single output file:
NOTE: The connectors will keep running until the Connect worker is manually
stopped.
7. Delete the relevant Kafka topic by running the following script, specifying the same topic
name as its parameter:
./delete-kafka-topic ssn-protect-connect
8. Delete the Kafka Connect offsets file used when running Kafka in stand-alone mode:
rm /tmp/connect.offsets
• You can repeat steps 1 through 6, performing step 3 twice, launching fewer connectors
each time. For example, the first time, you could run the script run-kafka-connect-
transform with just the first two parameters, protecting the Social Security numbers as
they are written to the Kafka topic, but producing no output files. Then, when you run
the script run-kafka-connect-transform again, run it with just the first and third
parameters. This would retrieve the protected records from the existing Kafka topic,
access them, and write them to the output file ssn-access-sink.txt.
• Make changes such that you are protecting different sample data. For example, to
protect the credit card sample data instead, make the following changes, exactly as
shown, to these Java Properties configuration files:
File: connect-file-source-protect.properties
file=../sampledata/ssn.txt -> file=../sampledata/cc.txt
transforms.MakeMap.field=ssn -> transforms.MakeMap.field=cc
transforms.Protect.fields=ssn -> transforms.Protect.fields=cc
File: connect-file-sink-access.properties:
file=ssn-access-connect.txt -> file=cc-access-connect.txt
transforms.Access.fields=ssn -> transforms.Access.fields=cc
File: connect-file-sink.properties:
file=ssn-access-connect.txt -> file=cc-access-connect.txt
Then run the Kafka Connect Developer Template sample workflow again (steps 1
through 6), specifying a different topic name as the parameter to the script create-
kafka-topic. For example:
./create-kafka-topic cc-protect-connect
Script Summary
Examine the scripts in the bin directory to see how they call Kafka scripts in the context of the
Kafka Connect Developer Template. This section summarizes these scripts.
create-kafka-topic
This script creates a new Kafka topic with the specified name. It also edits the following three
Kafka Connect Java Properties files to specify this topic name as the value of the relevant key:
• connect-file-source-protect.properties: topic=<topic_name>
• connect-file-sink-access.properties: topics=<topic_name>
• connect-file-sink.properties: topics=<topic_name>
Invocation:
./create-kafka-topic <topic_name>
This script expects a single parameter: the name of the Kafka topic to be created.
delete-kafka-topic
This script deletes the Kafka topic with the specified name.
Invocation:
./delete-kafka-topic <topic_name>
This script expects a single parameter: the name of the Kafka topic to be deleted.
run-kafka-connect-protect-transform
After editing the Kafka Connect worker Java Properties file to include the Kafka broker list
provided in the file vsdistrib.properties (described below), this script calls the Kafka
script connect-standalone.sh to start Kafka Connect in stand-alone mode.
This script expects two or more parameters: the name of a Kafka worker Java Properties file
and the names of one or more connector Java Properties files.
vsdistrib.properties
This configuration file is used by the other scripts in the bin directory (described above) to get
the required Kafka installation settings from a single place in order to avoid redundant editing
in multiple scripts. It defines the following variables:
• kafkaBrokerList
• kafkaServerPropsFile
• kafkaBinDir
For more information, see "Editing the Distribution-Specific Run-Time Settings" (page 10-12).
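An illustrative set of values for this file, using the example locations given earlier in this chapter (your broker list and paths will differ):
kafkaBrokerList=hostname1.domain.com:1234,hostname2.domain.com:1234
kafkaServerPropsFile=/etc/kafka/<version>/0/server.properties
kafkaBinDir=/usr/hdp/<version>/kafka/bin/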
The Kafka Connect Developer Template provides a very basic sample integration, showing
custom Kafka transformation classes that protect and access simple input data. Specifically, two
of the simplifications made in this template are:
This simple input data format is required by the first built-in Kafka transform class
(HoistField) used by the protect transformation with the connector class
FileStreamSource. The HoistField class pulls either key or value data as an entire
file line to use when constructing a JSON record.
Your integration is likely to require more complex processing logic, building JSON
records with multiple related fields and, possibly, protecting or accessing more than one
of those fields. The good news is that the custom Kafka transformation classes
ProtectField and AccessField, provided with the Kafka Connect Developer
Template, are already designed to protect multiple fields in a JSON record being written
to a Kafka topic or access multiple fields in a JSON record being read from a Kafka topic,
respectively. However, to take advantage of this functionality, different transforms than
are demonstrated would need to be used to create the required JSON records. For
example, a JSON record with both Social Security and credit cards numbers, such as the
following:
{"ssn" : "675-03-4941", "cc" : "5225-6290-4183-4450"}
Furthermore, if the data coming into the transformations using the custom Kafka
transformation classes ProtectField and AccessField is not formatted as JSON
records, as shown above, the code in these custom transformation classes will need to
be modified accordingly.
• Simplified Configuration
The Kafka Connect Developer Template reads its configuration and authentication/
authorization settings from XML configuration files in a default location on the local file
system, an approach that may be simpler than your scenario calls for. Your
production integration may require alternative approaches for configuring these
settings, including putting the XML configuration files in a different location on the local
file system. For more information about configuring the location in which the Kafka
Connect protect and access transformation will look for their XML configuration files,
see "Configuration File Location" (page 10-10).
Note that in the event that you choose to run Kafka Connect in distributed mode, you
must put the XML configuration files in the same location on every computer on which
your Kafka Connect tasks will run. This is true regardless of whether you use the default
configuration file location or use the alternate configuration location key/value pairs in
the Kafka Connect protect and access Java Properties files to specify an alternate
location.
With respect to transformations, Kafka Connect inherently does not support batching.
For the Kafka Connect Developer Template, this means that data is protected or
accessed one record (message) at a time. In other words, one line from the sample input
file and/or one line to the output file(s) at a time.
The Kafka Connect Developer Template does not support Voltage SecureData IBSE/
AES protect and access operations.
The Kafka Connect Developer Template does not support Kerberos authentication or
LDAP + Shared Secret authentication/authorization.
The inherent Kafka Connect limitation behind these authentication limitations is that
there is no Kafka Connect API for retrieving the identity of the user who started the
Kafka worker and connectors.
11 Kafka-Storm Integration
The Kafka-Storm Developer Template demonstrates how to integrate Voltage SecureData data
protection technology in the context of Apache Kafka and Apache Storm. This demonstration
includes the use of the Simple API (version 4.0 and greater) and the REST API.
Storm provides an obvious integration opportunity in the form of its bolt technology. Storm
bolts provide discrete processing steps in a flow of data. The Protect Bolt and Access Bolt
provided with the Kafka-Storm Developer Template serve as examples of using Storm bolts to
protect or access, respectively, the data flowing through them. They work in conjunction with
the Java packages in the common infrastructure. This chapter provides a description of these
bolts as well as the overall Storm topology into which the Protect Bolt is integrated. (The Access
Bolt is not demonstrated in the provided Storm topology, but it is nevertheless included in the
Kafka-Storm Developer Template for completeness.)
This simple topology uses an off-the-shelf Storm component, the Kafka Spout, that reads Kafka
records from a particular Kafka topic (in this scenario, a Kafka producer reads input data
from a file, placing each line from the file into the topic as its own record). The Kafka Spout streams data to
be protected to the Protect Bolt, which protects the incoming data using the configured
Voltage SecureData credentials, format, and Voltage SecureData technology. The Protect Bolt,
in turn, streams the protected data to another off-the-shelf Storm component, the HDFS Bolt,
which streams the protected data to a file in the Hadoop Distributed File System (HDFS). The
following illustration shows the basic structure of the Storm topology used by the Kafka-Storm
Developer Template:
NOTE: Both the Kafka Spout and HDFS Bolt are off-the-shelf components provided with
Storm.
This chapter also provides instructions on how to run the Kafka-Storm Developer Template
using the provided sample data.
For more information about the common infrastructure used by all of the Developer Templates,
including the Kafka-Storm Developer Template, see Chapter 3, “Common Infrastructure”.
NOTE: The Kafka-Storm Developer Template comes with its own sample data, distinct from
the sample data provided for the three Hadoop Developer Templates. It has been simplified
even further for demonstration purposes, with just a single type of data, such as credit card
numbers, Social Security numbers, or email addresses, provided in each input file, one data
value per line. For more information about these sample data files, see "Sample Data for the
Kafka-Storm Developer Template" (page 11-10).
The documentation related to the global configuration settings relevant to the Kafka-Storm
Developer Template is provided in Chapter 3, “Common Infrastructure”, in the section
"Common Configuration" (page 3-57). This section provides information about the common
infrastructure Java classes used to read and create in-memory copies of the settings, as well as
a description of the individual settings. This chapter will review these global configuration
settings in the context of the Kafka-Storm Developer Template.
• Configuration Settings for the Kafka-Storm Developer Template (page 11-7) - This
section reviews the global configuration settings that are relevant to the Kafka-Storm
Developer Template.
• Sample Data for the Kafka-Storm Developer Template (page 11-10) - This section
provides a description of the simplified sample data provided for the Kafka-Storm
Developer Template.
• Running the Kafka-Storm Developer Template (page 11-10) - This section provides
instructions for running the Kafka-Storm Developer Template as provided.
Storm provides an obvious integration mechanism in the form of its extensible spout and bolt
architecture. The Kafka-Storm Developer Template uses this approach by providing protect
and access bolts that can be wedged between the Kafka Spout and the HDFS Bolt that are
provided with Storm. The Protect Bolt and Access Bolt protect incoming plaintext or access
incoming ciphertext, respectively, using Format-Preserving Encryption (FPE) or Secure
Stateless Tokenization™ (SST). These bolts can be configured with the standard FPE and SST
parameters: format, identity, and authentication credentials in one of two forms. They can also
be configured to perform the protect or access operations using one of two different
SecureData data protection APIs: the Simple API (version 4.0 and greater) or the REST API
(the latter API can be used for SST processing while the former cannot).
The Protect Bolt and the Access Bolt make use of the common infrastructure provided with
Voltage SecureData for Hadoop for retrieving global file-based configuration information,
providing data translation and a cryptographic abstraction layer, as well as a REST client. The
Kafka-Storm Developer Template provides the following Java packages:
See "Protect and Access Bolts, and Storm Topology" (page 11-3).
Kafka Producer
The following Java package and its associated Java source code provide a class that
implements the Kafka producer that reads input lines from a plaintext data file and streams
each line as a record into a specific Kafka topic.
The class in this package implements a simple Kafka producer that reads values from a
specified file and writes each line in that file as a record to the specified Kafka topic. The names
of the file and topic are provided as command line parameters, along with a list of Kafka
brokers.
The Kafka producer package defines the following class in a .java source file of the same
name:
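As an illustration of the approach such a producer takes, the following minimal, hypothetical sketch uses the standard Kafka client API. The class name, package, and argument order shown here are illustrative and are not the template's actual source:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Hypothetical file-to-topic producer; argument handling is illustrative only.
public class ExampleFileProducer {

    public static void main(String[] args) throws Exception {
        String file = args[0];        // input data file, one value per line
        String topic = args[1];       // target Kafka topic
        String brokers = args[2];     // comma-separated broker list, host:port,...

        Properties props = new Properties();
        props.put("bootstrap.servers", brokers);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
             BufferedReader reader = new BufferedReader(new FileReader(file))) {
            String line;
            while ((line = reader.readLine()) != null) {
                producer.send(new ProducerRecord<>(topic, line));   // each line becomes one record
            }
            producer.flush();
        }
    }
}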
The Kafka-Storm integration uses a Storm bolt to perform data protection processing on one or
more incoming plaintexts. As shipped, it uses an off-the-shelf Kafka spout to read tuples from a
Kafka topic that each contain one item of data to be protected, in a format corresponding to the
configured FPE or SST format. Each tuple is passed to the Voltage SecureData protect bolt,
which protects the data as configured. The protected tuple is then passed to an off-the-shelf
HDFS bolt, which writes the ciphertext or token to its own line in a file in HDFS.
The Storm package defines the following classes in .java source files of the same name:
• AccessBolt - This class implements the Voltage SecureData access bolt, extending
the functionality in its base class BaseBolt. It uses the specified configuration
information to access its ciphertext or token input and sends its output to the
downstream bolt, the HDFS bolt in the case of the Kafka-Storm Developer Template’s
Storm topology.
Note that the access bolt is not included in the Kafka-Storm Developer Template’s
Storm topology as shipped, but is nevertheless included for completeness.
• BaseBolt - This class implements the functionality shared by the protect and access
bolts, serving as their base class. It extends the Storm class BaseRichBolt.
• ProtectBolt - This class implements the Voltage SecureData protect bolt, extending
the functionality in its base class BaseBolt. It uses the specified configuration
information to protect its plaintext input and sends its output to the downstream bolt,
the HDFS bolt in the case of the Kafka-Storm Developer Template’s Storm topology.
• StormTopology - This class implements the Storm topology for the Kafka-Storm
Developer Template. It requires seven command line parameters (and accepts an optional eighth)
that specify the various configurable aspects of the topology. It uses these parameters
to create and set the Kafka Spout, the Voltage SecureData protect bolt, and the HDFS
bolt, and finally, to submit the topology to Storm for execution.
The Voltage SecureData protect and access Storm bolt classes use the classes in this package
to read global configuration settings from the XML configuration files vsauth.xml and
vsconfig.xml.
As shipped, the Voltage SecureData protect and access Storm bolt classes do not require any
configuration settings other than the common global configuration settings that are read and
established in-memory by the package com.voltage.securedata.config, as described in
"Common Configuration" (page 3-57). However, whereas the Voltage SecureData for Hadoop
Developer Templates read these configuration files from HDFS, the DataStream Developer
Templates (including the Kafka-Storm Developer Template) read these configuration files from
a local directory.
For more information about this shared Java package, see "Shared Code for the DataStream
Developer Templates" (page 3-63).
The life cycle of a Storm bolt proceeds as follows:
1. Construction - The bolt is constructed when the topology is built and instantiated, on
the computer from which the topology is submitted to the Storm cluster.
2. Serialization - The bolt instance is serialized and sent to the Storm cluster worker
nodes.
3. Preparation - The prepare method of the bolt instance is called on each worker node,
which initializes it to process tuples.
4. Execution - The execute method of the bolt instance is called on each worker node,
causing it to process each tuple it receives from any upstream spouts or bolts in the
topology, sending its output to the downstream bolt, if any.
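The sketch below illustrates how this life cycle typically shapes a bolt implementation, assuming Storm 1.x package names: state set in the constructor must be serializable because it travels with the bolt instance, while anything that cannot (or should not) be serialized is created in prepare on each worker node. The class and field names are illustrative and are not the template's actual BaseBolt source:

import java.util.HashMap;
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.base.BaseRichBolt;

public abstract class LifeCycleSketchBolt extends BaseRichBolt {

    // Set during construction; serialized with the bolt instance and shipped to the
    // worker nodes, so it must be a serializable type such as HashMap.
    private final HashMap<String, String> settings;

    // Cannot usefully be serialized, so it is transient and assigned in prepare().
    protected transient OutputCollector collector;

    public LifeCycleSketchBolt(HashMap<String, String> settings) {
        this.settings = settings;                   // 1. construction
    }

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;                 // 3. preparation, on each worker node
        // Per-worker initialization (for example, building a crypto helper from the
        // deserialized settings) belongs here.
    }

    // 4. execution: execute(Tuple) is implemented by concrete subclasses.
}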
In the context of this life cycle, SecureData cryptographic processing is integrated into the
Protect and Access Bolts as follows:
1. Construction - During this phase of the bolt life cycle, the Protect Bolt is constructed
with the configuration, authentication, and authorization settings needed to perform the
cryptographic processing. For information about how the Kafka-Storm Developer
Template’s Storm topology reads and uses these settings when constructing an
instance of this bolt, see the method main in the class StormTopology and the
constructor in the class BaseBolt. For information about alternative approaches to
handling the required configuration settings, see "Alternative Approaches to
Configuration" (page 11-8).
2. Serialization - During this phase of the bolt life cycle, the configuration settings
established during construction are serialized along with the bolt instance and sent to
the Storm cluster worker nodes.
3. Preparation - During this phase of the bolt life cycle, the method prepare of the class
BaseBolt initializes the static CryptoFactory instance using the deserialized
configuration settings.
4. Execution - During this phase of the bolt life cycle, the Protect Bolt performs
cryptographic processing on the input tuple. For information about how this is
implemented, see the method execute in the class BaseBolt.
Any errors encountered by the Voltage SecureData protect or access bolts are trapped in the
bolt's execute method, which then acknowledges (ACKs) the tuple and reports the error,
using the following core Storm API calls:
• collector.ack(tuple);
This call acknowledges the tuple so that it isn't retried. This is appropriate for the
Kafka-Storm Developer Template because cryptographic operation failures are very
likely the result of malformed data or incorrect configuration and any retry attempt
would likely just fail in the same way.
• collector.reportError(e);
This call reports the error to the Storm framework so that it is written to the Storm
logs and displayed in the Storm UI.
To see how this error handling is implemented, see the execute method in the BaseBolt
class.
This means that if you publish an invalid data item to the Kafka-Storm Developer Template’s
Kafka topic, such as by interactively using the Kafka console producer, the Voltage SecureData
protect bolt will report the error to the logs and to the Storm UI, and will not send any
corresponding ciphertext or token line to the output file in HDFS.
CAUTION: In general, do not throw an exception from the execute method in this type of
Storm topology. If you do, Storm will consider the bolt instance to have crashed and will re-
create it. This will result in new instances of the downstream HDFS bolt and cause multiple
zero-byte output files to be created in HDFS.
The correct approach, as described above, is to trap and report the error without throwing it
to the caller of the execute method.
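A minimal sketch of this trap-and-report pattern follows, assuming Storm 1.x package names. The class name is illustrative and the protect method is a hypothetical stand-in for the template's actual cryptographic call; see the execute method in BaseBolt for the real implementation:

import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public abstract class TrapAndReportBolt extends BaseRichBolt {

    private transient OutputCollector collector;

    // Hypothetical stand-in for the protect (or access) operation.
    protected abstract String protect(String input) throws Exception;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        try {
            // On success, emit the protected value to the downstream (HDFS) bolt.
            collector.emit(tuple, new Values(protect(tuple.getString(0))));
        } catch (Exception e) {
            // Do not rethrow: report the error so that it reaches the Storm logs and UI.
            collector.reportError(e);
        } finally {
            // ACK the tuple in all cases so that it is not retried.
            collector.ack(tuple);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("protectedValue"));
    }
}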
Configuration Settings for the Kafka-Storm Developer Template
The Kafka-Storm Developer Template uses the standard XML configuration files defined by,
and used throughout, the Developer Templates. This section provides information about these
configuration files.
The Kafka-Storm Developer Template, and the Protect and Access Bolts in particular, use the
same two XML configuration files as all of the Hadoop Developer Templates (other than the
NiFi Developer Template): vsauth.xml and vsconfig.xml. However, whereas some of the
Hadoop Developer Templates read these configuration files from HDFS, the Kafka-Storm
Developer Template reads these configuration files from the local file system.
NOTE: Starting with version 5.0, the Kafka-Storm Developer Template uses XML
configuration files instead of Java Properties configuration files. If the Java Properties
configuration files used in previous versions are present and the newer XML configuration
files are not present, the former will be used.
There are three classes of configuration settings used by the Kafka-Storm Developer Template,
described in the sections that follow.
The Kafka-Storm Developer Template does not require any custom configuration settings
beyond those handled by the common configuration infrastructure provided for use
throughout the Hadoop Developer Templates. For more information about these
settings, see "Common Configuration" (page 3-57), "Configuration Settings" (page 3-5), "XML
Configuration Files" (page 3-32), and the comments in the configuration files themselves.
Before you begin to modify the Kafka-Storm Developer Template XML configuration files for
your own purposes, such as using your own Voltage SecureData Server or different data
formats, Micro Focus Data Security recommends that you first run the Kafka-Storm Developer
Template sample as provided, to confirm that your Kafka and Storm installations are
configured correctly and functioning as expected.
You will need to update many of the values in the XML configuration files vsauth.xml and
vsconfig.xml in order to protect your own data using your own Voltage SecureData Server.
Also, because these configuration settings are read when the Storm topology is first initialized
and submitted, anytime you make changes to either of these configuration files, you must kill
and resubmit the Storm topology for any new settings to be read and used.
CAUTION: This sample approach may not be appropriate in your production Kafka/Storm
integrations. In particular, keep in mind that the Storm topology deployment mechanism
involves serializing the spout and bolt instances and sending them over the network to the
worker nodes in the Storm cluster. These worker nodes deserialize the spout and bolt
instances and use them to execute the operations in the workflow. Depending on the
security settings and isolation of your Storm cluster, it may not be secure to send the
sensitive authentication and authorization credentials over the network in this manner,
especially if this transmission is not protected by a secure cluster configuration.
You can, of course, choose alternative approaches to providing the necessary configuration
information to the spouts and bolts in your production Kafka/Storm integrations, depending on
your specific topology implementation and cluster configuration. Some examples of other
possible approaches for this configuration information are as follows:
• You could copy the relevant configuration files to a local directory on all of the Storm
worker nodes in the cluster using a configuration management tool such as Puppet or
Chef, protecting them using file permission settings. The Protect and Access bolts could
then be re-written to look for these files in a specific local directory on each worker node.
This approach has the advantage that no sensitive authentication and authorization
settings are serialized and sent over the network. Those settings are already waiting for
the Protect Bolt on the worker nodes. On the other hand, the distribution management
tool used to put the configuration files on the worker nodes in advance must itself be
secure. Further, because the Storm workers run as user storm, by default, and not as
the user building and submitting the topology, there could be issues when using file
permission settings to protect these sensitive files: if you limit read access to the
authentication and authorization file, the generic storm user on the worker nodes will
not be able to read it. Note, however, that this limitation may be mitigated by configuring
the Storm workers to run as the user deploying the topology. For information about how
to do this, see the documentation for your Storm installation, and in particular, the
configuration setting supervisor.run.worker.as.user.
See the method loadConfig in the class BaseBolt for an example of how such an
approach may be implemented.
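A hedged sketch of this local-file variant follows; the directory name and XML handling are illustrative assumptions, not the template's actual loadConfig implementation:

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public final class LocalConfigSketch {

    // Hypothetical fixed directory that a tool such as Puppet or Chef would populate on
    // every Storm worker node, protected with file permission settings.
    private static final String CONFIG_DIR = "/etc/voltage/kafka-storm";

    // Intended to be called from prepare() on each worker node, so that no sensitive
    // settings are serialized and sent over the network with the bolt instance.
    public static Document load(String fileName) throws Exception {
        File file = new File(CONFIG_DIR, fileName);   // for example, vsauth.xml or vsconfig.xml
        return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(file);
    }
}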
These approaches are two of many alternative approaches that you can investigate for use in
your scenario. Your approach may be completely different, or perhaps a hybrid of one of the
approaches described here and other possible changes, such as using XML for the
configuration file format.
Sample Data for the Kafka-Storm Developer Template
The Kafka-Storm Developer Template provides the following sample data files:
• creditcard.txt - Protect the credit card data in this file using the built-in FPE format
cc.
You can also use the SST format cc-sst-6-4 for this input data, if you also change the
API Type property to rest, indicating the use of the REST API for the SST operations.
• ssn.txt - Protect the social security number data in this file using the built-in FPE
format ssn.
• email.txt - Protect the email address data in this file using the FPE format
AlphaNumeric, pre-configured on the dataprotection Voltage SecureData Server
hosted by Micro Focus Data Security.
• date.txt - Protect the date data in this file using the FPE format DATE-ISO-8601,
pre-configured on the dataprotection Voltage SecureData Server hosted by Micro
Focus Data Security.
If you are using your own Voltage SecureData Server to process the sample data in the files
email.txt or date.txt or to tokenize the data in the file creditcard.txt, you will need to
create the corresponding format(s) in your Voltage SecureData Server, as described in
Appendix A, “Voltage SecureData Server Configuration”.
Running the Kafka-Storm Developer Template
Run-Time Prerequisites
To run the Kafka-Storm Developer Template, including writing the results of the Storm
topology to HDFS, you will need the following services configured and running on your Hadoop
cluster:
• Kafka
• Storm
• Zookeeper
Editing the Distribution-Specific Run-Time Settings
kafkaBrokerList
This property specifies a list of Kafka Brokers in <host>:<port> format, separated by
commas. For example:
hostname1.domain.com:1234,hostname2.domain.com:1234
The administrator who configured Kafka on your cluster will be able to provide you with this
host and port information for the Kafka Broker hosts in your cluster. In some cases, the port
number may be found in the listeners property in the Kafka properties file
server.properties.
kafkaServerPropsFile
This property specifies the absolute file path to the Kafka properties file server.properties.
The administrator who configured Kafka on your cluster will be able to provide you with the
path to this properties file. Note that in some cases this file may be outside the main Kafka
installation path. Some examples of possible locations for this file include:
• /etc/kafka/<version>/0/server.properties
• /opt/cloudera/parcels/KAFKA/etc/kafka/conf.dist/server.properties
If you are not able to find this file on your cluster, the following command may help you locate it:
locate "*kafka*server.properties"
kafkaBinDir
This property specifies the absolute path to the Kafka bin directory, which contains Kafka
scripts such as kafka-topics.sh. If the Kafka bin directory is already included in the system
PATH variable, do not provide a value for this property. Otherwise, specify the full path to the
directory kafka/bin within your Kafka installation location, ending with a slash character (/).
Example locations:
• /usr/hdp/<version>/kafka/bin/
• /opt/cloudera/parcels/KAFKA/lib/kafka/bin/
NOTE: When you specify a value for this property, it must end with the slash character (/).
If you are not able to find this path for your cluster, the following command may help you locate it:
locate "kafka-topics.sh"
hdfsOutDir
This property specifies the output directory in HDFS into which you want the HDFS Bolt to
create the output files containing the ciphertext results of the Storm topology. Note that the
user submitting the topology must have write permission for this directory and must be able to
grant write permission to other users.
NOTE: The Kafka-Storm Developer Template’s scripts grant write permissions on this
directory to other users, since Storm workers run as user storm by default. Therefore, you
must specify a directory for which it is acceptable, in terms of the security of your HDFS, for
other users to have write permission. Alternatively, you can update the script run-storm-
topology to no longer grant this permission, instead configuring the Storm workers to run
as the user submitting the topology. See the documentation for your Storm installation for
details on the advanced supervisor.run.worker.as.user configuration setting.
For the simplest case, the value for this property can be the user's home directory in HDFS,
which you can specify by replacing the <username> place-holder with the appropriate
username value. Alternatively, you can specify any other directory in HDFS, as long as you have
(and can grant) write permissions on that directory.
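For reference, the following is a hedged sketch of how an HDFS bolt might be pointed at such an output directory, assuming the storm-hdfs HdfsBolt API with Storm 1.x package names; the rotation and sync policies, namenode URL parameter, and class name are illustrative and may differ from the template's actual HDFS bolt setup:

import org.apache.storm.hdfs.bolt.HdfsBolt;
import org.apache.storm.hdfs.bolt.format.DefaultFileNameFormat;
import org.apache.storm.hdfs.bolt.format.DelimitedRecordFormat;
import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy;
import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy.Units;
import org.apache.storm.hdfs.bolt.sync.CountSyncPolicy;

public final class HdfsBoltSketch {
    public static HdfsBolt build(String fsUrl, String hdfsOutDir) {
        return new HdfsBolt()
                .withFsUrl(fsUrl)                                // for example, hdfs://<namenode>:8020
                .withFileNameFormat(new DefaultFileNameFormat()
                        .withPath(hdfsOutDir)                    // the configured hdfsOutDir value
                        .withPrefix("vs-storm-sample-")          // matches the documented filename pattern
                        .withExtension(".txt"))
                .withRecordFormat(new DelimitedRecordFormat())   // one protected value per output line
                .withSyncPolicy(new CountSyncPolicy(10))         // sync to HDFS every 10 tuples
                .withRotationPolicy(new FileSizeRotationPolicy(5.0f, Units.MB));
    }
}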
NOTE: Before performing these steps, confirm that your Hadoop cluster has a home
directory in HDFS for the user account under which you plan to submit the template’s Storm
topology. For more information, see "Creating a Home Directory in HDFS" (page 3-75).
1. If you performed the steps to build the target Storm topology JAR file on a computer
outside your Hadoop cluster, copy the following directories from your build computer to
a local directory on your Hadoop cluster:
5. Change directory (cd) to the following directory to run the scripts in the following steps:
<install_dir>/stream/kafka_storm/bin
6. Confirm that you have correctly edited the script variables in the script configuration file
vsdistrib.properties to specify the distribution-specific settings and output
location. For more information, see "Editing the Distribution-Specific Run-Time Settings"
(page 11-11).
7. Run the script create-kafka-topic to create a new Kafka topic with the name “ssn”:
./create-kafka-topic ssn
8. Run the script run-storm-topology to submit the template’s Storm topology with
the name ssn-topology, specifying the Kafka topic ssn as the input source for the
Kafka Spout, and the cryptId name ssn to identify the cryptographic parameters to use
for the topology’s protect operations:
./run-storm-topology ssn ssn-topology ssn
The parameters to this script are: topic name, topology name, cryptId name, and
optionally, the API type (with the Simple API as the default choice). In this example, only
three parameters are provided and the topic name happens to be the same as the
cryptId name. They can be different.
NOTE: If you are using the deprecated Java Properties configuration files with the
Kafka-Storm Developer Template, specify a data protection format name as the third
parameter instead of a cryptId name.
9. Run the script run-kafka-sample-producer to publish the sample data in the file
ssn.txt to the Kafka topic ssn, as described in "Script Summary" later in this chapter.
10. Check the configured output directory in HDFS (specified by the hdfsOutDir property
in the script configuration file vsdistrib.properties) for the output file produced
by the Storm topology submitted in step 8 (ssn-topology). This topology reads input
records from the Kafka topic ssn, protects the plaintext in those records, and writes the
resulting ciphertext to a file in the specified output directory in HDFS. The form of the
name of the output file is as follows:
vs-storm-sample-<hdfs-bolt-identifier>.txt
11. Optionally, call the provided delete scripts to clean-up the template’s Storm topology, its
Kafka topic, and the output files created by the topology, respectively:
./delete-storm-topology ssn-topology
./delete-kafka-topic ssn
./delete-hdfs-files
The steps for running the template's Storm topology using the interactive approach, in this
case protecting credit card data using the REST API, are as follows:
1. Run the script create-kafka-topic to create a new Kafka topic with the name cc:
./create-kafka-topic cc
2. Run the script run-storm-topology to submit the template’s Storm topology with
the name cc-topology, specifying the Kafka topic cc as the input source for the Kafka
Spout, the cryptId name cc-sst-6-4 to identify the cryptographic parameters to use,
and the REST API for the topology’s protect operations:
./run-storm-topology cc cc-topology cc-sst-6-4 rest
NOTE: If you are using the deprecated Java Properties configuration files with the
Kafka-Storm Developer Template, specify a data protection format name as the third
parameter instead of a cryptId name.
3. Run the script run-kafka-console-producer to start publishing console input to the
Kafka topic cc:
./run-kafka-console-producer cc
Then, as you enter lines of data at the console prompt, one item per line, they will be
published to that topic. For example, try typing in the following sample credit card
numbers, entered separately:
1111-2222-3333-4444
2222-3333-4444-5555
4. Check the configured output directory in HDFS (specified by the hdfsOutDir property
in the script configuration file vsdistrib.properties) for the output file produced
by the Storm topology submitted in step 2 (cc-topology). This topology reads input
records from the Kafka topic cc, protects the plaintext in those records, and writes the
resulting ciphertext to a file in the specified output directory in HDFS. The form of the
name of the output file is as follows:
vs-storm-sample-<hdfs-bolt-identifier>.txt
The output file in HDFS is assigned a unique name by the HDFS Bolt, with a numeric
identifier. When you run the template's Storm topology, check the output directory in HDFS
for the exact name of this file, and then tail it using a command such as the following:
hdfs dfs -tail -f <OutDir>/vs-storm-sample-<UniqueId>.txt
Where <OutDir> is the output directory in HDFS and <UniqueId> is the unique portion of
the output filename assigned by the HDFS Bolt.
Then, in a separate console window, publish new input data to the Kafka topic (either using
the script run-kafka-sample-producer and provided sample data, or interactively using
the script run-kafka-console-producer), and watch as the protected ciphertext is
written to the end of this output file.
Note that it may take several seconds for the new ciphertext lines to be written to this output
file in HDFS, depending on when the HDFS Bolt syncs the output to the file system.
Script Summary
Examine the scripts in the bin directory to see how they call the Kafka and Storm commands in
the context of the Kafka-Storm Developer Template. This section summarizes these scripts.
create-kafka-topic
This script creates a new Kafka topic with the specified name.
Invocation:
./create-kafka-topic <topic_name>
This script expects a single parameter: the name of the Kafka topic to be created.
delete-hdfs-files
This script deletes the output file(s) in HDFS generated by the template’s Storm topology: all
files in the configured HDFS output directory, as specified by the hdfsOutDir property in the
script configuration file vsdistrib.properties, that match the filename pattern
vs-storm-sample-*.txt.
Invocation:
./delete-hdfs-files
delete-kafka-topic
This script deletes the Kafka topic with the specified name.
Invocation:
./delete-kafka-topic <topic_name>
This script expects a single parameter: the name of the Kafka topic to be deleted.
delete-storm-topology
This script terminates the Storm topology with the specified name.
Invocation:
./delete-storm-topology <topology_name>
This script expects a single parameter: the name of the Storm topology to be terminated.
run-kafka-console-producer
This script calls the script kafka-console-producer.sh that is provided in the core Kafka
installation, which reads lines from the console and publishes them to the specified Kafka topic.
Invocation:
./run-kafka-console-producer <topic_name>
This script expects a single parameter: the name of the Kafka topic to which console input will
be published.
run-kafka-sample-producer
This script runs the SampleKafkaProducer class in the Kafka-Storm Developer Template’s
target JAR file, which reads lines from the specified input file and publishes them to the
specified Kafka topic.
Invocation:
./run-kafka-sample-producer <input_file> <topic_name>
This script expects two parameters: 1) the path to the input file, and 2) the Kafka topic to which
lines in that file will be published.
run-storm-topology
This script runs the StormTopology class in the Kafka-Storm Developer Template’s target
JAR file, which submits a new Storm topology with the specified name. This topology reads
records from the specified Kafka topic and protects them using the cryptographic parameters
specified by the named cryptId (and optional API type).
Invocation:
./run-storm-topology <topic_name> <topology_name> <cryptId_name>
<optional_API_type>
This script expects three or four parameters: 1) the name of the Kafka topic from which to read
records, 2) the name under which to run the topology, 3) the name of a cryptId whose
cryptographic parameters are used during protect operations, and optionally 4) the type
of Voltage SecureData API to be used (either simpleapi (the default) or rest).
vsdistrib.properties
This configuration file is used by the other scripts in the bin directory (described above) to get
the required cluster settings from a single place in order to avoid redundant editing in multiple
scripts. It defines the following variables:
• kafkaBrokerList
• kafkaServerPropsFile
• kafkaBinDir
• hdfsOutDir
For more information, see "Editing the Distribution-Specific Run-Time Settings" (page 11-11).
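For reference, a populated vsdistrib.properties might look like the following, assuming simple name=value entries; all values are illustrative and must be replaced with the settings for your own cluster:

kafkaBrokerList=hostname1.domain.com:1234,hostname2.domain.com:1234
kafkaServerPropsFile=/etc/kafka/<version>/0/server.properties
kafkaBinDir=/usr/hdp/<version>/kafka/bin/
hdfsOutDir=/user/<username>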
Simplifications of the Kafka-Storm Developer Template
The Kafka-Storm Developer Template provides a very basic sample integration, showing a
simple Storm bolt that protects simple input data. Specifically, two of the simplifications made in
this template are:
• Simplified Data Processing
The sample Protect and Access Bolts operate on a single data item in each input tuple.
Your requirements are likely to require more complex processing logic, working on
multiple fields in the input tuple. If this is the case, use the sample bolts as a starting
point and customize the code to handle your more advanced scenario.
• Simplified Configuration
The Storm topology provided with the Kafka-Storm Developer Template reads its
configuration and authentication/authorization settings from local XML configuration files,
an approach that may be simpler than your scenario calls for. Your production
integration may require alternative approaches for configuring these settings. For more
information about alternative approaches to configuration for the Kafka-Storm
Developer Template, see "Alternative Approaches to Configuration" (page 11-8).
Also keep in mind that Storm bolts do not perform batch processing, instead processing
individual tuples in their execute method. This is especially important when your scenario calls
for sending multiple input items to the Protect Bolt. In such cases, if the Protect Bolt is
configured to perform remote cryptographic processing using the REST API, the network
overhead of this API will not allow the topology to perform well. For large scale message
processing, it is much more efficient to perform local cryptographic operations using the Simple
API. If you need to perform bulk input processing using the remote REST API, you should
consider using the batch processing technologies built on top of Storm, such as Apache
Trident.
12 Troubleshooting
This chapter describes some common problems that can arise when working with the Voltage
SecureData for Hadoop Developer Templates, and provides tips for solving them. It includes
the following topics:
• Calling the Simple API From More Than One NAR File (page 12-2)
• Simple Queries using Hive UDFs Fail on Some Hadoop Distributions (page 12-3)
• Queries Using Hive UDFs Fail with Literal Values (page 12-4)
• Hive Script Changes Required When Using Hive 3.0 (page 12-4)
• Hive Queries Fail in Kerberized Clusters When kinit Is Not Performed (page 12-4)
• Binary Hive UDFs Fail Due to Data Being Too Large for the REST API (page 12-4)
• Failure to Copy JAR Files to the hive/lib Directory on All Data Nodes (page 12-4)
• Sqoop Steps (Including codegen) Fail with DB Driver Error (page 12-5)
• Sqoop codegen Command Fails with Streaming Result Set Error (page 12-6)
• Sqoop Jobs Fail with ORM Class Method Error (page 12-7)
• Simple API Operations for Dates Fail with VE_ERROR_GENERAL Error (page 12-9)
• Simple API Operations Fail with Library Load Error (page 12-9)
• Simple API Operations Fail with Network Connection Error (page 12-9)
• Developer Templates Cannot Find the Configuration Files in HDFS (page 12-10)
• Remote Queries Using Hive UDFs Fail on BigInsight 4.0 (page 12-10)
• Hadoop Job Error: “Container exited with a non-zero exit code 134” (page 12-11)
• Hadoop Job Not Failing when Invalid Auth Credentials Used for Access (page 12-11)
• Hadoop Job Fails with REST on Older Voltage SecureData Servers (page 12-11)
• Hadoop Job Fails with Specific REST API Error Code and Message (page 12-11)
• Hadoop Job Tasks Fail When Voltage SecureData Server is Overloaded (page 12-12)
Hadoop Build Issues
This section discusses issues that can arise when building the Hadoop Developer Templates.
If a build fails because the glib package is missing, install the Glib* package for your Linux
distribution. For example, on CentOS 6, run the command yum install glib*.
This section discusses issues that can arise when building the NiFi Developer Template.
Calling the Simple API From More Than One NAR File
If your scenario involves building more than one NAR file that calls the Simple API, you must
build those NAR files such that the dependent JAR files, including the Simple API JAR file
vibesimplejava.jar, are not included in those NAR files. Instead, you will take steps to get
those JAR files loaded in a way that allows the NAR files to share them without trying to load
them more than once. For more information, see the relevant build note in "Build Notes" (page
2-25).
Issues Running Hadoop Jobs
Errors that occur when running a Hadoop job can be caused by issues with your Hadoop
implementation, rather than with the Voltage SecureData software. Be sure that you can
successfully run MapReduce, Hive, and Sqoop jobs that do not require the Voltage SecureData
software.
Job errors can also be caused by connection or authentication errors with the Voltage
SecureData Server, or issues with the data protection APIs (the Simple API and the REST API).
If you encounter an error from the Simple API when running the Developer Template code,
verify that you have followed the Simple API installation and verification steps, and that you can
run the sample code provided with the Simple API as the user under which you will be running
the Developer Templates. For example, some errors might be related to file permissions. See
the Voltage SecureData Simple API Developer Guide for details about specific error codes.
• The second (uber) JAR file below depends on the first JAR file, so the first JAR file
must be specified before the second when creating temporary or permanent UDFs:
2. voltage-hadoop.jar (depends on 1)
NOTE: The JAR file voltage-hadoop.jar is built as an uber JAR file that contains
vsrestclient.jar and its JSON and HTTP Client library dependencies.
NOTE: For some Hadoop distributions, it is necessary to copy these two JAR files
(and the configuration JAR file, vsconfig.jar, if used) to the hive/lib directory
on all data nodes in your Hadoop cluster. While not required for every Hadoop
distribution, this action is otherwise harmless and is documented here as current best
practice for the Hadoop Developer Templates on all Hadoop distributions. For more
information, see "Setting Up to Run the Hive Developer Template" (page 5-24).
• Queries run from a computer outside the Hadoop cluster using a JDBC/ODBC call must
use permanent UDFs.
• Queries run from a node within the Hadoop cluster may use either temporary or
permanent UDFs (permanent UDFs are recommended).
Hive Queries Fail in Kerberized Clusters When kinit Is Not Performed
If you encounter this error, run kinit as usual on your kerberized Hadoop cluster.
Binary Hive UDFs Fail Due to Data Being Too Large for the REST API
By default, the Voltage SecureData Server limits the size of Web Service data to 25 MB, which
may not be sufficient for large data, such as images and video, being protected and accessed
using the binary Hive UDFs with the REST API.
For more information about this type of error and its possible solutions, see "Size Restrictions
When Using the REST API" (page 5-8).
Failure to Copy JAR Files to the hive/lib Directory on All Data Nodes
You should always copy the required two (or three) JAR files to the hive/lib directory on all
data nodes in your Hadoop cluster. While this is not required for all of the supported versions of
all of the supported Hadoop distributions, it should be harmless on those for which it is not
required. Given the general ease of file distribution in Hadoop clusters, the simplest approach is
to include this copy step, regardless of Hadoop distribution and version.
If you fail to perform this step for a Hadoop cluster for which this requirement exists, you will
encounter ClassLoader issues that generate error messages like the following:
Caused by: java.lang.RuntimeException: Failed to initialize Simple API
at com.voltage.securedata.crypto.LocalCrypto.initSimpleAPI(LocalCrypto.java:93)
at com.voltage.securedata.crypto.LocalCrypto.<init>(LocalCrypto.java:185)
at com.voltage.securedata.crypto.CryptoFactory.getCrypto(CryptoFactory.java:123)
at com.voltage.securedata.hadoop.hive.BaseHiveUDF.getCrypto(BaseHiveUDF.java:106)
at com.voltage.securedata.hadoop.hive.BaseHiveUDF.evaluate(BaseHiveUDF.java:207)
... 44 more
Caused by: java.lang.UnsatisfiedLinkError: Native Library /opt/voltage/simpleapi/
voltage-simple-api-java-5.0.0-Linux-x86_64-64b-r213914/lib/libvibesimplejava.so
already loaded in another classloader
at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1903)
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1822)
For more information about this distribution-independent Hive preparatory step, see "Setting
Up to Run the Hive Developer Template" (page 5-24).
This error can indicate that you are attempting to run the MapReduce job from a node that is
not allowed to submit MapReduce jobs (some nodes might not be configured as MapReduce
clients). Verify that your node is permitted to submit regular MapReduce jobs before
attempting to run the Developer Templates.
Sqoop Steps (Including codegen) Fail with DB Driver Error
This error indicates that you have not installed the MySQL JDBC driver needed to connect to
your MySQL database server. Consult your Hadoop distribution’s documentation for
information about how to install the required JDBC driver JAR file. For example, in some
distributions, you must copy the JAR file to the sqoop lib directory (such as /usr/lib/
sqoop/lib).
Sqoop codegen Command Fails with Streaming Result Set Error
If you see this error, it concerns Sqoop and older versions of the MySQL connector and is not
related to the Hadoop Developer Templates per se. The root cause is an incompatible
connector JAR file. The issue has been detected in the following Hadoop distributions (and it
may occur in other distributions going forward):
• HDP 2.3
• CDH 5.4
For more information, see the Sqoop bug at the following URL:
https://issues.apache.org/jira/browse/SQOOP-1400
The version of Sqoop that comes with the affected distributions is not compatible with the file
mysql-connector-java-5.1.17.jar. The workaround/solution is to use a newer version
of this connector JAR file, such as mysql-connector-java-5.1.31.jar (or newer).
There are two approaches to working around this problem, described below. Try the simple
work-around first, and if it does not work, try the advanced work-around that has appeared in
previous versions of this document.
Simple Work-Around
In most environments, the simple work-around for this is to explicitly specify the JDBC driver to
use, by adding the argument --driver com.mysql.jdbc.Driver to the sqoop codegen
command:
sqoop codegen \
--username $DATABASE_USERNAME \
-P \
--connect jdbc:mysql://$DATABASE_HOST/$DATABASE_NAME \
--table $TABLE_NAME \
--driver com.mysql.jdbc.Driver \
--class-name com.voltage.sqoop.DataRecord \
--bindir . \
--outdir .
exit $?
Including the --driver argument in the sqoop codegen command instructs Sqoop to use
the most recent connector JAR file for MySQL installed on the machine, which in most cases will
pick up the correct compatible connector.
Advanced Work-Around
If the simple work-around described above did not work as expected, try this more advanced
work-around, which involves downloading the latest MySQL connector from the MySQL Web
site at the following URL (access to which may require an Oracle account):
http://www.mysql.com/downloads/
After you have downloaded a new MySQL connector JAR file, the exact steps for installing it
depend on your specific Hadoop distribution. For example, for the Hortonworks HDP 2.3
distribution, perform the following steps on all Sqoop client nodes in your Hadoop cluster
before attempting the sqoop codegen command again:
3. Point the symbolic link in this directory from the mysql-connector-java.jar to this
new JAR file:
> cd /usr/share/java
> ln -s mysql-connector-java-<version-suffix>.jar \
> mysql-connector-java.jar
Where <version-suffix> is the suffix of the new MySQL connector JAR file you
downloaded. For example, the full filename may be:
mysql-connector-java-5.1.37-bin.jar
Sqoop Jobs Fail with ORM Class Method Error
If you see this error when running the Sqoop job, check the configuration file vsconfig.xml
to make sure the column names in the Sqoop section exactly match the column names in the
database table, including their case. You can also look at the source code of the generated ORM
class file DataRecord.java, to see the exact names of the generated getter methods.
For example, when generating the ORM class from a table in an Oracle database, you have to
specify table and schema names in uppercase when running the codegen command.
Therefore, the field names configured in the configuration file vsconfig.xml also need to be
specified using uppercase:
<field name = "EMAIL" cryptId = "alpha"/>
This is required to match the generated getter method in the ORM class: get_EMAIL. Although
Oracle will interactively accept lowercase names, the Sqoop integration uses the case-sensitive
Java Reflection API and will not treat email the same as EMAIL.
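A small illustration of the case sensitivity involved, assuming the generated ORM class is com.voltage.sqoop.DataRecord with a getter named get_EMAIL as in the example above:

import java.lang.reflect.Method;

public final class GetterLookupSketch {
    public static void main(String[] args) throws Exception {
        Class<?> orm = Class.forName("com.voltage.sqoop.DataRecord"); // generated ORM class

        // Succeeds: the name matches the generated getter exactly, including case.
        Method matches = orm.getMethod("get_EMAIL");

        // Throws NoSuchMethodException: reflection lookups are case-sensitive,
        // so "get_email" does not match "get_EMAIL".
        Method fails = orm.getMethod("get_email");
    }
}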
Simple API:
VE_ERROR_CANNOT_VERIFY_CERT
REST API:
sun.security.validator.ValidatorException: PKIX path building
failed: sun.security.provider.certpath.SunCertPathBuilderException:
unable to find valid certification path to requested target
Either of these errors indicates that the trusted root certificate that signed the TLS certificate
for your Voltage SecureData Server is not trusted by the data protection API in use.
If you are using an untrusted TLS certificate for your Voltage SecureData Server, you must
install the corresponding trusted root certificate into the Simple API trustStore directory on
every data node upon which the Simple API is installed. You must also run the c_rehash
command in each of those trustStore directories. For more information, see the Voltage
SecureData Simple API Installation Guide.
NOTE: The sample code in the Developer Templates uses the same trustStore directory
for the Simple API and the REST API, as explained in "Multiple Developer Template
TrustStores - Background and Usage" (page 3-56). No additional steps are required to
update the JVM truststore used by the REST API.
Simple API Operations for Dates Fail with VE_ERROR_GENERAL Error
If this error occurs, use the class SimpleAPIDateTranslator to convert date formats to and
from the date format expected by the Simple API. For more information, see "Data Translation"
(page 3-67).
For shared secret authentication, the underlying Voltage SecureData cryptographic code on
the client includes a time stamp in the hash-protected internal authentication token, allowing
for expiration after 24 hours. This means that incorrect clock settings on either the client or the
server can cause authentication failures if the Voltage SecureData Key Server detects that the
“issuance date” in the token is too old. When this occurs, the following message can be found in
the Key Server debug logs:
Simple API Operations Fail with Library Load Error
If you see this type of error, make sure that the Simple API is properly installed on all Hadoop
data nodes in your cluster and that the permission settings for the shared library file
libvibesimplejava.so are set correctly to allow the corresponding user to run the job.
This can occur when the permissions for the trustStore directory have been set manually, or
if there are restrictive umask settings on the node. You must reset the permissions for the
trustStore directory on all nodes to be world-readable.
This workaround, which is primarily needed for the HDP 2.2 distribution, assumes the following
location (which is the default location for HDP 2.2) for the core-site.xml file:
/etc/hadoop/conf/core-site.xml
Verify that your version of the file core-site.xml is in the directory /etc/hadoop/conf
before you do a build. If the file is in a different location, you can update the following line of
code in the file HDFSConfigLoader.java with the correct location:
private static final String CORE_SITE_XML =
"/etc/hadoop/conf/core-site.xml";
If you installed the Simple API in the directory /root, you must reinstall it in a different location.
To resolve this, restart HiveServer2 and re-run the Remote Hive query.
Hadoop Job Error: “Container exited with a non-zero exit code 134”
The full error will look something like this:
Exit code: 134
Stack trace: ExitCodeException exitCode=134:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
...
at java.lang.Thread.run(Thread.java:745)
Shell output: main : command provided 1
Container exited with a non-zero exit code 134
If the Hadoop job (for example, MapReduce) fails with this error, confirm that the permissions
on the Developer Templates directory specify that it is accessible by the user running the job. A
typical cause for this error is if the Developer Templates directory was copied from another
location as the root user and is not set for access by other users.
Hadoop Job Not Failing when Invalid Auth Credentials Used for Access
If a Hadoop access (decryption or de-tokenization) job runs to completion without throwing an
exception, even though the authentication/authorization credentials are not correct, check the
value of the following configuration setting in the configuration file vsconfig.xml:
<general returnProtectedValueOnAccessAuthFailure="true_or_false" />
If this value is set to true, then this behavior is expected: authentication/authorization failures
are trapped and ignored during access operations, with the protected values returned instead
as part of the successful completion. If you do not want this behavior, change this configuration
setting to false. Note that false is the default setting.
Hadoop Job Fails with REST on Older Voltage SecureData Servers
If your Voltage SecureData Server has a version earlier than 6.0, it does not support the REST
API. If you want to use the REST API, you must upgrade your Voltage SecureData Server to
version 6.0 or later.
Hadoop Job Fails with Specific REST API Error Code and Message
If the Hadoop job fails with an HTTP status, error code, and message of the following form, the
Voltage SecureData REST API Developer Guide will provide additional details about the error:
httpStatus: <status_number>; code: <error_code>; message: <message_text>
This error usually indicates that the data was not loaded into the database table as UTF8,
causing corruption in the interpreted bytes. To correct this, make sure you include the directive
to specify the character set as UTF8 so that the non-ASCII data in the name column is loaded
correctly:
LOAD DATA LOCAL INFILE '/<your_absolute_path>/plaintext.csv' INTO
TABLE voltage_sample CHARACTER SET UTF8 FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
Hadoop Job Tasks Fail When Voltage SecureData Server is Overloaded
The exact error messages in the log files on the Hadoop client and Voltage SecureData Server
vary, but the general behavior is a failure of one or more tasks to download keys from the Key
Server, even though there are no problems (such as invalid authentication credentials) with the
actual requests themselves.
If you experience this issue, note that you generally need to scale your Voltage SecureData
Server cluster to handle the load coming from your Hadoop cluster. There is no strict formula
for determining exactly how many hosts you need in your Voltage SecureData Server cluster,
but the general recommendation is about one Voltage SecureData Server (each configured
with 1000 server threads) per 1000 MapReduce containers. For example, a Hadoop cluster
that runs 4000 concurrent MapReduce containers would start with approximately four
Voltage SecureData Servers. You may have to try a few different cluster configurations,
starting with this recommendation, to find the one that works best for you.
If you attempt to run the Spark Developer Template against Spark1, the job fails with a stack
trace similar to the following:
at com.voltage.securedata.spark.rdd.SDSparkDriver$.main(SDSparkDriver.scala:60)
at com.voltage.securedata.spark.rdd.SDSparkDriver.main(SDSparkDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl
.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$
.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:730)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Spark2 is significantly different from the original version of Spark (Spark1), including internal
changes and changes to the API. The Spark Developer Template requires Spark2 to execute
successfully. You can check your version of Spark using the following command:
spark-submit --version
A Voltage SecureData Server Configuration
The formats, identities, and authentication credentials that work with the Developer Templates
are specific to the dataprotection Voltage SecureData Server hosted by Micro Focus Data
Security (voltage-pp-0000.dataprotection.voltage.com). If you don’t have access to
this hosted Voltage SecureData Server, you will need to configure your own Voltage
SecureData Server with identical settings in order to run the Developer Templates, as delivered.
This section specifies those settings. For additional assistance, contact your Voltage
SecureData administrator or see the Voltage SecureData Administrator Guide.
Formats
A format specifies the settings used during data protection operations. These settings include
restrictions on input data, the appearance of the protected data, and the type of protection,
either encryption or tokenization. The settings are bundled together as a format that can be
referenced by name. The format name is a required cryptId setting in the XML configuration file
vsconfig.xml. For details, see "Format" (page 3-20).
Use the Management Console to configure the formats you need to protect and access your
data. Navigate to the Data Protection Settings > Format Settings tab on the Management
Console to create these formats. You can use either encryption or tokenization to protect credit
card numbers and social security numbers. You can only use encryption to protect strings,
numbers, and dates.
The following formats, used by the Developer Templates, are already configured on the
dataprotection Voltage SecureData Server:
• Alphanumeric
• SSN
• cc-sst-6-4
• DATE-ISO-8601
• AlphaExtendedTest1 (Required for the REST API only, and available only in
versions 6.0 and later of the Voltage SecureData Server)
The Alphanumeric and SSN formats are pre-configured on all Voltage SecureData Servers, but
if you are not using the dataprotection Voltage SecureData Server, you must configure the
latter three formats above.
Navigate to the Data Protection Settings > Format Settings tab on the Management
Console if you need to create these formats.
cc-sst-6-4 Format
Click the Create Credit Card Format link to open the Add new Credit Card format page.
Configure the new credit card format using the following settings:
DATE-ISO-8601 Format
Click the Create Date Format link to open the Add new Date format page. Configure the new
date format using the following settings:
AlphaExtendedTest1
To create an extended alphabet format for use with the REST API, click the Create Variable-
length String Format link to open the Add new Variable Length String format page.
Configure the new variable-length string format using the following settings:
• Alphabet:
0x41-0x5A,0x61-0x7A,0xC0,0xC2,0xC4,0xC6-0xCB,0xCE-0xCF,0xD4,0xD6,
0xD9,0xDB-0xDC,0xDF,0xE0-0xE2,0xE4,0xE6-0xEB,0xEE-0xEF,0xF4,0xF6,
0xF9,0xFB-0xFC,0x130-0x131,0x11E-0x11F,0x15E-0x15F
Identity
The REST API limits access to data protection and access operations by using an identity to
determine the level of authorization.
NOTE: Unlike the REST API, the Simple API only supports authentication, without different
levels of authorization.
Encryption formats can be used by multiple identities, but each tokenization format is bound to
a single identity. A Voltage SecureData administrator specifies the identity when creating a
tokenization format, and then sets up authorization rules for all identities that are to be used for
authorization. The identity is a required cryptId setting in the configuration file vsconfig.xml.
For more information, see "Authentication and Authorization Overview" (page 3-1) and
"Identity" (page 3-17).
The rules, which are configured using the Web Service > Identity Authorization tab of the
Management Console, control which operations can be performed when using a specific set of
authentication credentials and a matching identity.
• Protect - Permits a user to encrypt or tokenize data. Without permission to Protect, the
user can perform decryption and de-tokenization actions only.
For example, if you need to both tokenize and de-tokenize data, with the ability to see the
plaintext values for the de-tokenized data, the Voltage SecureData administrator must enable
an identity authorization rule that authorizes both Protect and Full Access for the identity
bound to the tokenization format. When you run the Voltage SecureData for Hadoop software,
you must use authentication credentials that meet the criterion for matching this rule.
NOTE: If an identity authorization rule grants masked access, some or all of the plaintext
values in the output will display mask characters instead of the plaintext values. If an identity
authorization rule specifies no access, the plaintext values will not be displayed at all.
Authentication
By default, the Voltage SecureData for Hadoop Developer Templates use the shared secret
authentication method that is already configured on the dataprotection Voltage
SecureData Server. If you are not using that Voltage SecureData Server, you must configure the
shared secret or LDAP (username/password) authentication method in both the Key
Management > Authentication tab and the Web Service > Identity Authorization tab in the
Management Console.
NOTE: The default values in the configuration file vsauth.xml use an identity of
[email protected] and a Shared Secret authentication method with the secret value of
voltage123. If you are using different values on your Voltage SecureData Server, you must
update the configuration file vsauth.xml with those values.
Contact your Voltage SecureData administrator or see the Voltage SecureData Administrator
Guide for details about configuring an identity and authentication method.
Authentication verifies that users running protect and access operations have identified
themselves to the Voltage SecureData Server with valid credentials. Two types of
authentication can be used:
• Shared Secret - An arbitrary string (used like a password) that is shared between the
Voltage SecureData for Hadoop software and the Voltage SecureData Server. A
Voltage SecureData administrator enters this string when creating the authentication
method, and communicates its value to users of client applications.
• LDAP (username/password) - A username and password that the Voltage SecureData
Server validates against an LDAP directory.
The Voltage SecureData Server verifies that the authentication credentials used in the protect
or access operations are valid for at least one of the authentication methods available for that
Voltage SecureData Server. Authentication information must be included in the configuration
file vsauth.xml. For more information, see "Authentication and Authorization Overview" (page
3-1).
You must ensure that at least one authentication method is configured in both the Key
Management > Authentication tab and the Web Service > Identity Authorization tab in the
Management Console.
NOTE: After changing the settings in the Management Console, navigate to the System tab
and click the Deploy button.