Qlik Associative Big Data Index Setup Configuration and Deployment
This new capability provides a governed, performant associative layer which can be deployed within
sources such as Hadoop-based data lakes, without the need to load the data into memory. QABDI
enables fast and engaging data discovery on massive data volumes with full access to all the
details of the underlying data.
This paper provides a technical overview of the steps involved in deploying a QABDI
infrastructure in AWS EKS, indexing a sample dataset and building an on-demand solution in Qlik
Sense.
Scenario
The deployment supports a sample customer use case with a requirement to derive value from a
large volume of data analyzed with Qlik Sense. The customer wants to allow a group of users to
access all the data in a governed environment without impacting the source with SQL queries. A
combination of approaches is used, including deploying QABDI in conjunction with on-demand app
generation (ODAG). The source data used in this paper is a combination of open source travel data
from the Flights, New York Taxi, Chicago Taxi and New York City Bike websites.
All data is in Parquet file format; the expectation is that the data has been “prepared” prior to the
indexing procedure, potentially by Qlik Data Catalyst. The following environment is used to analyze
this data:
• ODAG detail app using QABDI as the source instead of the database
The following high-level process enables the deployment of QABDI in an AWS EKS cluster:
Workstation
The cluster is deployed using a workstation with the following prerequisites in place and the
following software installed on an Ubuntu 18.04 instance. (Note: kubectl currently has an issue with
the Windows deployment terminal, so Linux is recommended.)
• The AWS command line interface (AWS CLI) configured to access the instance
Qlik Sense February 2019 release or above, with the following settings, is also required:
[Settings 7]
EnableBDIClient=1
BDIAsyncRequests=1
BDIStrictSynchronisation=0
The chart recommendation feature, which can generate complex expressions from drag-and-drops
that are not yet supported by QABDI, can be disabled in the C:\Program
Files\Qlik\Sense\CapabilityService\capabilities.json file with:
{
"flag": "DISABLE_AUTO_CHART",
"enabled": true
}
A per-application SET statement is also required to disable the insight advisor:
SET DISABLE_INSIGHTS = 1;
AWS EKS Installation
Instantiation of the EKS cluster can be achieved using the open source utility eksctl:
$ curl -sL "https://github.com/weaveworks/eksctl/releases/download/latest_release/eksctl_$(uname -s)_amd64.tar.gz" | tar xz -C /tmp
$ sudo mv /tmp/eksctl /usr/local/bin
aws-iam-authenticator
A tool that uses AWS IAM credentials to authenticate to a Kubernetes cluster is required;
aws-iam-authenticator provides this.
https://docs.aws.amazon.com/eks/latest/userguide/install-aws-iam-authenticator.html
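A sketch of a typical installation, assuming the binary is downloaded from the link in the AWS documentation above (the download URL itself is not reproduced here):
$ curl -o aws-iam-authenticator <download URL from the AWS documentation above>
$ chmod +x ./aws-iam-authenticator
$ sudo mv ./aws-iam-authenticator /usr/local/bin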
The cluster is initialized with a name, region, tags, nodes and node type. For additional configuration
options see the eksctl documentation:
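As an illustration, a cluster similar to the one used here could be created as follows; the cluster name, tag, node count and node type are illustrative assumptions, while the region matches the eu-west-1 endpoints shown later in this paper:
$ eksctl create cluster \
    --name qabdi-cluster \
    --region eu-west-1 \
    --tags environment=qabdi \
    --nodes 3 \
    --node-type m5.4xlarge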
If eksctl fails, check CloudFormation for error messages. Confirm that aws-iam-authenticator has
worked by checking your current Kubernetes context:
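For example, the current context and the cluster nodes can be checked with:
$ kubectl config current-context
$ kubectl get nodes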
Configuring EFS
The index is created within an EFS file system, which is created as follows:
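A sketch of the creation step using the AWS CLI (the creation token is an illustrative value):
$ aws efs create-file-system --creation-token qabdi-efs --region eu-west-1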
This returns a FileSystemId; the VpcId and SubnetIds are additionally required.
Mount points are created for the EFS storage and connected to the VPC that the EKS cluster is in.
The VpcId, SubnetIds and SecurityGroupIds are required; the VpcId can be found as follows:
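One way to retrieve it is from the cluster description (the cluster name is a placeholder):
$ aws eks describe-cluster --name <cluster-name> \
    --query 'cluster.resourcesVpcConfig.vpcId' --output text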
Sample output:
vpc-0ad865ba8700c957d
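The SubnetIds of the cluster can be retrieved in the same way:
$ aws eks describe-cluster --name <cluster-name> \
    --query 'cluster.resourcesVpcConfig.subnetIds' --output text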
Sample output:
subnet-019b8cc9e2a5c7ef0
subnet-0aba1710d318a9db4
subnet-039dfe928f362c575
Create a mount point for each SubnetId with the VpcId and SecurityGroupIds:
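A sketch of the call for a single subnet, repeated per SubnetId (the IDs are placeholders):
$ aws efs create-mount-target \
    --file-system-id <FileSystemId> \
    --subnet-id <SubnetId> \
    --security-groups <SecurityGroupId> \
    --region eu-west-1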
The efs provisioner allows you to mount EFS storage as PersistentVolumes in Kubernetes.
Install the EFS provisioner (name = aws-efs with storageclass = efs) with Helm by setting
the FileSystemId and region. Be sure Path is set to / or the pod will fail to create.
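A sketch of the install, assuming Helm 2 and the stable/efs-provisioner chart with its stable repository already configured (the FileSystemId is a placeholder; the value keys follow that chart's documentation):
$ helm install stable/efs-provisioner --name aws-efs \
    --set efsProvisioner.efsFileSystemId=<FileSystemId> \
    --set efsProvisioner.awsRegion=eu-west-1 \
    --set efsProvisioner.path=/ \
    --set efsProvisioner.storageClass.name=efs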
For more configuration options consult the official EFS provisioner chart.
Deploying QABDI in the AWS EKS Cluster
Key and Repository setup
Helm packages Kubernetes applications together as a "chart". These charts are tarballs that are
stored externally.
Bintray is used as the chart repository; the Helm repository containing the QABDI charts is added as
follows:
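A sketch of adding the repository (the repository URL is a placeholder; use the URL supplied with your QABDI distribution). The bt_qlik name is referenced in the install step below:
$ helm repo add bt_qlik <QABDI chart repository URL on Bintray>
$ helm repo update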
Deployment of QABDI
Installation of the QABDI chart with the default values is achieved by providing a release-name,
repository (bt_qlik), license acceptance and license key as follows:
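A sketch of the install, assuming Helm 2; the chart name (qabdi) and the licenseKey value name are assumptions, while acceptLicense=true, the bt_qlik repository and the qlik-mn release name are taken from this paper:
$ helm install bt_qlik/qabdi --name qlik-mn \
    --set acceptLicense=true \
    --set licenseKey=<license key>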
The release-name is a string that can be used to differentiate Helm deployments (as you could
theoretically deploy QABDI more than once on a cluster). If one is not provided, Helm will generate
one.
By setting acceptLicense=true you agree to the Qlik User License Agreement (QULA), which is
required to start the qsl_processor_tool and the indexer_tool. You don't need to do it while running
helm install but it is the easiest way. Another way to accept the license agreement is to log in to the
bastion and type export ACCEPT_QULA=true. When running helm install the QULA text will be
printed to the console.
The default values.yaml can be overridden with additional yaml files and, optionally, --set flags:
The number of each QABDI service is configured in additional yaml files. To install the required
configuration, i.e. three indexers, one indexingmanager, one qslexecutor, one qslmanager, two
qslworkers and three symbolservers, change the replicaCounts:
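A minimal sketch of such an override file, assuming the chart exposes a replicaCount per service under keys matching the service names above (the key names are assumptions):
indexer:
  replicaCount: 3
indexingmanager:
  replicaCount: 1
qslexecutor:
  replicaCount: 1
qslmanager:
  replicaCount: 1
qslworker:
  replicaCount: 2
symbolserver:
  replicaCount: 3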
## Image configuration.
image:
repository: qlik-docker-qabdi.bintray.io/bdiproduct
To access the index from Sense via the QSL manager, a LoadBalancer can be used with the default
port 55000, using the provided qsl-manager-loadbalancer.yaml:
## BDI values
qslmanager:
service:
type: LoadBalancer
In addition, a series of yaml files are provided with this paper for varying data volumes.
Checking the pods running per node and their status will show output similar to:
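For example, using a standard kubectl listing:
$ kubectl get pods -o wide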
Retrieve the external IP of the QSL manager, which will be used as the Host when creating a new
QABDI connection in Qlik Sense:
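For example, list the services and read the EXTERNAL-IP of the qslmanager LoadBalancer entry:
$ kubectl get svc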
Note the host for the QABDI connection in the above example will be:
a3b0757b1decb11e88b8f0a1f2ce7daa-397101174.eu-west-1.elb.amazonaws.com
Mount the EFS Drive into the Pods
The deployed pods require a shared mount to access the source parquet files and to act as a
repository for the index output.
Note: the mount command requires root access. To enable this, “privileged” must be set to
“true”, allowing the Docker container root access within the pods:
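A sketch of the kind of override involved; the exact key path depends on the QABDI chart and is an assumption here:
## values override (assumed key names)
securityContext:
  privileged: true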
Mounting the drives is a two-stage process. In this case a shell script (exec_in_all_pods.sh, shared
with this paper) is used to execute the commands in all pods rather than in each one individually.
First the shared folder is created:
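A sketch of the first stage, assuming exec_in_all_pods.sh takes the command to run as its argument and that the shared folder is /home/efs (an assumption based on the log path used later in this paper):
$ ./exec_in_all_pods.sh "sudo mkdir -p /home/efs"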
The EFS drive is mounted with reference to the Filesystem id and region:
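A sketch of the second stage, using the standard NFSv4 mount options for an EFS endpoint (the FileSystemId and mount point are placeholders/assumptions):
$ ./exec_in_all_pods.sh "sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2 <FileSystemId>.efs.eu-west-1.amazonaws.com:/ /home/efs"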
The source parquet files are organized on the EFS storage and the required configuration files are
updated to specify the source, output and associations. All configuration files are stored in
/home/ubuntu/dist/runtime/config/ in the pods.
field_mappings_file.json
This file is required to specify the associations between the files (called A2A in the indexing
process).
{
  "field_mappings": [
    {
      "column1": "Flights.Flight_Year",
      "column2": "Link.link_flight_year"
    },
    {
      "column1": "Taxi_Bike_Trips.pickup_year",
      "column2": "Link.link_pickup_year"
    }
  ]
}
indexing_setting.json
The purpose of this file is to indicate where the source data, output index and association mapping
file are stored, along with the model name (alltrips).
{
  "output_root_folder": "/home/output",
  "symbol_output_folder": "",
  "index_output_folder": "",
  "symbol_positions_output_folder": "",
  "symbol_server_async_threads": 1,
  "create_column_index_threads": 1,
  "dataset_name": "alltrips",
  "source_data_path": "/home/data/alltrips_source",
  "field_mappings_file": "/home/ubuntu/dist/runtime/config/field_mappings_file.json",
  "logging_settings_file": ""
}
"output_root_folder" - output folder for the index files, logs and config files
"dataset_name" - used as the model name in Qlik Sense
"source_data_path" - the location of the NYC sample parquet files
"field_mappings_file" - name of the associations json file.
All the indexing startup scripts and subsequent tasks can be executed from any one of the available
pods; for this deployment the bastion is used, by logging on to it as follows:
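For example, assuming the bastion pod name reported by kubectl get pods:
$ kubectl exec -it <bastion-pod-name> -- /bin/bash
Once on the bastion, the indexing environment is started: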
$ ./dist/runtime/scripts/indexer/start_indexing_env.sh
Once the indexing services have started, the service_manager.sh script can be used to query, start,
or stop indexing:
$ ./home/ubuntu/dist/runtime/scripts/indexer/service_manager.sh
source_data_path: /home/data/alltrips_source
Register ip found in cluster configuration: qlik-bdi-indexingmanager
Running in interactive Mode:
Valid Options:
h) help
1) list
2) stop
3) start
a) stop all services
q) quit
The task_manager.sh script is used to invoke the indexing process via a series of steps:
$ ./dist/runtime/scripts/indexer/task_manager.sh
Step 1 scans the data and creates a series of json files in the output folder
/home/data/alltrips_output/indexing/output/config/indexer containing the data types and references
to the source parquet files:
1) **** scan data for schema generation ****
[19-02-01T08:30:45:208]-[ss_srv-info]-[000504] /home/data/alltrips_source/flights.table/fares.parquet/1_0_9.parquet is a single parquet file
[19-02-01T08:31:33:681]-[ss_srv-info]-[000504] /home/data/alltrips_source/flights.table/fares.parquet/1_1_4.parquet is a single parquet file
[19-02-01T08:32:21:881]-[ss_srv-info]-[000504] Create symbol table for table 'flights' takes 8890 seconds
[19-02-01T08:32:21:881]-[ss_srv-info]-[000504] Apply compaction...
[19-02-01T08:33:27:193]-[ss_srv-info]-[000504] Apply compaction for table 'flights' in dataset 'alltrips' takes 650 seconds
[19-02-01T08:33:27:193]-[ss_srv-info]-[000504] Create symbol table for table 'flights' in dataset 'alltrips'... DONE
Choose Action: l
[19-02-01 08:36:19:457]-[console-info]-[000400] Connect to icd-mn-indexingmanager:55020
[19-02-01 08:36:19:461]-[console-info]-[000400] Symbol Table creation progress at: 100%
[19-02-01 08:36:19:461]-[console-info]-[000400] UnmappedColumn Index creation progress
at: 100%
[19-02-01 08:36:19:462]-[console-info]-[000400] Column Index creation progress at: 100%
[19-02-01 08:36:19:463]-[console-info]-[000400] A2A creation progress at: 100%
QSL services are required to process the selections made from the Sense client and to process any
extractions made in the load script from the index into memory. The QSL services depend
on several other services, including the Indexing Registry, Persistence Manager and all Symbol services.
All executables to start the QSL services are in the runtime/scripts/qsl_processor folder.
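The QSL environment is started from that folder (the same commands appear in the cheat sheet later in this paper):
$ cd /home/ubuntu/dist/runtime/scripts/qsl_processor
$ ./start_qsl_env.sh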
Once the following message appears in the console, the process has started:
• Taxi and Bike Details.qvf - Detail application with modified ODAG script
Qlik Sense February 2019 release and above are the only versions that can use the QABDI
functionality, and the following flags need to be set:
[Settings 7]
EnableBDIClient=1
BDIAsyncRequests=1
BDIStrictSynchronisation=0
The chart recommendation feature, which can generate complex expressions from drag-and-drops
that are not yet supported by QABDI, can be disabled in the capabilities.json file with:
{
"flag": "DISABLE_AUTO_CHART",
"enabled": true
}
A per application set statement is also required to disable the insight advisor:
SET DISABLE_INSIGHTS = 1;
One of the modes of interacting with the index is “live” mode, which effectively allows Qlik Sense to
have a minimal memory footprint by loading only the metadata into memory.
A connection is configured with the following parameters: create a new connection, select BDI from
the list and enter the criteria:
Currently QABDI does not support search or the insight advisor, so the search index and insights are
disabled:
SET CreateSearchIndexOnReload=0;
SET DISABLE_INSIGHTS = 1;
As part of the product a QABDI connector has been developed which allows users to extract data
from the index into an in-memory application, using an autogenerated script through the Data Load
Editor (DLE).
Opening the GUI displays all the available entities in the alltrips model for selection:
Inserting the script from the connector GUI produces the following:
A default limit for the number of rows of data (MaxRowsPerTable) is set to 10,000 and can be
changed for higher volumes. This is controlled by an initial count before extraction, with the
script exiting if the limit is breached:
[rowcount@Flights]:
QSL Select count(*) as nRows0 from [alltrips].[Flights] at STATE $(bdiHandle);
let nFilteredRows = Peek('nRows0', 0, 'rowcount@Flights');
The load scripts for on-demand template apps contain connection and data load commands whose
parameters are set by a special variable, odb_setHandle, that the on-demand app service uses for
linkage. The odb_setHandle variable is used specifically for QABDI linkage and captures all the
selection states from the selection app:
Flights:
QSL SELECT
[ItinID],
[SeqNum],
[Coupons],
[Flight_Year],
[Flight_Quarter],
[Origin],
[OriginCountry],
[OriginState],
[Dest],
[DestCountry],
[DestState],
[TkCarrier],
[Passengers],
[FareClass],
[Distance]
FROM [alltrips].[Flights]
AT STATE $(odb_setHandle);
At run time the $(odb_setHandle) variable expands to a concrete set handle, for example:
AT STATE [alltrips].[h6790741d_5ed5_496c_9a5b_fd47b4c165e2]
Converting existing SQL-generating ODAG apps to use the index as the source replaces the
WHERE_CLAUSE variable creation with a selection-state clause using QSL syntax. For example, we can
create a specific QSL SET statement which will create the “set handle” containing a reference to the
selected columns and the data to filter on.
The QSL syntax will apply the set handle to the underlying model via an AT STATE statement with
the following syntax:
LOAD <column1,column2>;
QSL SELECT <column1,column2>
FROM <modelname.table>
AT STATE hPassedSelection;
The hPassedSelection set handle is dynamically populated by the ODAG process; in the example
below a set handle is created which consists of:
The following changes are applied to an on-demand detail app which executes SQL, in comparison
to the syntax required for QABDI.
The SELECTION_STATE generation is modified to create syntax in the selection-state format and to
cater for multiple selection-state criteria; all instances of WHERE_PART are replaced with
SELECTION_STATE.
The WHERE and IN clauses are replaced with QSL syntax; the main change is substituting the model
name [alltrips], with a “.1” suffix to indicate current selections, into the statement.
To cater for multiple selection states and the required format, some string replacement is needed:
Changing quoting char for SELECTION_STATE in CALL BuildValueList
For each of the bound fields, modification of the quoting options in the CALL BuildValueList
statements is required to cater for the QSL syntax, by changing the quote character ASCII code to 0:
The script loops through all field bindings and calls the modified subroutine
(ExtendSelectionState).
The set handle statement, which references the SELECTION_STATE variable, is applied to the fact
table to filter the data:
And finally, the set handle is applied to the QSL SELECT statement containing the fields, model
name and required table:
FROM [alltrips].[Flights]
AT STATE hPassedSelection;
Troubleshooting and Cheat Sheet
Start the indexing services and check that they are running:
$ cd /home/ubuntu/dist/runtime/scripts/indexer
$ ./start_indexing_env.sh
$ cd /home/ubuntu/dist/runtime/scripts/qsl_processor
$ ./start_qsl_env.sh
$ ./stop_qsl_env.sh
Error Checking
If the source data is not in the location specified in the indexing_setting.json file, or the format is not
as described, an error will be thrown in the indexing console. The running indexer and QSL processes
can be checked with:
$ ps aux|grep qsl
$ ps -ef | grep -E '[i]ndexer_tool|[q]sl_processor_tool'
Note that if the Indexing/QSL processors stop responding, crash or simply disappear, the best
approach is to start by killing all of the running processes. When restarting, start the indexing
cluster, skip the indexing tasks, and then start the QSL processor tool.
Currently you can kill the QSL and Indexer processes in two ways.
Recommended approach:
$ cd /home/ubuntu/dist/runtime/scripts/qsl_processor
$ ./stop_qsl_env.sh
$ cd /home/ubuntu/dist/runtime/scripts/indexer
$ ./service_manager.sh
$ Enter option: a (stop all services)
Kill command: connect to each QSL and Indexing instance, and type:
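A sketch of this approach, using the process names from the ps commands above (pkill usage here is an assumption; connect to each instance as shown earlier and adjust to the processes actually listed):
$ pkill -f qsl_processor_tool
$ pkill -f indexer_tool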
QABDI logs are stored in your indexing output folder. Based on the configuration defined for your
QABDI environment, it should be in the following location: /home/efs/alltrips_output/logs
tail -f Mgr_xxx-qslmanager_55000.qlog
Check the configuration files updated during the indexing process, stored in
output/config/indexer/:
registry_service.json
persistence_manager.json
indexing_manager_service.json
To destroy the release and remove all pods, volumes and associated data:
helm del --purge qlik-mn
About Qlik
Qlik is on a mission to create a data-literate world, where everyone can use data to solve their most
challenging problems. Only Qlik’s end-to-end data management and analytics platform brings together
all of an organization’s data from any source, enabling people at any skill level to use their curiosity to
uncover new insights. Companies use Qlik products to see more deeply into customer behavior,
reinvent business processes, discover new revenue streams, and balance risk and reward. Qlik does
business in more than 100 countries and serves over 48,000 customers around the world.
qlik.com
© 2018 QlikTech International AB. All rights reserved. Qlik®, Qlik Sense®, QlikView®, QlikTech®, Qlik Cloud®, Qlik DataMarket®, Qlik Analytics Platform®, Qlik NPrinting®, Qlik
Connectors®, Qlik GeoAnalytics®, Qlik Core®, Associative Difference®, Lead with Data™, Qlik Data Catalyst™, Qlik Associative Big Data Index™ and the QlikTech logos are trademarks of
QlikTech International AB that have been registered in one or more countries. Other marks and logos mentioned herein are trademarks or registered trademarks of their respective owners.
BIGDATAWP092618_MD