Installation Guide Apache Kylin
Software Requirements
Tests passed on Hortonworks HDP2.4, Cloudera CDH 5.7 and 6.3.2, AWS EMR 5.31 and
6.0, Azure HDInsight 4.0.
We recommend trying out Kylin or developing it using an integrated sandbox, such as the
HDP sandbox, and making sure it has at least 10 GB of memory. When configuring a
sandbox, we recommend using the Bridged Adapter mode instead of the NAT mode.
Hardware Requirements
The minimum configuration of a server running Kylin is a 4-core CPU, 16 GB of RAM and 100
GB of disk. For high-load scenarios, a 24-core CPU, 64 GB of RAM or higher is recommended.
Hadoop Environment
Kylin relies on Hadoop clusters to handle large data sets. You need to prepare a Hadoop
cluster with HDFS, YARN, Hive, Zookeeper and other services for Kylin to run.
Kylin can be launched on any node in a Hadoop cluster. For convenience, you can run
Kylin on the master node. For better stability, it is recommended to deploy Kylin on a clean
Hadoop client node where Hive, HDFS and other command-line clients are installed and the
client configuration files (such as core-site.xml, hive-site.xml and others) are properly
configured and can be automatically synchronized with the other nodes.
The Linux account running Kylin must have access to the Hadoop cluster, including
permission to create and write HDFS folders and Hive tables.
Kylin Installation
Download an Apache Kylin 4.0.0 binary package from the Apache Kylin Download
Site. For example, the following command line can be used:
cd /usr/local/
wget http://mirror.bit.edu.cn/apache/kylin/apache-kylin-4.0.0/apache-kylin-4.0.0-bin.tar.gz
Unzip the tarball and set the environment variable $KYLIN_HOME to the extracted Kylin folder.
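For example, assuming the tarball was downloaded to /usr/local/ as above (the extracted directory name may differ depending on the package):
tar -zxvf apache-kylin-4.0.0-bin.tar.gz
export KYLIN_HOME=/usr/local/apache-kylin-4.0.0-bin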
Then run the following script to download the Spark binary that Kylin needs:
$KYLIN_HOME/bin/download-spark.sh
Kylin 4.0 uses MySQL as the metadata storage; add the following configuration to
kylin.properties:
kylin.metadata.url=kylin_metadata@jdbc,driverClassName=com.mysql.jdbc.Driver,url=jdbc:mysql://localhost:3306/kylin_test,username=,password=
kylin.env.zookeeper-connect-string=ip:2181
You need to change the MySQL user name and password, as well as the database and table
where the metadata is stored, and put the MySQL JDBC connector into $KYLIN_HOME/ext/;
if the directory does not exist, create it.
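For example (the connector file name below is only an illustration; use the JDBC driver version that matches your MySQL server):
mkdir -p $KYLIN_HOME/ext
cp mysql-connector-java-5.1.47.jar $KYLIN_HOME/ext/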
Please refer to Configure MySQL as Metastore to learn about the detailed configuration of MySQL
as a metastore.
For CDH 6.X, EMR 5.X and EMR 6.X Hadoop environments, you need to perform some
additional steps before starting Kylin.
For CDH6.X environment, please check the document: Deploy kylin4.0 on CDH6
For EMR environment, please check the document: Deploy kylin4.0 on EMR
Kylin runs on a Hadoop cluster and has certain requirements for the version, access rights,
and CLASSPATH of each component. To avoid various environmental problems, you can
run the script $KYLIN_HOME/bin/check-env.sh to test your environment. If there are any
problems, the script will print a detailed error message; if there is no error message, your
environment is suitable for Kylin to run.
Start Kylin
Run the script $KYLIN_HOME/bin/kylin.sh start to start Kylin.
Using Kylin
Stop Kylin
Run the script $KYLIN_HOME/bin/kylin.sh stop to stop Kylin.
You can run ps -ef | grep kylin to see if the Kylin process has stopped.
After implementing support for building and querying in Spark Standalone mode, we
tried deploying Kylin 4.0 without Hadoop on an AWS EC2 instance, and successfully
built a cube and ran queries against it.
Environment preparation
The component versions listed here are the ones we selected during the test. If you need to
use other versions for deployment, you can substitute them yourself; just make sure the
component versions are compatible with each other.
JDK 1.8
Hive 2.3.9
Zookeeper 3.4.13
Kylin 4.0 for spark3
Spark 3.1.1
Hadoop 3.2.0 (no startup required)
Deployment process
Modify profile
vim /etc/profile
# Add the following at the end of the profile file
export JAVA_HOME=/usr/local/java/jdk1.8.0_291
export JRE_HOME=${JAVA_HOME}/jre
export HADOOP_HOME=/etc/hadoop/hadoop-3.2.0
export HIVE_HOME=/etc/hadoop/hive
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=$HIVE_HOME/bin:$HIVE_HOME/conf:${HADOOP_HOME}/bin:${JAVA_HOME}/bin:$PATH
# Execute after saving the contents of the above file
source /etc/profile
Install JDK 1.8
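A sketch, assuming a JDK 8u291 tarball has already been downloaded (the file name is illustrative); this matches the JAVA_HOME path set in /etc/profile above:
# extract the JDK under /usr/local/java, matching JAVA_HOME=/usr/local/java/jdk1.8.0_291
mkdir -p /usr/local/java
tar -zxvf jdk-8u291-linux-x64.tar.gz -C /usr/local/java/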
Copy the jar packages required for S3 access to the Hadoop classpath; otherwise,
ClassNotFound errors may occur:
cd /etc/hadoop
cp hadoop-3.2.0/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.375.jar hadoop-3.2.0/share/hadoop/common/lib/
cp hadoop-3.2.0/share/hadoop/tools/lib/hadoop-aws-3.2.0.jar hadoop-3.2.0/share/hadoop/common/lib/
Note: Please configure VPC and security group correctly to ensure that EC2
instances can access the database normally.
You may encounter the following error:
java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V
This is caused by the inconsistency between the guava version in Hive 2 and the
guava version in Hadoop 3. Please replace the guava jar in the directory
$HIVE_HOME/lib with the guava jar in the directory
$HADOOP_HOME/share/hadoop/common/lib/.
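A sketch of the replacement; the guava version numbers below are assumptions and depend on your Hive and Hadoop distributions:
# remove the guava jar shipped with Hive (version is an assumption)
rm $HIVE_HOME/lib/guava-14.0.1.jar
# copy the guava jar shipped with Hadoop (version is an assumption)
cp $HADOOP_HOME/share/hadoop/common/lib/guava-27.0-jre.jar $HIVE_HOME/lib/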
To prevent jar conflicts in the subsequent steps, you need to remove
some Spark- and Scala-related jars from Hive's classpath:
mkdir $HIVE_HOME/spark_jar
mv $HIVE_HOME/lib/spark-* $HIVE_HOME/spark_jar
mv $HIVE_HOME/lib/jackson-module-scala_2.11-2.6.5.jar $HIVE_HOME/spark_jar
Note: only the conflicting jars encountered during our test are listed here. If you
run into similar jar conflicts, you can determine which jars conflict from the
classpath and remove the relevant ones. When the same jar exists in conflicting
versions, it is recommended to keep the version on the Spark classpath.
Prepare the ZooKeeper configuration files. Since only one EC2 node is used in the
test, a ZooKeeper pseudo-cluster is deployed here.
cp /etc/hadoop/zookeeper/conf/zoo_sample.cfg /etc/hadoop/zookeeper/conf/zoo1.cfg
cp /etc/hadoop/zookeeper/conf/zoo_sample.cfg /etc/hadoop/zookeeper/conf/zoo2.cfg
cp /etc/hadoop/zookeeper/conf/zoo_sample.cfg /etc/hadoop/zookeeper/conf/zoo3.cfg
Modify the above three configuration files in turn and add the following contents;
note that dataDir and dataLogDir (and, for a pseudo-cluster on one host, clientPort) must
differ for each instance:
server.1=localhost:2287:3387
server.2=localhost:2288:3388
server.3=localhost:2289:3389
dataDir=/tmp/zookeeper/zk1/data
dataLogDir=/tmp/zookeeper/zk1/log
clientPort=2181
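In addition, each instance needs a myid file in its data directory before the three instances are started; a sketch, assuming the standard zkServer.sh script and the directories above:
# create the data directories and myid files for the three pseudo-cluster instances
mkdir -p /tmp/zookeeper/zk1/data /tmp/zookeeper/zk2/data /tmp/zookeeper/zk3/data
echo 1 > /tmp/zookeeper/zk1/data/myid
echo 2 > /tmp/zookeeper/zk2/data/myid
echo 3 > /tmp/zookeeper/zk3/data/myid
# start each instance with its own configuration file (resolved relative to the conf directory)
/etc/hadoop/zookeeper/bin/zkServer.sh start zoo1.cfg
/etc/hadoop/zookeeper/bin/zkServer.sh start zoo2.cfg
/etc/hadoop/zookeeper/bin/zkServer.sh start zoo3.cfg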
Kylin may encounter ClassNotFound errors during startup. Refer to the following steps to
fix the problem, then restart Kylin:
# Download commons-collections-3.2.2.jar
cp commons-collections-3.2.2.jar $KYLIN_HOME/tomcat/webapps/kylin/WEB-INF/lib/
# Download commons-configuration-1.3.jar
cp commons-configuration-1.3.jar $KYLIN_HOME/tomcat/webapps/kylin/WEB-INF/lib/
cp $HADOOP_HOME/share/hadoop/common/lib/aws-java-sdk-bundle-1.11.563.jar $KYLIN_HOME/tomcat/webapps/kylin/WEB-INF/lib/
cp $HADOOP_HOME/share/hadoop/common/lib/hadoop-aws-3.2.2.jar $KYLIN_HOME/tomcat/webapps/kylin/WEB-INF/lib/
If you need to cluster multiple Kylin nodes, make sure they use the same Hadoop cluster.
Then do the following steps in each node's configuration file
$KYLIN_HOME/conf/kylin.properties:
1. Configure the same kylin.metadata.url value so that all Kylin nodes use
the same MySQL metastore.
2. Configure the Kylin node list kylin.server.cluster-servers, including all
nodes (the current node included). When a metadata change event occurs, the node
receiving the change needs to notify all the other nodes.
3. Configure the running mode kylin.server.mode of the Kylin node. Optional
values include all, job and query; the default value is all.
job mode means the node is only used for job scheduling, not for queries; query mode
means the node is only used for queries, not for job scheduling; all mode means the node
handles both job scheduling and queries. A sketch of a query-only node's configuration is
shown after this list.
Note: by default, only one instance can be used for job scheduling (i.e.,
kylin.server.mode set to all or job).
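For illustration, a query-only node's kylin.properties might contain the following (the hostnames, database name and credentials are placeholders):
kylin.metadata.url=kylin_metadata@jdbc,url=jdbc:mysql://metastore-host:3306/kylin,username=kylin,password=your_password,driverClassName=com.mysql.jdbc.Driver
kylin.server.cluster-servers=kylin-node1:7070,kylin-node2:7070,kylin-node3:7070
kylin.server.mode=query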
Since v2.0, Kylin supports multiple job engines running together, which is more extensible,
available and reliable than the default job scheduler.
To enable the distributed job scheduler, you need to set or update the following two
configuration options in kylin.properties:
kylin.job.scheduler.default=2
kylin.job.lock=org.apache.kylin.job.lock.zookeeper.ZookeeperJobLock
Then please add all job servers and query servers to the kylin.server.cluster-servers.
Use CuratorScheduler
Since v3.0.0-alpha, Kylin provides a Leader/Follower multi-job-engine scheduler based on
Curator. You can set the following configuration to enable CuratorScheduler:
kylin.job.scheduler.default=100
kylin.server.self-discovery-enabled=true
For more details about the kylin job scheduler, please refer to Apache Kylin Wiki.
To send query requests to a cluster instead of a single node, you can deploy a load balancer
such as Nginx, F5 or cloudlb, so that clients communicate with the load balancer instead of
a specific Kylin instance.
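For illustration only, an Nginx configuration that forwards requests to two Kylin query nodes might look like the following (the upstream hostnames and ports are assumptions):
upstream kylin_servers {
    server kylin-query-1:7070;
    server kylin-query-2:7070;
}
server {
    listen 80;
    location /kylin {
        proxy_pass http://kylin_servers;
    }
}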
There are some differences between the read/write separation deployment of Kylin 4 and
Kylin 3. Please refer to: Read Write Separation Deployment for Kylin 4
JDK 1.8
Hadoop 2.8.5
Hive 1.2.1
Spark 2.4.7
Kafka 1.1.1
MySQL 5.1.73
Zookeeper 3.4.6
We have pushed the Kylin image to Docker Hub. You do not need to build the image
locally; just execute the following command to pull the image from Docker Hub:
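docker pull apachekylin/apache-kylin-standalone:4.0.0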
After the pull is successful, execute the following command to start the container:
docker run -d \
-m 8G \
-p 7070:7070 \
-p 8088:8088 \
-p 50070:50070 \
-p 8032:8032 \
-p 8042:8042 \
-p 2181:2181 \
apachekylin/apache-kylin-standalone:4.0.0
The following services are automatically started when the container starts:
NameNode, DataNode
ResourceManager, NodeManager
Kylin
After the container is started, we can enter the container through the docker exec -it
<container_id> bash command. Since we have mapped the relevant container ports to local
ports, we can also open the pages of each service directly in the local browser, for example
the Kylin web UI at http://127.0.0.1:7070/kylin, the HDFS NameNode UI at
http://127.0.0.1:50070, and the YARN ResourceManager UI at http://127.0.0.1:8088.
To allow Kylin to build the cube smoothly, the memory configured for the YARN
NodeManager is 6 GB. Together with the memory occupied by the other services, please
ensure that the container has at least 8 GB of memory, so as to avoid errors caused by
insufficient memory.
For the resource setting method for the container, please refer to:
For how to customize the image, please check the github page kylin/docker.
Advanced Settings
Overwrite default kylin.properties at Cube level
Since Kylin 2.0, Kylin supports multiple job engines running together, which is more
extensible, available and reliable than the default job scheduler.
To enable the distributed job scheduler, you need to set or update the configs in the
kylin.properties:
kylin.job.scheduler.default=2
kylin.job.lock=org.apache.kylin.storage.hbase.util.ZookeeperJobLock
Please add all job servers and query servers to the kylin.server.cluster-servers.
Kylin can send email notifications when a job completes or fails. To enable this, edit
conf/kylin.properties and set the following parameters:
mail.enabled=true
mail.host=your-smtp-server
mail.username=your-smtp-account
mail.password=your-smtp-pwd
mail.sender=your-sender-address
kylin.job.admin.dls=administrator-address
Restart the Kylin server for the changes to take effect. To disable, set mail.enabled back to false.
Administrators will get notifications for all jobs. Modelers and analysts need to enter their
email addresses into the “Notification List” on the first page of the cube wizard, and will
then get notified for that cube.
Kylin can use MySQL as the metadata storage for scenarios where HBase is not the best
option. To enable this, perform the following steps:
kylin.metadata.url={your_metadata_tablename}@jdbc,url=jdbc:mysql://localhost:3306/kylin,username={your_username},password={your_password},driverClassName=com.mysql.jdbc.Driver
kylin.metadata.jdbc.dialect=mysql
kylin.metadata.jdbc.json-always-small-cell=true
kylin.metadata.jdbc.small-cell-meta-size-warning-threshold=100mb
kylin.metadata.jdbc.small-cell-meta-size-error-threshold=1gb
kylin.metadata.jdbc.max-cell-size=1mb
More configuration items can be added in kylin.metadata.url; url, username and
password are required items, and the other items use their default values if not configured.
If you need to encrypt the password, you can generate an encrypted value with the
following command:
cd $KYLIN_HOME/tomcat/webapps/kylin/WEB-INF/lib
java -classpath kylin-server-base-<version>.jar:kylin-core-common-<version>.jar:spring-beans-4.3.10.RELEASE.jar:spring-core-4.3.10.RELEASE.jar:commons-codec-1.7.jar org.apache.kylin.rest.security.PasswordPlaceholderConfigurer AES <your_password>
Start Kylin