POC Issues 0327

spark-shell --packages com.hortonworks:shc-core:1.1.1-2.1-s_2.11 \
  --repositories http://repo.hortonworks.com/content/groups/public/ \
  --files /etc/hbase/4.2.5.0-0000/0/hbase-site.xml \
  --jars /usr/iop/4.2.5.0-0000/hbase/lib/hbase-client.jar,/usr/iop/4.2.5.0-0000/hbase/lib/hbase-protocol.jar,/usr/iop/4.2.5.0-0000/hbase/lib/hbase-common.jar,/usr/iop/4.2.5.0-0000/hbase/lib/hbase-server.jar,/usr/iop/4.2.5.0-0000/hbase/lib/guava-12.0.1.jar,/usr/iop/4.2.5.0-0000/hbase/lib/htrace-core-3.1.0-incubating.jar,/usr/iop/4.2.5.0-0000/hbase/lib/zookeeper.jar,/usr/iop/4.2.5.0-0000/hbase/lib/protobuf-java-2.5.0.jar,/usr/iop/4.2.5.0-0000/hbase/lib/hbase-hadoop2-compat.jar,/usr/iop/4.2.5.0-0000/hbase/lib/hbase-hadoop-compat.jar,/usr/iop/4.2.5.0-0000/hbase/lib/metrics-core-2.2.0.jar,/usr/iop/4.2.5.0-0000/hbase/lib/htrace-core-3.1.0-incubating.jar,/usr/iop/4.2.5.0-0000/hbase/lib/hbase-spark.jar,/usr/iop/4.2.5.0-0000/hive2/lib/hive-hbase-handler.jar,/usr/iop/4.2.5.0-0000/hadoop/lib/hadoop-lzo-0.5.1.jar \
  --master yarn

Reading HDFS File

sc.textFile("hdfs:/Data/csc_insights/disability/rdz/cbs/member/preferences/ONETIME_CBS_TCBECSP_20171
(l.substring(0, 10).trim())).toDF.show(5)

Querying HDFS File


val textFile = sc.textFile("hdfs:/Data/csc_insights/disability/rdz/cbs/member/preferences/ONETIME_CBS_TCBECSP_20171018021626.DAT")
textFile.distinct.count
textFile.toDF.registerTempTable("Sample")
val result = sqlContext.sql("select count(distinct(value)) from Sample")
result.show

Filter Condition

HBase table
import org.apache.hadoop.hbase.spark
import org.apache.spark.sql.{SQLContext, _}
import org.apache.spark.sql.execution.datasources.hbase._
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.{HTableDescriptor,HColumnDescriptor}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.client.{Put,HTable}
import org.apache.hadoop.fs.{Path, FileAlreadyExistsException, FileSystem}
import org.apache.hadoop.hbase.client._
import org.apache.hadoop.hbase.spark._
val conf = HBaseConfiguration.create()
conf.addResource(new Path("/etc/hbase/4.2.5.0-0000/0/hbase-site.xml"))
conf.addResource(new Path("/etc/hbase/4.2.5.0-0000/0/core-site.xml"))
val hbaseContext = new HBaseContext(sc, conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val accountMapping = s"""rowkey INTEGER :key, SRC_SYS_NM STRING b:SRC_SYS_NM, RPT_NUM STRING b:RPT_NUM""".stripMargin
val accountdf = sqlContext.read.format("org.apache.hadoop.hbase.spark").option("hbase.columns.mapping", accountMapping).option("hbase.table", "T_RPT_NUM_ACT").load().persist
accountdf.registerTempTable("edms_qa_test")

val result1 =sqlContext.sql("select count(*) from edms_qa_test where SRC_SYS_NM = 'UDS'")


val result1 =sqlContext.sql("select * from edms_qa_test limit 5")

Joining HBase and HDFS file
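No join example survived the export in this section, so the following is a minimal sketch only: it reuses the accountdf DataFrame loaded from HBase above and the HDFS file read earlier, and assumes that the first 10 characters of each flat-file record carry the RPT_NUM join key (that column choice is an assumption, not taken from the original sheet).

val hdfsDF = sc.textFile("hdfs:/Data/csc_insights/disability/rdz/cbs/member/preferences/ONETIME_CBS_TCBECSP_20171018021626.DAT")
  .map(l => l.substring(0, 10).trim)
  .toDF("RPT_NUM")                                   // assumed join key extracted from the flat file
val joined = accountdf.join(hdfsDF, Seq("RPT_NUM"))  // accountdf is the HBase-backed DataFrame loaded above
joined.select("RPT_NUM", "SRC_SYS_NM").show(5)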


CREATE VIEW T_CLM_PY
( ROWKEY VARCHAR PRIMARY KEY,
b."SRC_SYS_NM" VARCHAR,
b."CLM_GUID " VARCHAR,
b."CLM_NUM" VARCHAR,
b."PMT_ID" VARCHAR,
b."PY_HIST_IND" VARCHAR,
e."PAYMENTCONTACTHISTORY" VARCHAR);

CREATE VIEW T_CLM
( ROWKEY VARCHAR PRIMARY KEY,
b."SRC_SYS_NM" VARCHAR,
b."CLM_NUM" VARCHAR,
e."PAYMENTCONTACTHISTORY" VARCHAR);

SELECT COUNT(*) FROM "T_CLM_PY";

SELECT * FROM "T_CLM_PY" WHERE "e"."PAYMENTCONTACTHISTORY" IS NOT NULL LIMIT 5;


The Bravo team has provided a build that will do the below comparisons using Spark:
a. Base object: (Hive or HDFS or HBASE) vs (Hive or HDFS or HBASE)
b. Multiple objects: (Hive & HDFS) vs (Hbase_1 & Hbase_2), and all other combinations as well

Process:
1. The comparison is done using Spark. MapReduce comparison is not available in this new build.
2. As of now, the build supports and reads data from the Hive, HDFS & HBASE components.
3. The process reads the data from Hive, HDFS files or HBASE tables and creates & loads them into Spark-SQL tables (a minimal sketch of this flow is shown after this list).
4. The output files are written to HDFS. Since this is Spark, it is not possible to write log files to the local Unix server [the way our current build works is not applicable here].
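A minimal sketch of this flow, assuming two delimited HDFS files with the same layout (the paths are placeholders, not the build's actual internals):

// Hypothetical sketch: read two sources, register them as Spark SQL tables, and diff them.
val srcDF = sqlContext.read.csv("/user/jmichael2/source.csv")   // placeholder path
val tgtDF = sqlContext.read.csv("/user/jmichael2/target.csv")   // placeholder path
srcDF.registerTempTable("SRC")
tgtDF.registerTempTable("TGT")
// rows present on one side but missing on the other
sqlContext.sql("select * from SRC except select * from TGT").show(5)
sqlContext.sql("select * from TGT except select * from SRC").show(5)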

Below are the prerequisites:


1. Java 1.8
2. Spark 2.1
3. Access to Namenode [NT10] - We are seeing issues with the ZooKeeper services on edge node [ET01]; on Namenode [NT10], it is running successfully
4. Access to write files to HDFS layer

POCs
1. HDFS vs HDFS comparison
/*HDFS vs HDFS*/
spark-submit --master local[*] --class cog.bravo.sparkComp.SparkComp_3 /hadoop/dsa/qa/jmichael2/Bravo_Next_Gen_Biz_QA_Unix/DataLake_Processing_QA/DataLake_Test_Case_Executor/cogBrv3897_Metlife.jar \
/hadoop/dsa/qa/jmichael2/Bravo_Next_Gen_Biz_QA_Unix/DataLake_Processing_QA/DataLake_Test_Case_Executor "logs" "TC_1" \
SparkSQL_Obj "" "" "<SparkSQLObj> <object> <source_type>FILE</ <source>/user/jmich <alias>csv_src</alias><header>false</head<delimiter </object> <sql> select * from csv_src </sql></SparkSQLObj>" "" "" "" "" "" "" "" "" "" "" "," "1" \
SparkSQL_Obj "" "" "<SparkSQLObj> <object> <source_type>FILE</ <source>/user/jmich <alias>csv_src</alias><header>false</head<delimiter </object> <sql> select * fr </sql></SparkSQLObj>" "" "" "" "" "" "" "" "" "" "" "," "1" \
true "true" "/user/jmichael2/test" "" "true"

Output files:
2. HIVE vs HIVE comparison
/*HIVE vs HIVE*/
spark-submit --master local[*] --packages org.apache.hbase:hbase-common:1.0.0,org.apache.hbase:hbase-client:1.0.0,org.apache.hbase:hbase-server:1.0.0 --class cog.bravo.sparkComp.SparkComp_3 /hadoop/dsa/qa/jmichael2/Bravo_Next_Gen_Biz_QA_Unix/DataLake_Processing_QA/DataLake_Test_Case_Executor/cogBrv3897_Metlife.jar \
/hadoop/dsa/qa/jmichael2/Bravo_Next_Gen_Biz_QA_Unix/DataLake_Processing_QA/DataLake_Test_Case_Executor "logs" "TC_1" \
SparkSQL_Obj "" "" "<SparkSQLObj> <object> <source_type>HIVE</<source>select * fr <alias>DPA</object> <sql> select SUB </sql></SparkSQLObj>" "" "" "" "" "" "" "" "" "" "" "," "1" \
SparkSQL_Obj "" "" "<SparkSQLObj> <object> <source_type>HIVE</<source>select * fr <alias>DPA</object> <sql> select SUB </sql></SparkSQLObj>" "" "" "" "" "" "" "" "" "" "" "," "1" \
true "true" "/user/jmichael2/test" "" "true"

Output Files:

3. HBASE vs HIVE
/*HIVE vs HBASE - NT10*/
spark-submit --master local[*] --packages org.apache.hbase:hbase-common:1.0.0,org.apache.hbase:hbase-client:1.0.0,org.apache.hbase:hbase-server:1.0.0 --class cog.bravo.sparkComp.SparkComp_3 /home/METNET/jmichael2/cogBrv3897_Metlife0320.jar \
/home/METNET/jmichael2 "logs" "TC_1" \
SparkSQL_Obj "" "" "<SparkSQLObj> <object> <source_type>HIVE</<source>select * fr <alias>DPA</object> <sql> select SUB </sql></SparkSQLObj>" "" "" "" "" "" "" "" "" "" "" "," "1" \
SparkSQL_Obj "" "" "<SparkSQLObj> <object> <source_type>HBASE<<source>T_RPT_NUM_ <alias>DPA_TGT</alia <columns></object> <sql> select B_R </sql></SparkSQLObj>" "" "" "" "" "" "" "" "" "" "" "," "1" \
true "true" "/user/jmichael2/test" "" "true"
1. Querying Larger data sets
Issue: Error: Operation timed out. (state=TIM01,code=6000)
Root Cause: This is primarily seen with queries running on larger data sets because the default Phoenix configurations are hitting timeout limits.

Resolution:
To resolve this issue, make sure that the HBASE_CONF_PATH environment variable is set before launching sqlline.py. This variable should point to the HBase config directory.

1) Update or add the following configs to hbase-site.xml (rendered as XML property entries after this list):


--> phoenix.query.timeoutMs=1800000
--> hbase.regionserver.lease.period = 1200000
--> hbase.rpc.timeout = 1200000
--> hbase.client.scanner.caching = 1000
--> hbase.client.scanner.timeout.period = 1200000
2) Restart Hbase services to make these changes effective.
3) export HBASE_CONF_PATH=/etc/hbase/conf
4) Launch sqlline.py
5) run the same query that is failing
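For reference, the same settings rendered as hbase-site.xml property entries (values taken from the list above; adjust to your cluster):

<property><name>phoenix.query.timeoutMs</name><value>1800000</value></property>
<property><name>hbase.regionserver.lease.period</name><value>1200000</value></property>
<property><name>hbase.rpc.timeout</name><value>1200000</value></property>
<property><name>hbase.client.scanner.caching</name><value>1000</value></property>
<property><name>hbase.client.scanner.timeout.period</name><value>1200000</value></property>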

2. Running join queries


Issue: Size of hash cache (104857608 bytes) exceeds the maximum allowed size (104857600 bytes) [100 MB]
Root Cause: The cache size for processing the data is not sufficient
Resolution: Need to increase the buffer size of the hash cache (see the note below)
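Note (an assumption to verify against your Phoenix version, not taken from the original sheet): the 100 MB ceiling in the error message matches the default of Phoenix's phoenix.query.maxServerCacheBytes, so raising that property in the hbase-site.xml used by the Phoenix client is the usual fix, for example:

<property><name>phoenix.query.maxServerCacheBytes</name><value>209715200</value></property>  <!-- 200 MB, up from the 100 MB default -->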
3. Unable to load RDZ data

> We need to load RDZ data into Phoenix for doing the RDZ against EOS comparison
> Phoenix supports importing files only in .csv format
> Any tables that are created in Phoenix are also created in HBASE
> Hence, we always need to append a primary key to the RDZ file (a minimal sketch is shown after this list)
> This primary key will be taken as the row_key when the data is inserted into HBASE
> When we load the data using the below two options, we are getting errors
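A minimal sketch of appending a synthetic primary key, assuming the RDZ extract is a delimited text file on HDFS and a sequence number is acceptable as the key (paths are placeholders):

// Prepend a synthetic key to each RDZ record so Phoenix has a row key at load time
val rdz = sc.textFile("/user/jmichael2/test/rdz_extract.csv")              // placeholder input path
val withKey = rdz.zipWithIndex.map { case (line, idx) => s"$idx,$line" }   // key becomes the first column
withKey.coalesce(1).saveAsTextFile("/user/jmichael2/test/rdz_with_key")    // feed this output to psql/CsvBulkLoadTool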

Option 1: Using phoenix-psql method


phoenix-psql -t Sample_edms_qa localhost /home/METNET/jmichael2/ONETIME_UDS_DPA_Conversion_20180119011722.csv

Root cause & Resolution: Unable to find


Option 2: Using phoenix jar and csvBulkLoadTool
hadoop jar /usr/iop/4.2.5.0-0000/phoenix/phoenix-4.8.1-HBase-1.2.0-IBM-21-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool -Dfs.permissions.umask-mode=000 --table Sample_edms_qa --input user/jmichael2/test/ONETIME_UDS_DPA_Conversion_trial.csv

Root Cause:
This happens when the user has an incorrect value defined for "zookeeper.znode.parent" in the hbase-site.xml sourced on the client side, or, in the case of a custom API, the "zookeeper.znode.parent" was incorrectly updated to a wrong location.
For example, the default "zookeeper.znode.parent" is set to "/hbase-unsecure", but if you incorrectly specify it as, say, "/hbase" as opposed to what is set up in the cluster, you will encounter this exception while trying to connect to the HBase cluster.

Resolution:
The solution here is to update the hbase-site.xml / source the same hbase-site.xml from the cluster, or update the HBase API to correctly point to the "zookeeper.znode.parent" value as set in the HBase cluster (see the illustrative entry below).
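For illustration, the client-side hbase-site.xml entry that must match the cluster (the value shown is the default mentioned above):

<property><name>zookeeper.znode.parent</name><value>/hbase-unsecure</value></property>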


1. Timeout error during huge count tables
Cause:
The above error (java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]) suggests that there has been a timeout.
The stack shows that the query plan calls broadcast joins; awaitResult has a default timeout value of 300 seconds for the broadcast wait time in broadcast joins.
The above error is displayed when this default timeout value is exceeded.
Solution:
To resolve this issue, increase the default value of 300 for spark.sql.broadcastTimeout to 1200 (a minimal way to apply this is shown below).
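A minimal way to apply this from a Spark shell session (it can also be passed on spark-submit as --conf spark.sql.broadcastTimeout=1200):

// Raise the broadcast wait time from the 300-second default before re-running the count
sqlContext.setConf("spark.sql.broadcastTimeout", "1200")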
