The hive-warehouse-connector jar released as part of HDP 3.1.5 has many third-party jars embedded in it, which conflict with Oozie. To solve this issue, you have to get the HWC dev (hotfix) jar, which does not contain the conflicting classes. Internal JIRAs tracking this issue: BUG-122013, BUG-122269.
For example:
199679223 2021-01-17 08:48 hive-warehouse-connector-assembly-1.0.0.3.1.5.0-152.jar // actual jar
56340621 2021-01-17 08:36 hive-warehouse-connector-assembly-1.0.0.3.1.5.0-152_dev.jar // dev jar
In this demo we use the dev jar hive-warehouse-connector-assembly-1.0.0.3.1.5.0-152_dev.jar. You may get this dev jar from the Cloudera Support team.
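To confirm what was removed, you can list and compare the contents of the two jars (full.txt and dev.txt are just scratch files for the comparison):
jar tf hive-warehouse-connector-assembly-1.0.0.3.1.5.0-152.jar | sort > full.txt
jar tf hive-warehouse-connector-assembly-1.0.0.3.1.5.0-152_dev.jar | sort > dev.txt
diff full.txt dev.txt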
Step 1) Refer to this HWC Demo project and build Spark-HWC-test-1.0.jar.
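The workflow in Step 3 runs the class com.cloudera.SparkHWCDemo with a source and a destination table name as arguments. A minimal sketch of such a class, assuming the standard HDP 3.x HWC API (the real demo project may differ in details):

package com.cloudera

import com.hortonworks.hwc.HiveWarehouseSession
import org.apache.spark.sql.SparkSession

// Hypothetical sketch of the demo class named in workflow.xml.
object SparkHWCDemo {
  def main(args: Array[String]): Unit = {
    val srcTable = args(0) // e.g. default.employee
    val dstTable = args(1) // e.g. default.employee2

    val spark = SparkSession.builder().appName("HWC-Spark-Demo").getOrCreate()
    val hive = HiveWarehouseSession.session(spark).build()

    // Read the managed source table through HWC (LLAP execution)
    val df = hive.executeQuery(s"SELECT * FROM $srcTable")

    // Write the rows into the destination managed table via HWC
    df.write
      .format(HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR)
      .option("table", dstTable)
      .save()

    spark.stop()
  }
}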
Step 2) Create a table default.employee in Hive and load some data.
Create the table:
CREATE TABLE IF NOT EXISTS employee ( eid int, name String, salary String, destination String)
COMMENT 'Employee details'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
Next, load the sample data file /tmp/data.txt into the table (it must first be uploaded to HDFS; the command is shown after the listing). The file contents:
cat /tmp/data.txt
1201,Emp1,45000,Technical manager
1202,Emp2,45000,Proof reader
1203,Emp3,40000,Technical writer
1204,Emp4,40000,Hr Admin
1205,Emp5,30000,Op Admin
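Upload the file to HDFS so that LOAD DATA INPATH can find it (assuming the same /tmp path on HDFS):
hdfs dfs -put /tmp/data.txt /tmp/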
LOAD DATA INPATH '/tmp/data.txt' OVERWRITE INTO TABLE employee;
Step 3) Create the Oozie workflow.xml and job.properties, and place Spark-HWC-test-1.0.jar and hive-warehouse-connector-assembly-1.0.0.3.1.5.0-152_dev.jar in the ./lib directory.
Local files:
cd /home/hr-user1/oozie_HWC_demo
ll -R
.:
total 12
-rw------- 1 hr-user1 hadoop 158 Jan 16 16:36 hr-user1.keytab
-rw-r--r-- 1 hr-user1 hadoop 351 Jan 18 06:05 job.properties
drwxr-xr-x 2 hr-user1 hadoop 115 Jan 16 16:26 lib
-rw-r--r-- 1 hr-user1 hadoop 2217 Jan 18 06:08 workflow.xml
./lib:
total 55032
-rw-r--r-- 1 hr-user1 hadoop 4650 Jan 16 16:15 Spark-HWC-test-1.0.jar
-rw-r--r-- 1 hr-user1 hadoop 56340621 Jan 16 16:26 hive-warehouse-connector-assembly-1.0.0.3.1.5.0-152_dev.jar
cat workflow.xml
<?xml version="1.0" encoding="UTF-8"?>
<workflow-app xmlns="uri:oozie:workflow:0.4" name="HWC-Spark-Demo-Workflow">
<credentials>
<credential name="hs2-creds" type="hive2">
<property>
<name>hive2.server.principal</name>
<value>hive/[email protected]</value>
</property>
<property>
<name>hive2.jdbc.url</name>
<value>jdbc:hive2://c220-node2.coelab.cloudera.com:2181,c220-node3.coelab.cloudera.com:2181,c220-node4.coelab.cloudera.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2</value>
</property>
</credential>
</credentials>
<start to="spark-hwc" />
<action name="spark-hwc" cred="hs2-creds">
<spark xmlns="uri:oozie:spark-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<master>${master}</master>
<name>HWC-Spark-Demo</name>
<class>com.cloudera.SparkHWCDemo</class>
<jar>${nameNode}${jarPath}/Spark-HWC-test-1.0.jar</jar>
<spark-opts>--conf spark.yarn.security.tokens.hiveserver2.enabled=true --keytab hr-user101.keytab --principal [email protected] --conf spark.datasource.hive.warehouse.load.staging.dir=/tmp --conf spark.datasource.hive.warehouse.metastoreUri=thrift://c220-node3.coelab.cloudera.com:9083 --conf spark.hadoop.hive.llap.daemon.service.hosts=@llap0 --conf spark.sql.hive.hiveserver2.jdbc.url.principal=hive/[email protected] --conf spark.security.credentials.hiveserver2.enabled=true --conf spark.sql.hive.hiveserver2.jdbc.url="jdbc:hive2://c220-node2.coelab.cloudera.com:2181,c220-node3.coelab.cloudera.com:2181,c220-node4.coelab.cloudera.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-interactive" --conf spark.sql.hive.zookeeper.quorum="c220-node2.coelab.cloudera.com:2181,c220-node3.coelab.cloudera.com:2181,c220-node4.coelab.cloudera.com:2181"</spark-opts>
<arg>${arg1}</arg>
<arg>${arg2}</arg>
<file>${nameNode}${jarPath}/hr-user1.keytab#hr-user101.keytab</file>
</spark>
<ok to="end" />
<error to="fail" />
</action>
<kill name="fail">
<message>Workflow is Failed! message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end" />
</workflow-app>
cat job.properties
nameNode=hdfs://c220-node2.coelab.cloudera.com:8020
jobTracker=c220-node2.coelab.cloudera.com:8050
queueName=default
appname=HWC-Spark-Demo
oozie.use.system.libpath=true
oozie.wf.application.path=/tmp/oozie_HWC_demo
jarPath=/tmp/oozie_HWC_demo/lib
master=yarn-cluster
oozie.action.sharelib.for.spark=spark
arg1=default.employee
arg2=default.employee2
Step 4) Create /tmp/oozie_HWC_demo in HDFS and upload hr-user1.keytab, the lib directory, and workflow.xml. Note that the keytab goes under lib/ so it matches the ${jarPath} reference in workflow.xml; staging commands are shown below.
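A typical way to stage the files (run from /home/hr-user1/oozie_HWC_demo):
hdfs dfs -mkdir -p /tmp/oozie_HWC_demo
hdfs dfs -put workflow.xml lib /tmp/oozie_HWC_demo/
hdfs dfs -put hr-user1.keytab /tmp/oozie_HWC_demo/lib/
HDFS files: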
hdfs dfs -ls -R /tmp/oozie_HWC_demo/
drwxr-xr-x - hr-user1 hdfs 0 2021-01-18 06:55 /tmp/oozie_HWC_demo/lib
-rw-r--r-- 3 hr-user1 hdfs 4650 2021-01-16 16:15 /tmp/oozie_HWC_demo/lib/Spark-HWC-test-1.0.jar
-rw-r--r-- 3 hr-user1 hdfs 56340621 2021-01-18 06:55 /tmp/oozie_HWC_demo/lib/hive-warehouse-connector-assembly-1.0.0.3.1.5.0-152_dev.jar
-rw-r--r-- 3 hr-user1 hdfs 158 2021-01-16 16:37 /tmp/oozie_HWC_demo/lib/hr-user1.keytab
-rw-r--r-- 3 hr-user1 hdfs 2361 2021-01-18 07:09 /tmp/oozie_HWC_demo/workflow.xml
Step 5) Submit the Oozie job.
oozie job -oozie http://c220-node4.coelab.cloudera.com:11000/oozie/ -config job.properties -run
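The run command prints a workflow job ID, which you can also poll from the CLI (<job-id> is a placeholder for the returned ID):
oozie job -oozie http://c220-node4.coelab.cloudera.com:11000/oozie/ -info <job-id>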
This Spark job reads data from the Hive managed table default.employee and inserts it into default.employee2.
Step 6) Check the job status from the Oozie UI. If the status is 'SUCCEEDED':
a) Connect to Hive and run select * from employee2; to verify the result in the destination table employee2.
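For example, using Beeline with the same discovery URL as in the workflow credentials (after obtaining a Kerberos ticket with kinit):
beeline -u "jdbc:hive2://c220-node2.coelab.cloudera.com:2181,c220-node3.coelab.cloudera.com:2181,c220-node4.coelab.cloudera.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2" -e "select * from employee2;"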