Big Data Developer

What is Big Data

Cloudera Software Installation


Hadoop Commands
Serialized File Formats
sqoop import
Sqoop Multiple Mappers
import portion of data
Sqoop eval and change the file delimiter
incremental import
Password Protection
Using Last Modified
Import multiple File Formats
Import multiple Tables
Handling Null during Import
Sqoop export
Sqoop Performance Tuning
HIVE
What is Hive
Create and load a table in Hive
Hive Table Types
Hive Partitions
Hive Use Case
Hive Buckets
Schema Evolution in Hive
create a sqoop job:
Working with Dates in Hive
Joins in Hive
MSCK Repair
Performance Tuning in Hive
Hive vs SQL
Installing and setting up Spark and Scala - Download links
SCALA
Scala Collections :
Collection Methods
SPARK
RDD Basics - Reading and Writing a File
Use Case - Analyze the Log Data
Spark Seamless Dataframe- Reading and Writing
Reading and Writing XML Data
Let's explore more transformations
Working with Fixed Width File
Parameterize using Config File
Reading Json from a web URL and flattening it
Flattening data by creating a Function
Working with HBase
Spark HBase Integration
Cassandra setup and working with Cassandra
Cassandra Spark Integration
Apache NIFI Installation
kafka installation and topic creation
Spark Kafka Integration
Spark Structured Streaming
Spark Kafka Integration

What is Big Data
What is Big Data ?
•Big data is a collection of data that is huge in volume and grows exponentially
with time.
•Examples of Big Data: data from social media, sales details from big retailers
like Walmart, jet engine data which can generate more than 10 terabytes in 30
minutes of flight time, etc.
Types of Big Data:
•Structured . Eg : Tables
•Semi Structured . Eg : XML
•Unstructured
There are many tools and programs which help in processing big data. Some
of them are :
•Hive •Spark •Kafka •NoSQL Db •Presto •Flink •Hudi •Druid …..

Cloudera Software Installation


1. Download and install the 7-Zip software for extracting the zipped folder. This is only for Windows users.
https://www.7-zip.org/a/7z1900-x64.exe

2. Download VirtualBox:


For Windows : https://download.virtualbox.org/virtualbox/6.1.18/VirtualBox-6.1.18-142142-Win.exe
For Mac : https://download.virtualbox.org/virtualbox/6.1.18/VirtualBox-6.1.18-142142-OSX.dmg

3. Download Putty. This is for Windows only.


4. Download WinSCP.
5. Download the Cloudera file. Please use any of the links below to download it. Since the
file is huge, there may be restrictions on the downloads. If one link is not working, please use another
link. We will keep the links updated to accommodate more bandwidth.
https://drive.google.com/file/d/1krXOKn6eXdzZG23zRmMg0OXfHvPfSzu-/view?usp=sharing
https://drive.google.com/file/d/1xsqG8vDqMswsZh69xHjtzovlP5nyAtAh/view?usp=sharing
https://drive.google.com/file/d/13JN5whWK2W48Dcw-KjyA4d2rbtNCHj-S/view?usp=sharing
https://drive.google.com/file/d/1uHwPqIYa1ADSDC0cdlbLgP-zIWcSt388/view?usp=sharing
https://drive.google.com/file/d/1tksxTipPYeCTrtpumOYBe-EC8PCT6ZBM/view?usp=sharing
https://drive.google.com/file/d/1Egq2s3COvDjzcm2COXiT_Chblr-rQz0-/view?usp=sharing
https://gofile.io/d/rGHK0q
https://gofile.io/d/i6nf2H
Once the software is downloaded, install it by following the instructions in the video.

Hadoop Commands
1. To list the files in HDFS:
hdfs dfs -ls
hdfs dfs -ls /user/cloudera/

2. create directory in HDFS:


hdfs dfs -mkdir /user/cloudera/dir1
hdfs dfs -mkdir /user/cloudera/dir2

3. listing files in hadoop :


hdfs dfs -ls /user/cloudera/ #list files
hdfs dfs -ls /user/cloudera/emp #list files in emp directory

4. copy from local file system to HDFS :


hdfs dfs -copyFromLocal /home/cloudera/test.csv /user/cloudera/

5. put: Copies files from the local file system to the destination file system.
This command can also read input from stdin and write to
the destination file system.
hdfs dfs -put localfile1 localfile2 /user/cloudera/hadoopdir;

6. moveFromLocal: Works similarly to the put command, except that the
source is deleted after it is copied.
hdfs dfs -moveFromLocal localfile1 localfile2 /user/cloudera/hadoopdir

7. cp: Copies one or more files from a specified source to a specified
destination. If you specify multiple sources, the specified destination must be
a directory.
hdfs dfs -cp /user/cloudera/file1 /user/cloudera/file2 /user/cloudera/dir

8. Copy from HDFS to Local


hdfs dfs -copyToLocal /user/cloudera/test.csv /home/cloudera/test.csv

9. get: Copies files to the local file system


hdfs dfs -get /user/cloudera/file3 localfile

10. To view the contents of a file :


hdfs dfs -cat /user/cloudera/student.txt

11. text: Outputs a specified source file in text format. Valid input file formats
are zip and TextRecordInputStream.
hdfs dfs -text /user/cloudera/file8.zip

12. touchz: Creates a new, empty file of size 0 in the specified path.
hdfs dfs -touchz /user/cloudera/file12

13. test: Checks attributes of the specified file or directory and returns an exit
code. Common flags are -e (exists), -d (is a directory) and -f (is a file).
hdfs dfs -test -e /user/cloudera/dir1
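A minimal illustration (reusing the directory created earlier): combine a -test flag with the shell exit code to check whether a path exists.
hdfs dfs -test -d /user/cloudera/dir1
echo $?   # prints 0 if dir1 exists and is a directory, non-zero otherwise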

14. groups : To know the existing groups :


hdfs groups

15. chmod: Changes the permissions of files.


hdfs dfs -chmod 777 test/data1.txt

16. count: Counts the number of directories, files, and bytes under the paths
that match the specified file pattern.
hdfs dfs -count emp #emp is directory

17. du: Displays the size of the specified file, or the sizes of files and
directories that are contained in the specified directory.
hdfs dfs -du /user/cloudera/dir1 /user/cloudera/file1

18. stat: Displays information about the specified path.


hdfs dfs -stat /user/cloudera/dir1

19. tail: Displays the last kilobyte of a specified file to stdout.
hdfs dfs -tail /user/cloudera/student.txt

20. fsck generates a summary report that lists the overall health of the
filesystem. HDFS is considered healthy if—and only if—all files have a
minimum number of replicas available.
hadoop fsck /file1.txt
hadoop fsck /file1.txt -files -blocks -locations

21. remove files:


a) rm: Deletes one or more specified files.
hdfs dfs -rm /user/cloudera/dir/file9
b) rmr: Serves as the recursive version of -rm.
hdfs dfs -rmr /user/cloudera/dir
22. To come out of safe mode :
hadoop dfsadmin -safemode leave

Serialized File Formats
Serialization :
Serialization is the process of converting a data structure or an object into a
format that can be easily stored or transmitted over a network and can be
easily reconstructed later.
For example, if I have the below data:
"This is my Data"
This data will be serialized, i.e. converted into a stream of bytes (a human-unreadable
format). So serialized data is unreadable. It is either stored or
transmitted over a network. It is then deserialized, i.e. converted back to its
original form, which is readable.

Advantages of Serialization :
1) Serialized data is easy and fast to transmit over a network. Reads and writes
are fast
2) Some serialized file formats offer good compression, and they can be encrypted
3) Deserialization also does not take much time

Serialized File Formats in Big Data :


1) Sequence file format
2) RC File format
3) ORC File Format
4) AVRO File Format
5) Parquet File Format

For more details on these files, check the Hive document.

1) Sequence File Format : It is pure Java serialization. It helps in easy retrieval
of data, but since it is Java serialization it works best for MapReduce only. It
does not work well with Spark or other technologies. Since MapReduce is rarely
used any more, the Sequence format is rarely used as well.

2) RC File Format (Row Columnar File Format): Writes take time; reads
are easy, since columnar file formats are easy to read. There is no compression
technique, so the data size would be huge.

3) Parquet : Columnar file format with a reasonable compression ratio (50
to 55%).
Compression codecs supported by Parquet : snappy (default), lzo, gzip
Parquet is used in target systems as querying is fast.

4) ORC : Optimized Row Columnar File Format. It is a columnar file format
with about a 75% compression ratio. Use it when you are dealing with historical data
that you are not using often, because decompression will take time. So ORC is used for
historical data storage.
ORC uses ZLIB compression, which provides much more compression. It also
supports Snappy, which compresses less than ZLIB.

5) AVRO - Row File Format
The file size is big. Writes are easy; reads are tougher because it is a row file format.
AVRO supports schema evolution (changing the structure of the data).
The Sequence and RC file formats are not used much. Parquet is the
most preferred one, followed by both ORC and AVRO.

Note :
Sqoop supports - Sequence, Parquet, AVRO directly
ORC - Sqoop needs Hive Integration

AVRO vs ORC vs Parquet :


1) AVRO is a row Format while ORC and Parquet are columnar format

2) Because AVRO is row Format, it is write heavy while ORC and Parquet
are read heavy.

3) In all these 3 file formats, along with data, the schema will also be there.
This means you can take these files from one machine and load it in another
machine and it will know what the data is about and will be able to process

4) All these file formats can be split across multiple disks. Therefore
scalability and parallel processing are not an issue.

5) ORC provides maximum compression, followed by Parquet, followed by AVRO.
6) If the schema keeps changing, then AVRO is preferred as AVRO supports
superior schema evolution. ORC also supports it to some extent, but AVRO is
best.

7) AVRO is commonly used in streaming apps like Kafka, Parquet is
commonly used in Spark and ORC in Hive. Parquet is good at storing nested
data, so Parquet is usually used in Spark.
8) Because ORC compresses more, it is used for space efficiency. Query
time will be a little higher as it has to decompress the data. In the case of
AVRO, the time efficiency is good as it takes a little less time to
decompress. So if you want to save space, use ORC; if you want to save time,
use Parquet.
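As a quick reference, here is a minimal sketch (illustrative table name and columns) of how the storage format is chosen when creating a Hive table; it is simply declared in the DDL:
create table sales_orc (id int, amount double) stored as orc;
create table sales_parquet (id int, amount double) stored as parquet;
create table sales_avro (id int, amount double) stored as avro;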

sqoop import
sqoop import:

sqoop import --connect jdbc:mysql://localhost:3306/retail_db --username root
--password cloudera --m 1 --table customers --target-dir /user/cloudera/data_import

Executing the same command again fails, because the target directory already exists. So use --append:
sqoop import --connect jdbc:mysql://localhost:3306/retail_db --username root
--password cloudera --m 1 --table customers --append --target-dir /user/cloudera/data_import

What if we want to overwrite? Use --delete-target-dir:
sqoop import --connect jdbc:mysql://localhost:3306/retail_db --username root
--password cloudera --m 1 --table customers --delete-target-dir --target-dir /user/cloudera/data_import

Sqoop Multiple Mappers


Multiple threading in Sqoop:
Multiple mappers will be working on importing the data, i.e. multiple threads
will work on importing the file. This is basically parallelism. We specify
the mappers using -m. So -m 1 is one mapper, i.e. one thread only, so you will
see only one part file in the output directory.

Add 2 mappers here. Add a split-by column as well:
sqoop import --connect jdbc:mysql://localhost:3306/retail_db --username root
--password cloudera --m 2 --split-by customer_id --table customers --delete-target-dir --target-dir /user/cloudera/data_import

Now what happens if you don't specify the number of mappers? If you don't
specify the number of mappers, it will by default take 4.

If you don't specify --split-by when the number of mappers is more than 1, the import
will fail. However, if the table has a primary key column, Sqoop will take that
primary key column as the split-by column.

Delete the target directory if exists:


sqoop import --connect jdbc:mysql://localhost:3306/retail_db --username root
--password cloudera --table customers --delete-target-dir --target-dir
/user/cloudera/data_import

import portion of data


using where:
sqoop import --connect jdbc:mysql://localhost:3306/retail_db --username root
--password cloudera --m 2 --table customers --split-by customer_id
--where 'customer_state="TX"' --delete-target-dir --target-dir
/user/cloudera/where_eg

using select columns:


sqoop import --connect jdbc:mysql://localhost:3306/retail_db --username root
--password cloudera --m 2 --table customers --split-by customer_id
--where 'customer_state="TX"' --columns 'customer_fname,customer_lname,customer_state,customer_city'
--delete-target-dir --target-dir /user/cloudera/where_eg

query:
sqoop import --connect jdbc:mysql://localhost:3306/retail_db --username root
--password cloudera --m 2 --split-by customer_id
--query 'select customer_fname,customer_lname,customer_state,customer_city from
customers where $CONDITIONS' --delete-target-dir --target-dir /user/cloudera/query_eg
Note: with --query you must write a full select statement and you should not also pass --table.

$CONDITIONS: When you are executing Sqoop in parallel, each Sqoop mapper
will replace $CONDITIONS with a unique condition expression.
Here one mapper may execute your query with a condition like customer_id < 5
and another mapper with customer_id >= 5.
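For illustration only (the boundary values below are hypothetical; Sqoop computes them from the minimum and maximum of the split-by column), the two mappers might end up running queries like:
select customer_fname,customer_lname,customer_state,customer_city from customers where customer_id >= 1 AND customer_id < 6223
select customer_fname,customer_lname,customer_state,customer_city from customers where customer_id >= 6223 AND customer_id <= 12435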

What if we also have a where clause condition in our query? We still need
to use $CONDITIONS, as below:
sqoop import --connect jdbc:mysql://localhost:3306/retail_db --username root
--password cloudera --m 2 --split-by customer_id
--query 'select customer_fname,customer_lname,customer_state,customer_city from
customers where customer_state="TX" AND $CONDITIONS' --delete-target-dir --target-dir /user/cloudera/query_eg

Sqoop eval and change the file delimiter


Sqoop Eval:

Sometimes we may want to evaluate the data before importing it into Hadoop.
Sqoop eval helps us do this. Sqoop eval is basically connecting to the
database from a Sqoop command. All the SQL commands that you can fire on the
database can be executed using sqoop eval. When you put a SQL
query in sqoop eval, it will connect to the database, fire the query and
display the results on the edge node. We use eval instead of import and place
the query in --query.

Eg :sqoop eval --connect jdbc:mysql://localhost:3306/retail_db --username


root --password cloudera --query "Select * from customers limit 10";

Changing Import delimiter:


When Sqoop imports the data into Hadoop, by default it stores the data
comma-delimited. We can change this using --fields-terminated-by and
--lines-terminated-by.

Eg: sqoop import --connect jdbc:mysql://localhost:3306/retail_db --username root
--password cloudera --m 2 --table customers --split-by customer_id
--fields-terminated-by '|' --lines-terminated-by '\n' --delete-target-dir --target-dir /user/cloudera/customer

incremental import
Sqoop Incremental Import:
sqoop import --connect jdbc:mysql://localhost:3306/retail_db --username root
--password cloudera --m 2 --table customers --split-by customer_id
--target-dir /user/cloudera/customer1 --incremental append --check-column
customer_id --last-value 0

--incremental append : tells Sqoop that this is an incremental load
--check-column : which column you want to use to do the incremental load.
Here it is customer_id
--last-value : Sqoop will import data that is greater than this last-value. Here it
will import rows where customer_id > 0. This last value can be an integer, date or timestamp
but not a string, as Sqoop searches for data greater than that last value.

Next time, you need to change the last-value to import from the value greater
than the last imported one. So, every time you need to import new records, you must
remember the last-value from the previous import and use it in the sqoop
command. This is very manual. What if Sqoop does this job for us? What if
Sqoop remembers the last-value and imports from that last-value in the next
run? This is possible by creating a sqoop job.

We create a sqoop job with an import statement and give it a name. We then
execute the job, which internally executes the sqoop import, and the last-value
is then saved by Sqoop. When there are additional rows in the table,
you just execute the sqoop job again.

sqoop job --create myJob -- import --connect
jdbc:mysql://localhost:3306/retail_db --username root --password cloudera
--m 2 --table customers
--split-by customer_id
--target-dir /user/cloudera/customer2 --incremental append
--check-column customer_id --last-value 0
Note : myJob is the name of the sqoop job
-- import : There must be a space between -- and import.

Now to execute this job :
sqoop job --exec myJob

Now, the import command in the job will execute.

to get job details(metadata) : sqoop job --show <jobname>

Eg: sqoop job --show myJob

Now once a job is created, you cannot edit the job, i.e. you cannot edit the
import statement. You cannot view the import statement associated with a job.
If you want to make any changes, delete the job and recreate the job with the
updated details.

To Know the available jobs : sqoop job --list

Delete the job : sqoop job --delete <jobname>

Password Protection
Saving the password to a file and using that file for the password:
sqoop import --connect jdbc:mysql://localhost:3306/retail_db --username root
--password-file file:///home/cloudera/passfile --m 1 --table customers --target-dir /user/cloudera/data_import

Instead of storing the password in plain text, we can encrypt the password,
store the encrypted password and then refer to this encrypted password in
our sqoop command. As of Sqoop version 1.4.5, Sqoop supports the use of a
JKS to store passwords in encrypted form, so that you do not need to
store passwords in clear text in a file. This can be achieved using a Java
KeyStore. A Java KeyStore (JKS) is a repository of security certificates
(either authorization certificates or public key certificates) plus corresponding
private keys, used for instance in TLS encryption.

Below is the command to store the encrypted password:
hadoop credential create encryptpassword -provider jceks://hdfs/tmp/mypassword
It will ask you for a password. Give your password and it will encrypt the
password and store it in the HDFS path jceks://hdfs/tmp/mypassword. You can then
provide the alias encryptpassword instead of the original password or the
password file.
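You can verify that the alias was stored with a quick check against the same provider path:
hadoop credential list -provider jceks://hdfs/tmp/mypassword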

Now, we will use this encrypted password. We will pass the alias
encryptpassword we created above. Here is the command for this:
sqoop import -Dhadoop.security.credential.provider.path=jceks://hdfs/tmp/mypassword
--connect jdbc:mysql://localhost:3306/retail_db --username root
--password-alias encryptpassword --table customers -m 1
--target-dir /user/cloudera/data_import_encrypted

Using Last Modified


If there is an update in the database, say some column value changed, then it
should reflect in HDFS. This is like SCD Type 1: updating HDFS to be in sync
with the database. Sqoop will run 2 MapReduce jobs here: one MapReduce job to bring
the data (here only mappers run) and a second MapReduce job to compare and merge
the data (here reducers also run, because it has to compare data from different nodes
and consolidate it).

If you run only an import, only mappers are involved. There are no reducers because
there is no shuffling.

eg: sqoop import --connect jdbc:mysql://localhost:3306/retail_db --username root
--password-file file:///home/cloudera/passfile --m 1 --table orders
--target-dir /user/cloudera/data_import

--mysql database
select * from orders where order_id in (68871,68809,68817,68827);

Now let's update some values in the table:
update orders set order_status='test1' where order_id=68809; -- 2014-03-12
update orders set order_status='test2' where order_id=68817; -- 2014-03-27
update orders set order_status='test3' where order_id=68827; -- 2014-04-16
update orders set order_status='test4' where order_id=68871; -- 2014-06-28

Below is the query for last modified:
sqoop import --connect jdbc:mysql://localhost:3306/retail_db --username root
--password-file file:///home/cloudera/passfile --m 1 --table orders
--target-dir /user/cloudera/cust_mod --incremental lastmodified
--check-column order_date --last-value 2014-04-15 --merge-key order_id

Here, --check-column order_date : the column where Sqoop should pick data
for comparison

--last-value : the value of the check column from which data is picked

--merge-key : the key column used to merge/compare records

Import multiple File Formats

Import as Sequence File :
sqoop import --connect jdbc:mysql://localhost:3306/retail_db --username root
--password-file file:///home/cloudera/passfile --m 1 --table orders
--target-dir /user/cloudera/sequence_dir --as-sequencefile
Import as AVRO File:
sqoop import --connect jdbc:mysql://localhost:3306/retail_db --username root
--password-file file:///home/cloudera/passfile --m 1 --table orders
--target-dir /user/cloudera/avro_dir --as-avrodatafile

Import as Parquet File:
sqoop import --connect jdbc:mysql://localhost:3306/retail_db --username root
--password-file file:///home/cloudera/passfile --m 1 --table orders
--target-dir /user/cloudera/parquet_dir --as-parquetfile
Import as ORC : There is no equivalent of --as-avrodatafile, --as-sequencefile
or --as-parquetfile for ORC. In order to import as ORC, we need to
leverage Sqoop's HCatalog integration feature. HCatalog is a table storage
management tool for Hadoop that exposes the tabular data of the Hive
metastore to other Hadoop applications.
It enables users with different data processing tools (Pig, MapReduce) to
easily write data onto a grid. You can think of HCatalog as an API to access the
Hive metastore.
Importing as ORC involves the below two steps:
1) Create a Hive database with your desired HDFS warehouse location
2) Run the Sqoop import command to import from the RDBMS table into an HCatalog table
Let's sqoop import the orders table as ORC. We basically import
the data into some Hive database. We can ask Sqoop to create a Hive
table. So basically a Hive table will be created and the directory for that Hive
table will be in ORC format.
Here is the sqoop import command :
sqoop import --connect jdbc:mysql://localhost:3306/retail_db --username root
--password-file file:///home/cloudera/passfile --m 1 --table orders
--hcatalog-database test
--hcatalog-table orders
--create-hcatalog-table
--hcatalog-storage-stanza "stored as orcfile";
The additional options here are the --hcatalog options, where we give the
database name, the Hive table to be created and how the storage needs to
happen. So the Hive table will be created under the database test.
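As a quick sanity check (assuming the database and table names used above), you can confirm the storage format of the created table from the edge node:
hive -e "describe formatted test.orders;"
The storage information in the output should show the ORC input/output formats.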

Import multiple Tables


Below query imports all except for few tables:
sqoop import-all-tables --connect jdbc:mysql://localhost:3306/retail_db --
username root --password-file file:///home/cloudera/passfile --m 1
--exclude-tables categories,customers,orders --warehouse-dir
/user/cloudera/tables;

import-all-tables will import all the tables. But since I want to import only a
few tables, I have to skip the remaining tables. This we can do using --exclude-tables.

--warehouse-dir is similar to --target-dir, except that with --target-dir the part files are
written directly into the specified directory, whereas with --warehouse-dir a sub-folder is
created for each table and its part files go inside it. Since
we are importing multiple tables here, we use --warehouse-dir instead of
--target-dir.
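For illustration, after the import-all-tables command above (which excluded categories, customers and orders), the layout under the warehouse directory would look roughly like this, one sub-folder per table with its part files inside:
hdfs dfs -ls /user/cloudera/tables
/user/cloudera/tables/departments
/user/cloudera/tables/order_items
/user/cloudera/tables/products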

Handling Null during Import


Working with Null Data:

--create a table :test


CREATE TABLE test (
Id int,
Name varchar(255),
city varchar(255)
)
--insert 2 records
insert into test(Id,Name,city) values (100,'Kane',null);
insert into test(Id,Name,city) values (null,'Jacob','New York');

commit;

Let’s say if a colum has null in string datatype we want to display as nulldata
and if a column has null in non-string datype, we want do display it as 0. We
can use the below query for this purpose.

sqoop import --connect jdbc:mysql://localhost:3306/retail_db --username root
--password-file file:///home/cloudera/passfile
--table test -m 1 --target-dir /user/cloudera/test_null --null-string "nulldata" --null-non-string 0
Here,
--null-string will check all string columns and wherever there is null, will
replace it with nulldata
--null-non-string will check all non-string columns and wherever there is null,
will replace it with 0

Sqoop export
Sqoop Export Staging : We create a staging table with the same structure as the
target table. Sqoop will first export into the staging table and then from the staging
table into the target table. After this, the staging table data is deleted. The staging table
should be in the same database as the target table.
Below is the command :
sqoop export --connect jdbc:mysql://localhost:3306/retail_db --username root
--password-file file:///home/cloudera/passfile
--table test --staging-table test_stg --m 1 --export-dir /user/cloudera/test_null
Here, --staging-table : Give the staging table name
--table : Give the target table name

If data exists in staging table and we want to first truncate the staging table via
sqoop, use --clear-staging-table . Here is the query for the same:
sqoop export --connect jdbc:mysql://localhost:3306/retail_db --username root
--password-file file:///home/cloudera/passfile
--table test --staging-table test_stg --clear-staging-table --m 1 --export-dir
/user/cloudera/test_null

Sqoop Performance Tuning


Performance Tuning - Sqoop Imports:
1. Use multiple mappers.
2. Use the --direct option. If you give --direct, instead of connecting through the JDBC driver,
Sqoop will connect using the database's native utility. Here is an example for this:

sqoop import --connect jdbc:mysql://localhost:3306/retail_db --username root
--password cloudera --m 1 --table customers --delete-target-dir --target-dir /user/cloudera/data_import1 --direct

3. --fetch-size : how many rows each mapper should fetch at a time. By
default it is 1000 rows.

Performance Tuning - Sqoop Export:

In a Sqoop export, 10k rows are inserted per transaction by default. We
can tune this using the below 2 properties :

sqoop export -Dsqoop.export.statements.per.transaction=100
-Dsqoop.export.records.per.statement=100

sqoop.export.records.per.statement=100 --> how many rows are inserted per
insert statement; here 100 rows per statement
sqoop.export.statements.per.transaction=100 --> how many insert statements
are fired per transaction; here it is 100

So, 100 statements are fired and each statement has 100 rows. So in total
100*100 = 10k rows per transaction.
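Putting it together, a sketch of a full export command that sets both properties (it reuses the test table and export directory from the earlier export examples; adjust to your environment):
sqoop export -Dsqoop.export.records.per.statement=100 -Dsqoop.export.statements.per.transaction=100
--connect jdbc:mysql://localhost:3306/retail_db --username root --password-file file:///home/cloudera/passfile
--table test --m 1 --export-dir /user/cloudera/test_null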

HIVE
Hive-Data Preparation: Below commands are used for exporting the data
from mysql:
SELECT *
FROM orders
INTO OUTFILE '/var/tmp/orderss.csv'
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';

SELECT *
FROM categories
INTO OUTFILE '/var/tmp/categories.csv'
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';

SELECT *
FROM customers
INTO OUTFILE '/var/tmp/customers.csv'
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';

SELECT *
FROM departments
INTO OUTFILE '/var/tmp/departments.csv'
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';

SELECT *
FROM order_items
INTO OUTFILE '/var/tmp/order_items.csv'
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';

SELECT *
FROM products
INTO OUTFILE '/var/tmp/products.csv'
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
--create a folder called datasets in the home directory
mkdir datasets
--copy the files to the new folder
cp /var/tmp/categories.csv /home/cloudera/datasets/categories.csv
cp /var/tmp/customers.csv /home/cloudera/datasets/customers.csv
cp /var/tmp/departments.csv /home/cloudera/datasets/departments.csv
cp /var/tmp/order_items.csv /home/cloudera/datasets/order_items.csv
cp /var/tmp/orderss.csv /home/cloudera/datasets/orders.csv
cp /var/tmp/products.csv /home/cloudera/datasets/products.csv
Download the required files from the resources in this lecture

America.txt countries.txt India.txt

What is Hive
Hive is a data warehouse system which is used to analyze structured data. It is
built on the top of Hadoop. It was developed by Facebook.

Hive stores the metadata in the Hive metastore. The Hive metastore is simply a
relational database. It stores metadata related to the tables/schemas you create
to easily query big data stored in HDFS. When you create a new Hive table,
the information related to the schema (column names, data types) is stored in
the Hive metastore relational database. By default, Hive uses the Derby database for its
metadata storage.

Features of Hive :
•Hive is open source. We can use it for free
•HQL (Hive Query Language) is very similar to SQL
•Hive is schema on read
•Hive can be used as an ETL tool and can process huge amounts of data
•Hive supports partitioning and bucketing
•Hive is a warehouse tool designed for analytical purposes, not for transactional
purposes
•Can work with multiple file formats
•Can be plugged into BI tools for data visualization

Limitations of Hive:
•Hive is not designed for OLTP operations; it is used for OLAP
•It has limited subquery support
•The latency of Hive is a little high
•Support for updates and deletes is very minimal
•Not used for real-time queries as it takes a bit of time to give the results

Create and load a table in Hive
create database:
create database practise;
use practise;
Hive has a default directory : /user/hive/warehouse
Whenever a database is created, or a table is created or data is inserted, a
directory will be created in the above location . This is the default path.
Checking the default path:
hdfs dfs -ls /user/hive/warehouse

create table : customers


create table practise.customers
( customer_id INT,
customer_fname varchar(50),
customer_lname varchar(50),
customer_email varchar(50),
customer_password varchar(50),
customer_street varchar(255),
customer_city varchar(50),
customer_state varchar(50),
customer_zipcode varchar(50) )
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as textfile ;
describe table :
describe formatted practise.customers

Loading data from the edge node:
load data local inpath '/home/cloudera/datasets/customers.csv' into table customers;
--querying the table
select * from customers;
select * from customers where customer_id=1;
You can also load the data from Hadoop. Let's first copy the data from the edge
node to Hadoop. For this, first let's create directories in Hadoop. The below
commands are used to create the directories:
hdfs dfs -mkdir /user/cloudera/datasets

hdfs dfs -mkdir /user/cloudera/datasets/categories

hdfs dfs -mkdir /user/cloudera/datasets/customers

hdfs dfs -mkdir /user/cloudera/datasets/departments

hdfs dfs -mkdir /user/cloudera/datasets/order_items

hdfs dfs -mkdir /user/cloudera/datasets/orders

hdfs dfs -mkdir /user/cloudera/datasets/products

Below commands are for copying data from local to HDFS:

hdfs dfs -copyFromLocal /home/cloudera/datasets/categories.csv /user/cloudera/datasets/categories/categories.csv

hdfs dfs -copyFromLocal /home/cloudera/datasets/customers.csv /user/cloudera/datasets/customers/customers.csv

hdfs dfs -copyFromLocal /home/cloudera/datasets/departments.csv /user/cloudera/datasets/departments/departments.csv

hdfs dfs -copyFromLocal /home/cloudera/datasets/order_items.csv /user/cloudera/datasets/order_items/order_items.csv

hdfs dfs -copyFromLocal /home/cloudera/datasets/orders.csv /user/cloudera/datasets/orders/orders.csv

hdfs dfs -copyFromLocal /home/cloudera/datasets/products.csv /user/cloudera/datasets/products/products.csv

--load the data into customers table


load data inpath '/user/cloudera/datasets/customers.csv' into table customers;

Note: If the data is being loaded from the edge node, the file is copied to the Hive
warehouse; if it is loaded from Hadoop, the file is moved to the warehouse instead of
being copied.

We have mentioned the columns and their structure in the Hive table creation. If the
file that we are loading using inpath does not have the same structure, say it
has fewer/more columns or a column type mismatch, it will still be
loaded, and while we query through Hive, wherever there is a mismatch on a column
it will show null instead of failing. This is because Hive enforces the
schema on read.
Let's try that. Let's create the table order_eg:
create table practise.order_eg
(
order_id INT
)
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as textfile ;
lets load the data from edge node:
load data local inpath '/home/cloudera/datasets/orders.csv' into table order_eg;
select * from order_eg;
If we drop this table, the entire schema and the files will be deleted. This is a managed
table.
drop table order_eg;
hdfs dfs -ls /user/hive/warehouse/
Let's create one more table: We will use some other location instead of the default
hive location.
create table practise.categories
(
category_id INT,
category_department_id INT,
category_name varchar(50)
)
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as textfile
location '/user/cloudera/datasets/category';
If the location does not exist, Hive will create it.
Load data into table:
load data local inpath '/home/cloudera/datasets/categories.csv' into table categories;
You can also create a table with an existing table:
create table category_eg as select * from categories;
--inserting into one table from another
insert into table category_eg select * from categories; -- appends
select count(*) from category_eg;
If you want to overwrite, use:
insert overwrite table category_eg select * from categories; -- overwrites

Hive Table Types
• Managed Table : When the table is dropped, the backend directory associated with the table is
deleted as well. Use it for staging purposes.
• External Table : When the table is dropped, the backend directory associated with the table
still exists. Use it for target systems.
--create an external table:
create external table practise.customers
(
customer_id INT,
customer_fname varchar(50),
customer_lname varchar(50),
customer_email varchar(50),
customer_password varchar(50),
customer_street varchar(255),
customer_city varchar(50),
customer_state varchar(50),
customer_zipcode varchar(50) )
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as textfile
location '/user/cloudera/datasets/customers/';
describe formatted customers;
--drop a table:
drop table customers;
hdfs dfs -ls /user/cloudera/datasets/customers/;
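As a side note (table name is illustrative), an existing table can also be switched between managed and external without recreating it, by changing the EXTERNAL table property:
alter table customers set tblproperties('EXTERNAL'='TRUE');  -- managed to external
alter table customers set tblproperties('EXTERNAL'='FALSE'); -- external to managed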
Hive Partitions
Partition is a way of dividing the data in a table into related parts using a
partition column.
Types:
•Static Partition : Partitions are created whenever user specifies
•Dynamic Partition : Partitions are created dynamically
Hive Static Partition:
Static Load Partition : Here, we specify the partition name to which the data
needs to be loaded
Static Insert Partition : Here, we first create a non partitioned table and then
insert the data from this non partitioned table into a partitioned table
Download the files that are uploaded in the resources section for this lecture.
We will be using those files. First, let's create some directories:
mkdir /home/cloudera/country
mkdir /home/cloudera/country/India
mkdir /home/cloudera/country/America
mkdir /home/cloudera/country/countries
create external table normal_table
(
state varchar(255),
capital varchar(255),
language varchar(255)
)
row format delimited
fields terminated by ','

stored as textfile
location '/user/cloudera/normal_dir';
This table is without partition.
load the data:
load data local inpath '/home/cloudera/country/India' into table normal_table ;
load data local inpath '/home/cloudera/country/America' into table normal_table ;
hdfs dfs -ls /user/cloudera/normal_dir

--create a partitioned table


create external table partitioned_table
(
state varchar(255),
capital varchar(255),
language varchar(255)
)
partitioned by (country varchar(100))
row format delimited
fields terminated by ','
stored as textfile
location '/user/cloudera/partitioned_dir';
--loading the data
load data local inpath '/home/cloudera/country/India' into table partitioned_table
partition(country='IND');
load data local inpath '/home/cloudera/country/America' into table partitioned_table
partition(country='US');
check the directory:
hdfs dfs -ls /user/cloudera/partitioned_dir

--querying the table:
select * from partitioned_table;
select * from partitioned_table where country='IND';
Static Insert Partition: Inserting the data from a non-partitioned table into a partitioned
table. We have a single file, countries, that contains both the America and India data.
We want to insert the individual countries into their own partition directories in HDFS. Load will not
work as we have a single file, so static load won't work. In this case we use static insert
partition. What we do here is first create a non-partitioned table and load all
the data into this table. Now this table has all the data; this table will now be our
source. We will then create our partitioned table, and from the non-partitioned table
we will insert into the partitioned table by specifying the partition name. By this we can
create the partitioned table with the required partitions. Again, since we are manually
providing the partition here, this is static, and since we are using insert instead of load,
this is static insert partition. Let's see that in action.
First we must create a non-partitioned table to load this data into. Let's create a non-partitioned table.
create external table non_partitioned_table
(
state varchar(255),
capital varchar(255),
language varchar(255),
country varchar(255)
)
row format delimited
fields terminated by ','
stored as textfile
location '/user/cloudera/non_partitioned_dir1';

Now let's load the file into this location:
load data local inpath '/home/cloudera/country/countries' into table non_partitioned_table;
create a partitioned table:
create external table partitioned_table_country
(
state varchar(255),
capital varchar(255),
language varchar(255) )
partitioned by (country varchar(100))
row format delimited
fields terminated by ','
stored as textfile
location '/user/cloudera/partitioned_dir_country1';
We can also insert from a non-partitioned table into a partitioned table as :
insert into partitioned_table_country partition(country='IND')
select state,capital,language
from non_partitioned_table where country='IND';
We have to specify the column names here. * will not work because the table we
want to insert into has 3 columns, but when you use * it will generate 4 columns, the
4th column being your partition column.
similarly,
insert into partitioned_table_country partition(country='US')
select state,capital,language
from non_partitioned_table where country='US';

Dynamic Load Partition:

First data is loaded into a non-partitioned table and then the data from the non-partitioned
table is inserted into the partitioned table dynamically.
This is very similar to static insert. The only difference is that in static insert
we tell Hive the partition name the data needs to go to, while in dynamic
partitioning those partitions are automatically created from the query result.
Let's jump into an example to understand better.
Let's create a normal table:
create external table source_country
(
state varchar(255),
capital varchar(255),
language varchar(255),
country varchar(255)
)
row format delimited
fields terminated by ','
stored as textfile
location '/user/cloudera/source_country';

Now lets load the file into this location:


load data local inpath '/home/cloudera/country/countries' into table source_country;

Now lets create partitioned table:


create external table target_country
(
state varchar(255),

capital varchar(255),
language varchar(255)
)
partitioned by (country varchar(100))
row format delimited
fields terminated by ','
stored as textfile
location '/user/cloudera/target_country';

Let's insert data from the non-partitioned table into the partitioned table. In our source
table, we have a country column. In the target table, we have the partition column
country. Dynamic partitioning will group all the country values from the source and
create a directory per group. Now while inserting from source to target, we
need to tell Hive the column that needs to be used for the partition. We tell it by
specifying that as the last column in the select statement.
insert into target_country partition(country)
select state,capital,language,country
from source_country;
Now this will give an error. By default, dynamic partitions are not enabled. So set the
property:
set hive.exec.dynamic.partition.mode=nonstrict;
Now execute the query again.
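To confirm what was created, you can list the partitions of the target table; one partition should appear per distinct country value found in the source data:
show partitions target_country;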
We can create sub-partitions as well. We have the non-partitioned table
source_country, so let's use this table. Now we will create a partitioned table:
create external table target_country_sub
(
state varchar(255),

capital varchar(255),
language varchar(255)
)
partitioned by (country varchar(100),lang varchar(100))
row format delimited
fields terminated by ','
stored as textfile
location '/user/cloudera/target_country_sub';
insert into target_country_sub partition(country,lang)
select state,capital,language,country,language
from source_country;
hdfs dfs -ls /user/cloudera/target_country_sub
By default, each MapReduce job will allow 100 partitions (i.e. 100 folders, including
partitions and sub-partitions) to be created. If you want to change this property:
set hive.exec.max.dynamic.partitions=500;

Hive Use Case


Use case : We have a table in mysql: customers. Import the data as parquet
file and then create a hive table and query from it.
Below sqoop command will import the data from mysql into parquet format
file:
sqoop import --connect jdbc:mysql://localhost:3306/retail_db --username
root --password-file file:///home/cloudera/passfile --m 1 --table customers
--target-dir /user/cloudera/customers_dir_order
--as-parquetfile

--create a hive external table
create external table customers
(
customer_id INT,
customer_fname varchar(50),
customer_lname varchar(50),
customer_email varchar(50),
customer_password varchar(50),
customer_street varchar(255),
customer_city varchar(50),
customer_state varchar(50),
customer_zipcode varchar(50)
)
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as parquet
location '/user/cloudera/customers_dir_order';
Now you can query this hive table.

Hive Buckets
Bucketing in hive is the concept of breaking data down into ranges, which are
known as buckets, to give extra structure to the data so it may be used for
more efficient queries.
For bucketing to happen, we need to enforce below properties:
set hive.exec.dynamic.partition.mode=nonstrict
set hive.enforce.bucketing=true
--create a non partitioned table
create external table orders
(
order_id int,
order_date timestamp,
order_customer_id int,
order_status varchar(50)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/cloudera/datasets/orders';
Now lets create a partitioned table. We will create buckets for this table.
create table orders_bucket
(
order_id int,

order_date timestamp,
order_customer_id int
)
partitioned by (order_status STRING)
clustered BY (order_id) INTO 5 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/cloudera/orders_bucket';
--insert from the non-partitioned table into the partitioned, bucketed table
insert into orders_bucket partition(order_status) select
order_id,order_date,order_customer_id,order_status from orders;

check the data:
hdfs dfs -ls /user/cloudera/orders_bucket/order_status=COMPLETE
Note :
1) When you set the properties, make sure you type them correctly. If there is a
spelling mistake, it does not give any error. So if you believe certain
properties are not working, double check your properties for spelling
mistakes.
2) If you don't enforce bucketing, there will be no error; rather, all the data
will be under one file.
You can disable bucketing with this property:
set hive.enforce.bucketing=false
You can also create bucketing without partitioning.
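One payoff of bucketing is that queries can sample or target individual buckets instead of scanning everything. A small sketch using the table above (the bucket number is illustrative):
select * from orders_bucket tablesample(bucket 1 out of 5 on order_id);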
Schema Evolution in Hive
We have this table: "test" created earlier in our course. Lets use this table.
lets delete the data and insert new data here.
insert into test(Id,Name,city) values (100,'John','New York');
insert into test(Id,Name,city) values (101,'Pooja','Mumbai');
insert into test(Id,Name,city) values (103,'Michael','London');
insert into test(Id,Name,city) values (104,'Jessy','Cochin');
Let's sqoop import this. This data will be incrementally loaded, hence we use a
sqoop job for this purpose.
sqoop job --create myJob1 -- import --connect
jdbc:mysql://localhost:3306/retail_db --username root --password cloudera -
-m 1 --table test --split-by Id --target-dir /user/cloudera/test3 --incremental
append --check-column Id --last-value 0

sqoop job --exec myJob1

Check the data:


hdfs dfs -ls /user/cloudera/test3
Now lets create a hive table on this location. We will create a external table.
create external table test_1
(
id int,
name varchar(50),
city varchar(50)

)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/cloudera/test3';
select * from test_1;
This is good. Now lets insert two more records to mysql:
insert into test(Id,Name,city) values (105,'Robin','Mumbai');
insert into test(Id,Name,city) values (106,'Anjali','Cochin');
lets execute our sqoop job:
sqoop job --exec myJob1
Check the data:
hdfs dfs -ls /user/cloudera/test3
Query hive table:
select * from test_1;
This is working as expected. Now imagine that at the source they have dropped the
column Name. Now they are only sending Id and city.
alter table test drop column Name;
insert into test(Id,city) values (107,'London');
insert into test(Id,city) values (108,'Hyderabad');
Now lets run our sqoop job:
Check the data:

hdfs dfs -ls /user/cloudera/test3
Query hive table:
select * from test_1;
You will observe that the data is not in the proper order. It will not throw any
error, but you will see a data mismatch here.
Now, to handle the schema changes, instead of using a text file we can
store the data in an Avro file.
Let's redo this again. Let's use the table test_stg.
insert into test_stg(Id,Name,city) values (100,'John','New York');
insert into test_stg(Id,Name,city) values (101,'Pooja','Mumbai');
insert into test_stg(Id,Name,city) values (103,'Michael','London');
insert into test_stg(Id,Name,city) values (104,'Jessy','Cochin');

create a sqoop job:


sqoop job --create myJob4 -- import --connect
jdbc:mysql://localhost:3306/retail_db --username root --password cloudera -
-m 1 --table test_stg --split-by Id --target-dir /user/cloudera/test_stg1 --
incremental append --check-column Id --last-value 0 --as-avrodatafile
sqoop job --exec myJob4
During a Sqoop import as Avro, a tablename.avsc file is generated which
contains the schema. Here it is /home/cloudera/test_stg.avsc.
Now while creating the Hive table, instead of giving column names, we pass the
avsc file as its schema. For any change in the schema, we just update this avsc
file, and therefore the Hive table which refers to this avsc file will have the
updated schema.
Let's push this schema to the HDFS location:
hdfs dfs -copyFromLocal /home/cloudera/test_stg.avsc /user/cloudera/test_stg.avsc

now lets create the hive table:


create external table test_stg1
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS AVRO
LOCATION '/user/cloudera/test_stg1'
TBLPROPERTIES ('avro.schema.url'='/user/cloudera/test_stg.avsc');
Now lets drop the column on source side:
alter table test_stg drop column Name;
insert into test_stg(Id,city) values (107,'London');
insert into test_stg(Id,city) values (108,'Hyderabad');
Now what we have to do is very simple: we just update the .avsc file with this
schema change.
vi /home/cloudera/test_stg.avsc
Here the column Name is dropped, so we delete name from this file.
Push the file to HDFS, overwriting the earlier copy:
hdfs dfs -copyFromLocal -f /home/cloudera/test_stg.avsc /user/cloudera/test_stg.avsc

Now run the sqoop job:


sqoop job --exec myJob4
check data:
hdfs dfs -ls /user/cloudera/test_stg1
hive: select * from test_stg1;
Data will be populated properly
Execute hive queries using a script
You can execute a Hive command from the edge node as below:
hive -e "select * from test_stg1";
To execute Hive commands from a file, put all the commands in a file, say file1.hql.
Create a file file1.hql and add the below contents to that file:
create database demo1;
use demo1;
create table demo1.customers
(
customer_id INT,
customer_fname varchar(50),
customer_lname varchar(50),
customer_email varchar(50),
customer_password varchar(50),
customer_street varchar(255),
customer_city varchar(50),
customer_state varchar(50),
customer_zipcode varchar(50)
)

row format delimited
fields terminated by ','
lines terminated by '\n'
stored as textfile ;
describe formatted demo1.customers;
load data local inpath '/home/cloudera/datasets/customers.csv' into table
customers;
select * from customers;
Now run this file as :
hive -f file1.hql
You will observe that all the commands within the file are executed one by
one.

Working with Dates in Hive


Default Format : yyyy-MM-dd
Converting from one format to another :
select from_unixtime(unix_timestamp('22-06-2020','dd-MM-yyyy'),'yyyy/MM/dd');
1) SELECT UNIX_TIMESTAMP(); --> returns the current time as the number of seconds
elapsed since the Unix epoch (1970-01-01 00:00:00 UTC)
SELECT UNIX_TIMESTAMP('1970-01-01 00:00:00');
2) select from_unixtime(unix_timestamp())
select from_unixtime(unix_timestamp('1970-01-01 00:00:00'))
3) SELECT TO_DATE('2000-01-01 10:20:30')
4) select year('2020-02-04');
select month('2020-02-04')
select day('2020-02-04')
select year(from_unixtime(unix_timestamp()));
5) select DATEDIFF('2000-03-10', '2000-03-01') ;
6) select DATE_ADD('2000-03-01', 5) ;
7) select DATE_SUB('2000-03-01', 5) ;
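These functions can be combined. For example, to count the days elapsed from a given date until today:
select datediff(from_unixtime(unix_timestamp()), '2020-01-01');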

Joins in Hive
Hive supports SQL joins like inner join, left outer, right outer and full outer
joins. In addition to these, Hive supports a few join strategies.
Let's first create a table to use for the join concepts.
create external table order_items
(
order_item_id int,
order_item_order_id int,
order_item_product_id int,
order_item_quantity int,
order_item_subtotal double,
order_item_product_price double
)

ROW FORMAT DELIMITED


FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/cloudera/datasets/order_items';
--Inner Join Example
select
o.order_id,
o.order_status,
od.order_item_product_price
from orders o inner join order_items od
on (o.order_id=od.order_item_order_id);
If the query fails, set the following property:
set hive.auto.convert.join=false;
Now for the join to occur, we need to bring either the orders records (from both
datanodes) to the order_items datanodes, or bring the order_items records to the
orders datanodes. This is called shuffling and is a very costly
operation. Which data is brought to which datanode during shuffling is
unknown.
Apart from the normal SQL joins, Hive has a few more join strategies.
1) Map Join :
The small table is copied to all nodes of the bigger table. In a map-based join, the
smaller table will go and sit on all the data nodes of the bigger table. The join will
now happen on the same data nodes, so shuffling won't happen during the join.
The only shuffling is when the small table's data goes and sits on the other data nodes.
eg:
select
/*+ MAPJOIN(o) */ o.order_id,
o.order_status,
od.order_item_product_price
from orders o
inner join order_items od

on (o.order_id=od.order_item_order_id);
2) Bucket Map Join
For a bucket map join to take place, it has to satisfy several criteria.
First, both join tables must be bucketed, and they must be bucketed on the
join key. Also, the numbers of buckets of the two tables must be multiples of
each other. For example, if table1 has 2 buckets then table2 must have
buckets in multiples of 2, i.e. it should have 2, 4, 6, 8, ... buckets (see the sketch
after the query below).
And we also need to set the below properties:
set hive.optimize.bucketmapjoin = true;
set hive.enforce.bucketing = true;
eg:
SELECT
/*+ MAPJOIN(table2) */ table1.emp_id,
table1.emp_name,
table2.job_title
FROM table1 inner JOIN table2
ON table1.emp_id = table2.emp_id;

Here, table2 is the smaller table
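A small sketch of DDL that would satisfy these criteria (table names, columns and bucket counts are illustrative; 2 and 4 are multiples of each other and both tables are bucketed on the join key emp_id):
create table table1 (emp_id int, emp_name string)
clustered by (emp_id) into 2 buckets
stored as textfile;
create table table2 (emp_id int, job_title string)
clustered by (emp_id) into 4 buckets
stored as textfile;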


3) Sort Merge Bucket Join:
This is similar to bucket join. In Bucket map join, data is not sorted
before join. If you want to sort the data before join, use Sort Merge Bucket
join
Below properties needs to be set for Sort merge bucket join:
set hive.auto.convert.sortmerge.join=true;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.auto.convert.sortmerge.join.noconditionaltask=true;
4) Skew Join : To use skew join, you need to setup below properties:
set hive.optimize.skewjoin=true;
set hive.skewjoin.key=500000;
set hive.skewjoin.mapjoin.map.tasks=10000;
set hive.skewjoin.mapjoin.min.split=33554432;

MSCK Repair
In all our examples, we have created a partitioned table and then loaded
data into those partitions. What if we already have the data in hdfs and we
want to create a partitioned table on that data ? We haven't tried this yet.
We have created a non partitioned external table on the existing data and
we were able to query data from hive. But we never tried with a partitioned
table. Let's try that.
We have already created a partitioned table in earlier classes. Let's make
use of one of them.
let's use partitioned_table;
describe formatted partitioned_table;
We have the data here : /user/cloudera/partitioned_dir
So lets create another table on top of this data:
create external table partitioned_table_existing
(
state varchar(255),
capital varchar(255),
language varchar(255)
)
partitioned by (country varchar(100))
row format delimited
fields terminated by ','
stored as textfile
location '/user/cloudera/partitioned_dir';

lets query the table:
select * from partitioned_table_existing; -- no data. Why ?
When we create a partitioned table and load data into it, the partitions are generated and stored in
the Hive metastore. If we create a partitioned table on top of existing
data, the partitions are not automatically registered in the Hive metastore. We
need to recover the partitions.
For this we use:
msck repair table partitioned_table_existing;
Now query the table. You should see the data:
select * from partitioned_table_existing;
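Alternatively, if only a few partitions are missing, they can be registered one at a time (the location below is illustrative and should point at the existing partition directory):
alter table partitioned_table_existing add partition (country='IND') location '/user/cloudera/partitioned_dir/country=IND';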

Performance Tuning in Hive
Hive Performance Tuning:
1. Partitions: Using partitions will minimize scanning of the entire data,
which improves query performance.
2. Bucketing: Use bucketing when you are going to query on a particular
column frequently.
3. Map joins, skew joins: Use the appropriate joins based on the data.
4. Vectorization: By default, Hive will process 1 row at a time, i.e. 1 mapper
for 1 row. If you enable vectorization, a Hive mapper will process 1024
rows at a time. The disadvantage here is that it occupies a lot of
memory.
set hive.vectorized.execution.enabled=true;
5. Hive parallel execution: If you have queries that are independent of
each other, you can make them run in parallel by enabling Hive
parallel execution (see the properties below).
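The properties that control parallel execution are shown below (a sketch; 8 is just an example thread count):
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8;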

Hive vs SQL
1) Hive is schema on read while SQL is schema on write
2) Hive is for analytics while SQL is for transactional workloads
3) Hive is a data warehouse and SQL is a database
4) SQL supports only structured data while Hive supports structured and
semi-structured data

Hive+notes.txt Hive+Queries.txt hivebuiltinfunctions.txt

Installing and setting up Spark and Scala -
Download links
Download the software from the below links and follow the instructions from the video
session for setting up Scala and Spark.
1. Download JDK 1.8:
Paste the below link in your browser:
https://www.oracle.com/java/technologies/javase/javase8-archive-downloads.html
Now search for jdk-8u172-windows-x64.exe and click on it to download.
2. Download Spark:
Paste the below link in your browser to download.
https://archive.apache.org/dist/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.6.tgz
3. Download Scala (Scala IDE):
Paste the below link in your browser to download.
http://downloads.typesafe.com/scalaide-pack/4.7.0-vfinal-oxygen-212-20170929/scala-SDK-4.7.0-vfinal-2.12-win32.win32.x86_64.zip
4. Winutils
Paste the below link in your browser to download.
https://github.com/steveloughran/winutils/raw/master/hadoop-2.7.1/bin/winutils.exe

SCALA
Scala Collections :
Scala collections are containers that hold a sequenced, linear set of elements.
Scala collections can be mutable or immutable.
scala.collection.mutable contains all mutable collections. If you want to use
the mutable collections, you must import this package.
scala.collection.immutable contains all immutable collections. Scala
imports this package by default.
Here are the collection types:
Set : Set stores unique elements. It does not maintain any order.
Elements can be of different datatypes.
Eg: val games = Set("cricket", "Football", "Hockey")

Seq: Represents indexed sequences that are guaranteed to be
immutable.
Eg: var seq: Seq[Int] = Seq(52, 85, 1, 8, 3, 2, 7)
You can iterate through the elements using a for loop. Seq(...) returns a List by default.

List: Stores ordered elements. It can take a combination of different types.
Eg: val games=List("cricket","football","hockey")

Vector: Vector is a general-purpose, immutable data structure. It
provides random access to elements. It is good for large collections of
elements.

Eg : var vector2 = Vector(5,2,6,3)
Queue: Queue implements a data structure that allows inserting and
retrieving elements in a first-in-first-out (FIFO) manner.
In Scala, Queue is implemented as a pair of lists: one is used to insert
elements and the other to hold removed elements. Elements are added to
the first list and removed from the second list.
Eg : var queue = Queue(1,5,6,2,3,9,5,2,5)

Map: Map is used to store elements as key-value pairs. In Scala, you can
create a map in two ways: either by using comma-separated pairs or by
using the arrow (->) operator.
var map = Map(("A","Apple"),("B","Ball"))
var map2 = Map("A"->"Apple","B"->"Ball")
Using map.keys we get the keys. Using map.values we get the values.

Tuple: A tuple is a collection of elements in ordered form. If there is no
element present, it is called an empty tuple. It can hold any datatypes.
val t1=(1,2,"Robin",222.5)
Access the elements using dot notation, e.g. to get the first element: t1._1
To iterate through a tuple we use productIterator,
so it will be t1.productIterator.foreach(println)
Array: Array is a collection of elements of the same datatype. Elements are
accessed using an index.

var arr = Array(1,2,3,4,5)

Collection Methods
Here are some of the collection Methods:
1. map: Takes a function and applies that function to every element in
the collection.
Eg: val mul=num.map(x=>x*2)

2. flatMap: Does what map does and also flattens the elements.
Eg: val fl=list.flatMap(x=>x.split("~"))

3. filter: Filters the elements.
Eg: val fil=list.filter(x=>x%2==0)

4. count: count takes a filter condition and gives the count of
elements that satisfy the condition.
Eg: val cc=num.count(x=>x%2==0)

5. exists: Returns true if a particular condition is met for at least one element.
Eg: val exist_even=num.exists(x=>x%2==0)

6. foreach: Loops through the collection.
Eg: list.foreach(println)

7. partition: Groups the elements. You specify a condition; elements
satisfying that condition are grouped into one partition and the
remaining elements into another group.
val part_even=num.partition(x=>x%2==0)
It creates two lists: one with the even elements and the other with the
non-even elements.

8. reduce, reduceLeft, reduceRight:
Reduce methods are applied on a collection. You can apply binary
operations on the collection. They take 2 elements of the collection at a
time and apply the operation.
val num=List(1,2,3,4)
Eg: num.reduceLeft(_+_)
Eg: num.reduceRight(_+_)

9. foldLeft, foldRight:
foldLeft and foldRight do what reduceLeft and reduceRight do.
The only difference is that foldLeft and foldRight take an initial value.
val fold_left=name.foldLeft("robin")(_+_)

10. scanLeft, scanRight:
Same as fold. The basic difference is that fold gives only the final output,
while scan gives the intermediate results as well as the output.
val scan_right=name.scanRight("ron")(_+_)
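
To tie these methods together, here is a small, self-contained sketch you can run;
the value names num and list are only illustrative, and the expected outputs are
shown as comments:

object CollectionMethodsDemo {
  def main(args: Array[String]): Unit = {
    val num = List(1, 2, 3, 4)
    val list = List("a~b", "c~d")

    println(num.map(x => x * 2))              // List(2, 4, 6, 8)
    println(list.flatMap(x => x.split("~")))  // List(a, b, c, d)
    println(num.filter(x => x % 2 == 0))      // List(2, 4)
    println(num.count(x => x % 2 == 0))       // 2
    println(num.exists(x => x % 2 == 0))      // true
    num.foreach(println)                      // prints 1 2 3 4, one per line
    println(num.partition(x => x % 2 == 0))   // (List(2, 4),List(1, 3))
    println(num.reduceLeft(_ + _))            // 10
    println(num.foldLeft(100)(_ + _))         // 110
    println(num.scanLeft(0)(_ + _))           // List(0, 1, 3, 6, 10)
  }
}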

SPARK
RDD Basics - Reading and Writing a File
Download the BigBasket file by clicking on the below link:
https://www.kaggle.com/datasets/surajjha101/bigbasket-entire-product-list-28k-datapoints?select=BigBasket+Products.csv
Once downloaded and saved, follow the instructions in the video
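
As a rough sketch of what the session covers — reading a file into an RDD and writing
it back out — the following assumes the downloaded CSV was saved to an illustrative
local path; adjust the paths to your machine:

package sparkPractise
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object rddReadWrite {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rddBasics").setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.setLogLevel("Error")

    // read the file into an RDD of lines (path is illustrative)
    val rdd = sc.textFile("file:///C:/data/input/BigBasketProducts.csv")
    println("Number of lines: " + rdd.count())
    rdd.take(5).foreach(println)

    // write the RDD back out as text (the output folder must not already exist)
    rdd.saveAsTextFile("file:///C:/data/output/bigbasket_copy")
  }
}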

Use Case - Analyze the Log Data


Download the log file from the below link and follow the instruction given in the
video.

https://github.com/logpai/loghub/blob/master/Hadoop/Hadoop_2k.log

sales.txt words.txt
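
As a rough sketch of the kind of analysis done in the video, the snippet below counts
log lines per log level. It assumes the level (INFO/WARN/ERROR) is the third
whitespace-separated token, which is the usual layout of Hadoop log lines; adjust the
parsing and the input path if your file differs:

package sparkPractise
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object logAnalysis {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("logAnalysis").setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.setLogLevel("Error")

    // path is illustrative; point it to where you saved Hadoop_2k.log
    val logs = sc.textFile("file:///C:/data/input/Hadoop_2k.log")

    // split each line on whitespace and count occurrences of the level token
    val levelCounts = logs
      .map(line => line.split("\\s+"))
      .filter(tokens => tokens.length > 2)
      .map(tokens => (tokens(2), 1))
      .reduceByKey(_ + _)

    levelCounts.collect().foreach { case (level, count) => println(level + " : " + count) }
  }
}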

Spark Seamless Dataframe- Reading and Writing


Download the dataset from Kaggle for this session. Paste the below link in the
browser to download the dataset.

https://www.kaggle.com/datasets/rohitsahoo/employee

Once the file is downloaded, follow the instructions provided in the video session
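
As a minimal sketch of reading and writing a dataframe (the file name and paths are
illustrative; use wherever you saved the downloaded employee dataset):

package sparkPractise
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object dfReadWrite {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("dfBasics").setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.setLogLevel("Error")
    val spark = SparkSession.builder().getOrCreate()

    // read the CSV with a header row, letting Spark infer the column types
    val df = spark.read.format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("file:///C:/data/input/employee.csv")

    df.printSchema()
    df.show(5, false)

    // write the same data back out as Parquet
    df.write.format("parquet")
      .mode("overwrite")
      .save("file:///C:/data/output/employee_parquet")
  }
}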

Reading and Writing XML Data


Download the sample xml file from below link:
https://docs.microsoft.com/en-us/previous-versions/windows/desktop/ms762271(v=vs.85)

The spark XML jars are already uploaded to the Resource section in the video
lecture. The same can be downloaded using the below links:

commons-io-2.8.0.jar :
https://mvnrepository.com/artifact/commons-io/commons-io/2.8.0
txw2-2.3.3.jar :
https://mvnrepository.com/artifact/org.glassfish.jaxb/txw2/2.3.3
xmlschema-core-2.2.5.jar :
https://mvnrepository.com/artifact/org.apache.ws.xmlschema/xmlschema-core/2.2.5
Read xml jar : https://mvnrepository.com/artifact/com.databricks/spark-xml_2.11/0.11.0
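
As a minimal sketch of reading and writing XML with the spark-xml package (the jars
above must be added to the build path). The rowTag value "book" matches the repeating
element in the sample books.xml; change it and the paths for your own file:

package sparkPractise
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object xmlReadWrite {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("xmlDemo").setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.setLogLevel("Error")
    val spark = SparkSession.builder().getOrCreate()

    // each <book> element becomes one row of the dataframe
    val df = spark.read.format("com.databricks.spark.xml")
      .option("rowTag", "book")
      .load("file:///C:/data/input/books.xml")

    df.printSchema()
    df.show(false)

    // write the dataframe back out as XML
    df.write.format("com.databricks.spark.xml")
      .option("rootTag", "catalog")
      .option("rowTag", "book")
      .mode("overwrite")
      .save("file:///C:/data/output/books_out")
  }
}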

Let's explore more transformations


Download the dataset for this session from the below link:

https://www.kaggle.com/datasets/shivamb/bank-customer-segmentation

Below are the functions we have worked with in this session (a combined sketch follows the list):

1. select: Used to select the required columns.

2. selectExpr: Does what select does. In addition, it helps in applying SQL
transformations on the columns.

3. withColumn: Similar to selectExpr, it allows you to apply a transformation
on the selected column while retaining all other columns in the dataframe.

4. withColumnRenamed: Used to rename a column.

5. case when: Acts like a CASE statement in SQL, or if/then/else in a
programming language.

6. drop: Drops the column from the dataframe.
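
Here is a combined sketch of the functions above. The input path and the column names
(TransactionID, CustGender, TransactionAmount) are only illustrative; substitute the
actual columns from the downloaded dataset:

package sparkPractise
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object exploreTransformations {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("transformations").setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.setLogLevel("Error")
    val spark = SparkSession.builder().getOrCreate()

    val df = spark.read.format("csv").option("header", "true")
      .load("file:///C:/data/input/bank_transactions.csv")

    val result = df
      .select("TransactionID", "CustGender", "TransactionAmount")                               // select
      .selectExpr("TransactionID", "CustGender", "cast(TransactionAmount as double) as amount") // selectExpr
      .withColumn("amount_with_tax", expr("amount * 1.18"))                                     // withColumn
      .withColumnRenamed("CustGender", "gender")                                                // withColumnRenamed
      .withColumn("amount_band", expr("case when amount > 1000 then 'HIGH' else 'LOW' end"))    // case when
      .drop("amount_with_tax")                                                                  // drop

    result.show(false)
  }
}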

Working with Fixed Width File
The files required for this are uploaded in the Resources section in the Video
Lecture. Download the files and follow the instructions as per the video.
Below is the complete code for Fixed Width File scenario:
package sparkPractise
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row
object fixedWidth {
def dml(column1:Array[Row]):List[String]=
{
var final_dml:String=""
var last_dml:List[String]=Nil
for(i<-column1)
{
val data=i.mkString(",")
val columnName=data.split(",")(0)
val pos=data.split(",")(1)
val len=data.split(",")(2)
final_dml=s"substring(fixed_data,$pos,$len) as $columnName"
last_dml=last_dml :+ final_dml
//println(last_dml)

}
last_dml
}

def main(args:Array[String]):Unit={
val conf=new SparkConf().setAppName("para").setMaster("local[*]")
val sc=new SparkContext(conf)
sc.setLogLevel("Error")
val spark=SparkSession.builder().getOrCreate()
import spark.implicits._
val schemaFile=spark.read.format("json").load("file:///C:/data/input/dml.json")
val col_df=schemaFile.collect()
val dd=dml(col_df)
val inputFile=spark.read.format("csv").option("header", "false")
.load("file:///C:/data/input/input.txt")
.withColumnRenamed("_c0", "fixed_data")
println("**************Extracting the data********")
val df=inputFile.selectExpr(dd:_*)
df.show()
}
}

Parameterize using Config File
For parameterizing using the config file, we need to download a jar. You can download
this jar from the below link:
https://search.maven.org/artifact/com.typesafe/config/1.3.4/jar
The same can be downloaded from the resources in the video Lecture above.
Follow the instructions for adding the jar and developing the code as well as deploying
it.

The config files are added to the Resources in the above Lecture.
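For reference, the two config files would look roughly like the sketch below. The keys
are the ones the code reads; the values shown are only illustrative placeholders — use
the actual files from the Resources section:

Application.conf:
MyProject {
  spark {
    app-name = "configDemo"
    master = "local[*]"
  }
  mysql {
    username = "someuser"
    password = "somepassword"
  }
}

Application.properties:
dev.input.base.dir=file:///C:/data/dev/input
dev.output.base.dir=file:///C:/data/dev/output
prod.input.base.dir=file:///C:/data/prod/input
prod.output.base.dir=file:///C:/data/prod/output
input=file:///C:/data/input
output=file:///C:/data/output
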
Here is the code for configFile1.scala
package sparkPractise
import com.typesafe.config.{Config,ConfigFactory}
object configFile1 {
def main(arg:Array[String]):Unit=
{
ConfigFactory.invalidateCaches()
println("**************Application.conf**********")
val config=ConfigFactory.load("Application.conf").getConfig("MyProject")
println(config.getString("spark.app-name"))
println(config.getString("spark.master"))
println(config.getString("mysql.username"))
println(config.getString("mysql.password"))
println("**************Application.properties**********")
val config1=ConfigFactory.load("Application.properties")
println(config1.getString("dev.input.base.dir"))
println(config1.getString("dev.output.base.dir"))
println(config1.getString("prod.input.base.dir"))
println(config1.getString("prod.output.base.dir"))
println(config1.getString("input"))
println(config1.getString("output"))
}
}

Here is the code for configFile2.scala:


package sparkPractise
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkContext
import com.typesafe.config.{Config,ConfigFactory}
object configFile2 {
def main(arg:Array[String]):Unit=
{
val conf=new SparkConf().setAppName("job1").setMaster("local[*]")
val sc=new SparkContext(conf)
sc.setLogLevel("Error")
val spark=SparkSession.builder().config(conf).getOrCreate()
ConfigFactory.invalidateCaches()
val config=ConfigFactory.load("Application.properties")

//dev
if(arg(0)=="dev")
{
val read_csv_df=spark.read.format("csv").option("header","true")
.load(config.getString("dev.input.base.dir"))

read_csv_df.write.format("csv").option("header","true")
.mode("overwrite")
.save(config.getString("dev.output.base.dir"))
}

//prod
if(arg(0)=="prod")
{
val read_csv_df=spark.read.format("csv").option("header","true")
.load(config.getString("prod.input.base.dir"))

read_csv_df.write.format("csv").option("header","true")
.mode("overwrite")
.save(config.getString("prod.output.base.dir"))
}
}
}

Reading Json from a web URL and flattening it


We will be using the random user api for reading the json data and then flattening it.
Follow the instructions in the video to flatten the data. Below is the url for this usecase:
https://randomuser.me/api/0.8/?results=10
Below is the code for this session:
package sparkPractise
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import scala.io.Source
object flatten_json_webURL {

def main(args:Array[String]):Unit={

val conf=new SparkConf().setAppName("weburl").setMaster("local[*]")


val sc=new SparkContext(conf)
sc.setLogLevel("Error")

val spark=SparkSession.builder().getOrCreate()

val input_data=Source.fromURL("https://randomuser.me/api/0.8/?results=10").mkString

val rdd=sc.parallelize(List(input_data))
val df=spark.read.json(rdd)
println("*************Raw data************")
df.printSchema()
df.show(false)
println("*************Flatten Array************")
val flat_array_df=df.withColumn("results", explode(col("results")))
flat_array_df.printSchema()
flat_array_df.show(false)
println("*************Flatten Struct************")
val flat_df=flat_array_df.select(
"nationality", "results.user.cell","results.user.dob","results.user.gender",
"results.user.location.city","results.user.location.state","results.user.name.first",
"seed","version"
)
flat_df.printSchema()
flat_df.show(false)
}
}

Flattening data by creating a Function


//Code for flattening the url using a function
package sparkPractise
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import scala.io.Source
import org.apache.spark.sql.Column
object flatteningUsingFunction {
def flattenStructSchema(schema:StructType,prefix:String=null):Array[Column]=
{
schema.fields.flatMap(
f=>
{
val columnName= if (prefix == null) f.name else (prefix+"."+f.name)
// println("*************Printing column Names********")
// println(columnName)
f.dataType match{
case st:StructType => flattenStructSchema(st,columnName)
case _=> Array(col(columnName).as(columnName.replace(".", "_")))
}
}
)
}

def main(args:Array[String]):Unit=
{
val conf=new SparkConf().setAppName("weburl").setMaster("local[*]")
val sc=new SparkContext(conf)
sc.setLogLevel("Error")

val spark=SparkSession.builder().getOrCreate()

//read from url and create a rdd


val input_data=Source.fromURL("https://randomuser.me/api/0.8/?results=10").mkString
val rdd=sc.parallelize(List(input_data))

//convert rdd to df
val df=spark.read.json(rdd)

//Initial schema check and print data


println("****************Raw data***********************")
df.printSchema()
df.show()

//flatten array
println("****************Flatten Array***********************")
val web_api_exploded_df=df.withColumn("results", explode(col("results")))
web_api_exploded_df.printSchema()
web_api_exploded_df.show()
println("****************web_api_exploded_df.schema***********************")
// flattenStructSchema(web_api_exploded_df.schema)
// println(web_api_exploded_df.schema.fields)
// web_api_exploded_df.schema.fields.foreach(f=>println(f.name+","+f.dataType))

val web_api_flat=web_api_exploded_df.select(flattenStructSchema(web_api_exploded_df.schema):_*)
web_api_flat.printSchema()
web_api_flat.show(false)
}
}

Working with HBase
To work with Hbase, open cloudera and fire the below commands in order:

hadoop dfsadmin -safemode leave

hadoop fs -rmr /hbase

hadoop fs -mkdir -p /hbase/data

sudo service hbase-master restart

sudo service hbase-regionserver restart

Spark HBase Integration


The jars required for Spark HBase Integration are uploaded to Resources in the video
Lecture. You can download from there or you can click on the below links to download:
https://mvnrepository.com/artifact/org.apache.hbase/hbase-client/1.1.2.2.6.2.0-205
https://mvnrepository.com/artifact/org.apache.hbase/hbase-common/1.1.2.2.6.2.0-205
https://mvnrepository.com/artifact/com.hortonworks/shc-core/1.1.1-2.1-s_2.11
https://mvnrepository.com/artifact/org.apache.hbase/hbase-protocol/1.1.2.2.6.2.0-205
https://mvnrepository.com/artifact/org.apache.hbase/hbase-server/1.1.2.2.6.2.0-205

package sparkPractise
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.execution.datasources.hbase._

object spark_Hbase_Integration {
def main(args:Array[String]):Unit=
{
val conf=new SparkConf().setAppName("com").setMaster("local[*]")
val sc=new SparkContext(conf)
sc.setLogLevel("Error")

val spark=SparkSession.builder().getOrCreate()
import spark.implicits._

def catalog_1=s"""{
"table":{"namespace":"hrdata","name":"employee"},
"rowkey":"rowkey",
"columns":{
"rowid":{"cf":"rowkey","col":"rowkey","type":"string"},
"id":{"cf":"cf_emp","col":"eid","type":"string"},
"name":{"cf":"cf_emp","col":"ename","type":"string"}
}

}""".stripMargin

val df=spark.read.options(Map(HBaseTableCatalog.tableCatalog->catalog_1))
.format("org.apache.spark.sql.execution.datasources.hbase").load()
df.show(false)
}
}
Below is the code for writing a dataframe to an HBase table:

package sparkPractise
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.hbase._

object hbase_write {
def main(args:Array[String]):Unit=
{
val conf=new SparkConf().setAppName("com").setMaster("local[*]")
val sc=new SparkContext(conf)
sc.setLogLevel("Error")

val spark=SparkSession.builder().getOrCreate()
import spark.implicits._

//read input file


val input_df=spark.read.format("csv").option("header", "true").option("delimiter",
"|").load("/user/cloudera/datasets/txn1")

val sel_df=input_df.select("txno", "custno","category")

//define catalog.Map spark cols to hbase cols


def catalog_write=s"""{
"table":{"namespace":"hbase_practise","name":"hbase_transactions"},
"rowkey":"rowkey",
"columns":{
"txno":{"cf":"rowkey","col":"rowkey","type":"string"},
"custno":{"cf":"cf_txn","col":"cid","type":"string"},
"category":{"cf":"cf_txn","col":"category","type":"string"}
}

}""".stripMargin

val df=sel_df.write.options(
Map(HBaseTableCatalog.tableCatalog->catalog_write,HBaseTableCatalog.newTable->"4")
).format("org.apache.spark.sql.execution.datasources.hbase").save()
}
}

Cassandra setup and working with Cassandra
We need to install two pieces of software:

1. Datastax
2. Cassandra

1. Datastax: Datastax is uploaded to the resources section in this Lecture.
Download it from there.
2. Cassandra: Cassandra is also uploaded to the resources section. You can
download it from there, or you can use the below link to download the software:
https://www.apache.org/dyn/closer.lua/cassandra/3.11.13/apache-cassandra-3.11.13-bin.tar.gz

Once the software is downloaded, install it as per the instructions from
the Video Lecture.

Cassandra Spark Integration
To integrate Spark with Cassandra, we need to download the below two jars:
https://mvnrepository.com/artifact/com.datastax.spark/spark-cassandra-connector_2.11/2.3.1
https://mvnrepository.com/artifact/com.twitter/jsr166e/1.1.0

These jars are also uploaded to the resources section. So you can download
from here as well. Once downloaded, follow the instructions from the video
lecture.

package sparkPractise

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

object cassandra_write {

def main(args:Array[String]):Unit=
{

val conf=new SparkConf().setAppName("com").setMaster("local[*]")


val sc=new SparkContext(conf)
sc.setLogLevel("Error")

val spark=SparkSession.builder().getOrCreate()
import spark.implicits._

val df=spark.read.format("csv").option("header",
"true").option("delimiter", "|").load("file:///C:/data/India_1.txt")

//df.show(false)

df.write.format("org.apache.spark.sql.cassandra")
.option("spark.cassandra.connection.host","localhost")
.option("spark.cassandra.connection.port","9042")
.option("keyspace","practise")
.option("table","country")
.mode("append")
.save()
}
}
Below is the code for reading the data back from Cassandra:

package sparkPractise
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

object cassandra_read {

def main(args:Array[String]):Unit=
{

val conf=new SparkConf().setAppName("com").setMaster("local[*]")


val sc=new SparkContext(conf)
sc.setLogLevel("Error")

val spark=SparkSession.builder().getOrCreate()
import spark.implicits._

val df=spark.read.format("org.apache.spark.sql.cassandra")
.options(Map("table"->"country","keyspace"->"practise"))
.load()
df.show()
}
}

Apache NIFI Installation
Download Apache NIFI from the below URL:
https://archive.apache.org/dist/nifi/1.6.0/nifi-1.6.0-bin.zip

Extract it directly to a drive, not into a nested folder. For example, if you are
extracting it to C, put it at the root of the C drive. This is important.

Then go to nifi-1.6.0-bin\nifi-1.6.0\bin and double-click run-nifi. This will
start NiFi.

To check whether NiFi is running or not, open cmd and run jps. You
should see NiFi running.

To open the NiFi UI, use the below URL:

http://localhost:8080/nifi/

kafka installation and topic creation
For running kafka services, we need to install Kafka and Zookeeper.

For downloading zookeeper, please use the below url:


https://archive.apache.org/dist/zookeeper/zookeeper-3.4.12/

For downloading kafka, go to the kafka download page. Below is the kafka
download page:
https://kafka.apache.org/downloads

From here, you need to download: kafka_2.11-0.11.0.0

Follow the instructions from the video lecture to install and run kafka.

Spark Kafka Integration


package sparkPractise

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming._
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object spark_Kafka_Integration {

def main(args:Array[String]):Unit={

val conf=new SparkConf().setAppName("spark_kafka_integ")


.setMaster("local[*]").set("spark.driver.allowMultipleContexts", "true")

val sc=new SparkContext(conf)

sc.setLogLevel("Error")

val spark=SparkSession.builder().getOrCreate()
import spark.implicits._

val ssc=new StreamingContext(conf,Seconds(2))

//kafka params
val kparams = Map[String, Object]("bootstrap.servers" -> "localhost:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "new_consumer",
"auto.offset.reset" -> "earliest",
"enable.auto.commit" -> (true: java.lang.Boolean))

val topics=Array("new_topic")

val stream=KafkaUtils.createDirectStream(ssc, PreferConsistent,
Subscribe[String,String](topics,kparams)).map(x=>x.value())

stream.foreachRDD(x=>
if(!x.isEmpty())
{
val df=x.toDF().show()
}
)

//start streaming
ssc.start()
ssc.awaitTermination()

}
}

Spark Structured Streaming


package sparkPractise

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

object sparkFileStreamingStructured {

def main(args:Array[String]):Unit={

val conf=new SparkConf().setAppName("File_Streamig")


.setMaster("local[*]")

val sc=new SparkContext(conf)

sc.setLogLevel("Error")

val spark=SparkSession.builder().getOrCreate()
import spark.implicits._

val dml=StructType(Array(
StructField("id",StringType,true),
StructField("name",StringType,true)
));

val df=spark.readStream.format("csv").schema(dml)
.load("file:///C:/data/streaming/src_data")

df.writeStream.format("console")
.option("checkpointLocation", "file:///C:/data/streaming/check_point")
.start().awaitTermination()

}
}

Spark Kafka Integration
package sparkPractise
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
object sparkwithKafka {

def main(args:Array[String]):Unit={

val conf=new SparkConf().setAppName("File_Streamig")


.setMaster("local[*]")

val sc=new SparkContext(conf)


sc.setLogLevel("Error")

val spark=SparkSession.builder().getOrCreate()
import spark.implicits._

val readkafka=spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "read_topic")
.load().withColumn("value", expr("cast(value as string)"))
.selectExpr("concat(value,',Hello') as value")

//write to kafkatopic
readkafka.writeStream.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("topic", "write_topic").option("checkpointLocation",
"file:///C:/data/streaming/check_point")
.start().awaitTermination()
}
}
