Big Data Developer
What is Big Data
What is Big Data?
•Big data is a collection of data that is huge in volume and grows exponentially with time.
•Examples of Big Data: data from social media, sales details from big retailers like Walmart, jet engine data (which can generate more than 10 terabytes in 30 minutes of flight time), etc.
Types of Big Data:
•Structured. Eg: tables
•Semi-structured. Eg: XML
•Unstructured. Eg: images, videos, free text
There are many tools and frameworks which help in processing big data. Some of them are:
Hive, Spark, Kafka, NoSQL databases, Presto, Flink, Hudi, Druid, ...
Hadoop Commands
1. To list the files in HDFS:
hdfs dfs -ls
hdfs dfs -ls /user/cloudera/
5. put: Copies files from the local file system to the destination file system.
This command can also read input from stdin and write to
the destination file system.
hdfs dfs -put localfile1 localfile2 /user/cloudera/hadoopdir;
hdfs dfs -cp /user/cloudera/file1 /user/cloudera/file2 /user/cloudera/dir
11. text: Outputs a specified source file in text format. Valid input file formats
are zip and TextRecordInputStream.
hdfs dfs -text /user/cloudera/file8.zip
12. touchz: Creates a new, empty file of size 0 in the specified path.
hdfs dfs -touchz /user/cloudera/file12
16. count: Counts the number of directories, files, and bytes under the paths
that match the specified file pattern.
hdfs dfs -count emp #emp is directory
17. du: Displays the size of the specified file, or the sizes of files and
directories that are contained in the specified directory.
hdfs dfs -du /user/cloudera/dir1 /user/cloudera/file1
20. fsck: generates a summary report that lists the overall health of the
filesystem. HDFS is considered healthy if and only if all files have the
minimum number of replicas available.
hadoop fsck /file1.txt
hadoop fsck /file1.txt -files -blocks -locations
Serialized File Formats
Serialization :
Serialization is the process of converting a data structure or an object into a
format that can be easily stored or transmitted over a network and can be
easily reconstructed later.
For example, suppose we have the below data:
“This is my Data”
This data will be serialized, i.e. converted into a stream of bytes (a human-unreadable format). So serialized data is unreadable. It is either stored or transmitted over a network. Later it is deserialized, i.e. converted back to its original, readable form.
Advantages of Serialization:
1) Serialized data is easy and fast to transmit over a network; reads and writes are fast.
2) Some serialized file formats offer good compression and can be encrypted.
3) Deserialization also does not take much time.
1) Sequence File Format: pure Java serialization. It helps in easy retrieval of data, but because it is Java serialization it works best with MapReduce only. It does not work well with Spark or other technologies. Since MapReduce is largely obsolete, so is the Sequence file format.
2) RC File Format (Row Columnar File Format): writes take time, but reads are easy, since columnar file formats are easy to read. It applies no compression technique, so the data size would be huge.
3) Parquet: a columnar file format with a reasonable compression ratio (50 to 55%).
Compression codecs supported by Parquet: snappy (default), lzo, gzip.
Parquet is commonly used in target systems because querying it is fast.
Note:
1) Sqoop supports Sequence, Parquet and Avro directly; for ORC, Sqoop needs Hive integration.
2) Because Avro is a row format, it is write heavy, while ORC and Parquet are read heavy.
3) In all these 3 file formats, the schema is stored along with the data. This means you can take these files from one machine, load them on another machine, and it will know what the data is about and be able to process it.
4) All these file formats can be split across multiple disks, so scalability and parallel processing are not an issue.
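For example, with Spark the schema travels with the Parquet files; a small sketch (this assumes a SparkSession named spark and a DataFrame df already exist, and the HDFS path is illustrative):

// write as Parquet (snappy-compressed by default)
df.write.mode("overwrite").parquet("/user/cloudera/parquet_out")

// read it back elsewhere: the schema is picked up from the files themselves
val parquetDf = spark.read.parquet("/user/cloudera/parquet_out")
parquetDf.printSchema()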
sqoop import
sqoop import:
What happens if you don't specify the number of mappers? If you don't specify it, Sqoop will take 4 mappers by default.
If you don't specify --split-by when the number of mappers is more than 1, the import will fail. However, if the table has a primary key column, Sqoop will use that primary key as the split-by column.
query:
sqoop import --connect jdbc:mysql://localhost:3306/retail_db --username root
--password cloudera --m 2 --split-by customer_id
--query 'select customer_id,customer_fname,customer_lname,customer_state,customer_city from
customers where $CONDITIONS' --delete-target-dir --target-dir
/user/cloudera/query_eg
What if we also have a where clause condition in our query? We still need to use $CONDITIONS, as below:
sqoop import --connect jdbc:mysql://localhost:3306/retail_db --username root
--password cloudera --m 2 --split-by customer_id
--query 'select customer_id,customer_fname,customer_lname,customer_state,customer_city from
customers where customer_state="TX" AND $CONDITIONS' --delete-target-dir
--target-dir /user/cloudera/query_eg
Sometimes we may want to evaluate the data before importing it into Hadoop. Sqoop eval helps us do this. Sqoop eval basically connects to the database from a sqoop command. Any SQL command that you can fire on the database can be executed using sqoop eval: it connects to the database, fires the command, and displays the results on the edge node. We use eval instead of import and place the query in --query.
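For example, a sketch reusing the connection details from the earlier commands:

sqoop eval --connect jdbc:mysql://localhost:3306/retail_db --username root
--password cloudera --query "select count(*) from customers"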
incremental import
Sqoop Incremental Import:
sqoop import --connect jdbc:mysql://localhost:3306/retail_db --username root
--password cloudera --m 2 --table customers --split-by customer_id
--target-dir /user/cloudera/customer1 --incremental append --check-column
customer_id --last-value 0
Next time, you need to change the last-value so the import starts from values greater than the last one imported. So, every time you import new records, you must remember the last-value from the previous import and use it in the sqoop command. This is very manual. What if sqoop could do this job for us, remembering the last-value and importing from it on the next run? This is possible by creating a sqoop job, as sketched below.
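A sketch of what creating such a job looks like (the job name myJob matches the execution command below; note the standalone -- followed by a space before import):

sqoop job --create myJob
-- import --connect jdbc:mysql://localhost:3306/retail_db --username root
--password cloudera --table customers --m 2 --split-by customer_id
--target-dir /user/cloudera/customer1
--incremental append --check-column customer_id --last-value 0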
Now to execute this job :
sqoop job --exec myJob
Once a job is created, you cannot edit it, i.e. you cannot edit the import statement. You also cannot view the import statement associated with a job. If you want to make any changes, delete the job and recreate it with the updated details.
Password Protection
Saving the password to a file and using that file for password:
sqoop import --connect jdbc:mysql://localhost:3306/retail_db --username root
--password-file file:///home/cloudera/passfile --m 1 --table customers --target-
dir /user/cloudera/data_import
Instead of storing the password in plain text, we can encrypt the password, store the encrypted password, and then refer to it in our sqoop command. As of Sqoop version 1.4.5, Sqoop supports the use of a Java KeyStore (JKS) to store passwords in encrypted form, so that you do not need to store passwords in clear text in a file. A Java KeyStore is a repository of security certificates (either authorization certificates or public key certificates) plus corresponding private keys, used for instance in TLS encryption.
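The alias has to be created first; a sketch of that step (the provider path matches the one used in the sqoop command below, and the command prompts for the password to store):

hadoop credential create encryptedpassword -provider jceks://hdfs/tmp/mypassword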
Now, we will use this encrypted password by passing the alias encryptedpassword created above. Here is the command:
sqoop import -Dhadoop.security.credential.provider.path=jceks://hdfs/tmp/mypassword
--connect jdbc:mysql://localhost:3306/retail_db --username root
--password-alias encryptedpassword --table customers -m 1
--target-dir /user/cloudera/data_import_encrypted
If you run only an import, only mappers are involved; there are no reducers because there is no shuffling.
--mysql database
select * from orders where order_id in (68871,68809,68817,68827);
update orders set order_status='test1' where order_id=68809; -- 2014-03-12
update orders set order_status='test2' where order_id=68817; -- 2014-03-27
update orders set order_status='test3' where order_id=68827; -- 2014-04-16
update orders set order_status='test4' where order_id=68871; -- 2014-06-28
Here, --check-column createdt : the column Sqoop should use for comparison when picking up data.
--last-value : the value of createdt from which Sqoop should start picking up data.
Import as Parquet File:
sqoop import --connect jdbc:mysql://localhost:3306/retail_db --username
root --password-file file:///home/cloudera/passfile --m 1 --table order
--target-dir /user/cloudera/parquet_dir
--as-parquetfile
Import as ORC: there is no equivalent of --as-avrodatafile, --as-sequencefile or --as-parquetfile for ORC. In order to import as an ORC file, we need to leverage Sqoop's HCatalog integration feature. HCatalog is a table storage management tool for Hadoop that exposes the tabular data of the Hive metastore to other Hadoop applications.
It enables users with different data processing tools (Pig, MapReduce) to easily write data onto a grid. You can think of HCatalog as an API to access the Hive metastore.
To import as ORC, it involves the below two steps:
1) Create Hive Database with your desired HDFS warehouse location
2) Run Sqoop import command to import from RDBMS table to Hcatalog
table
Let’s sqoop import the table : customermod3 as ORC. We basically import
the data into some hive database. We can ask the sqoop to create a hive
table. So basically a hive table will be created and the directory for that hive
table will be in ORC format.
Here is the sqoop import command:
sqoop import --connect jdbc:mysql://localhost:3306/retail_db --username
root --password-file file:///home/cloudera/passfile --m 1 --table order
--hcatalog-database test
--hcatalog-table order
--create-hcatalog-table
--hcatalog-storage-stanza "stored as orcfile";
The additional options here are the --hcatalog options, where we give the database name, the Hive table to be created, and how the storage needs to happen. So the Hive table will be created under the database test.
import-all-tables will import all the tables. But since I want to import only 2 tables, I have to skip the remaining tables. We can do this using --exclude-tables.
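A sketch of what that looks like (the excluded table names and the warehouse directory are illustrative):

sqoop import-all-tables --connect jdbc:mysql://localhost:3306/retail_db --username root
--password cloudera --m 1 --exclude-tables orders,order_items
--warehouse-dir /user/cloudera/all_tables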
commit;
Let’s say if a colum has null in string datatype we want to display as nulldata
and if a column has null in non-string datype, we want do display it as 0. We
can use the below query for this purpose.
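(A sketch; the target directory is illustrative. --null-string handles string columns and --null-non-string handles the rest.)

sqoop import --connect jdbc:mysql://localhost:3306/retail_db --username root
--password cloudera --m 1 --table customers
--null-string 'nulldata' --null-non-string '0'
--delete-target-dir --target-dir /user/cloudera/null_handling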
Sqoop export
Sqoop Export Staging: we create a staging table with the same structure as the target table. Sqoop first exports into the staging table and then from the staging table into the target table. After this, the staging table data is deleted. The staging table should be in the same database as the target table.
Below is the command :
sqoop export --connect jdbc:mysql://localhost:3306/retail_db --username root
--password-file file:///home/cloudera/passfile
--table test --staging-table test_stg --m 1 --export-dir /user/cloudera/test_null
Here, --staging-table : Give the staging table name
--table : Give the target table name
If data exists in staging table and we want to first truncate the staging table via
sqoop, use --clear-staging-table . Here is the query for the same:
sqoop export --connect jdbc:mysql://localhost:3306/retail_db --username root
--password-file file:///home/cloudera/passfile
--table test --staging-table test_stg --clear-staging-table --m 1 --export-dir
/user/cloudera/test_null
In a Sqoop export, 10k rows are inserted per transaction by default (100 rows per INSERT statement, 100 statements per transaction). We can tune this using the below 2 properties:
-Dsqoop.export.records.per.statement=100
-Dsqoop.export.records.per.statement=100 --> how many rows are inserted per INSERT statement; here, 100 rows per statement
-Dsqoop.export.statements.per.transaction=100 --> how many INSERT statements are fired per transaction; here, 100 statements
So, 100 statements are fired per transaction and each statement inserts 100 rows, i.e. 100*100 = 10k rows.
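A sketch of how those flags sit in an export command (the -D generic options must come before the tool-specific options; the table and directory names reuse the earlier example):

sqoop export -Dsqoop.export.records.per.statement=100
-Dsqoop.export.statements.per.transaction=100
--connect jdbc:mysql://localhost:3306/retail_db --username root
--password-file file:///home/cloudera/passfile
--table test --m 1 --export-dir /user/cloudera/test_null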
HIVE
Hive-Data Preparation: Below commands are used for exporting the data
from mysql:
SELECT *
FROM orders
INTO OUTFILE '/var/tmp/orderss.csv'
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
SELECT *
FROM categories
INTO OUTFILE '/var/tmp/categories.csv'
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
SELECT *
FROM customers
INTO OUTFILE '/var/tmp/customers.csv'
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
SELECT *
FROM departments
INTO OUTFILE '/var/tmp/departments.csv'
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
SELECT *
FROM order_items
INTO OUTFILE '/var/tmp/order_items.csv'
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
SELECT *
FROM products
INTO OUTFILE '/var/tmp/products.csv'
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
--create a folder called datasets in the home directory
mkdir datasets
--copy the files to the new folder
cp /var/tmp/categories.csv /home/cloudera/datasets/categories.csv
cp /var/tmp/customers.csv /home/cloudera/datasets/customers.csv
cp /var/tmp/departments.csv /home/cloudera/datasets/departments.csv
cp /var/tmp/order_items.csv /home/cloudera/datasets/order_items.csv
cp /var/tmp/orderss.csv /home/cloudera/datasets/orders.csv
cp /var/tmp/products.csv /home/cloudera/datasets/products.csv
Download the required files from the resources in this lecture
America.txt countries.txt India.txt
What is Hive
Hive is a data warehouse system which is used to analyze structured data. It is built on top of Hadoop and was originally developed by Facebook.
Hive stores its metadata in the Hive metastore, which is simply a relational database. It stores metadata related to the tables/schemas you create to easily query big data stored in HDFS. When you create a new Hive table, the information related to the schema (column names, data types) is stored in the Hive metastore relational database. By default, Hive uses the embedded Derby database for this metadata storage.
Features of Hive:
•Hive is open source; we can use it for free
•HQL (Hive Query Language) is very similar to SQL
•Hive is schema on read
•Hive can be used as an ETL tool and can process huge amounts of data
•Hive supports partitioning and bucketing
•Hive is a warehouse tool designed for analytical purposes, not for transactional purposes
•It can work with multiple file formats
•It can be plugged into BI tools for data visualization
Limitations of Hive:
•Hive is not designed for OLTP operations; it is used for OLAP
•It has limited subquery support
•The latency of Hive is a little high
•Support for updates and deletes is very minimal
•It is not used for real-time queries, as it takes a bit of time to return results
Create and load a table in Hive
create database:
create database practise;
use practise;
Hive has a default directory : /user/hive/warehouse
Whenever a database is created, or a table is created or data is inserted, a
directory will be created in the above location . This is the default path.
Checking the default path:
hdfs dfs -ls /user/hive/warehouse
Note: if the data is loaded from the edge node (local inpath), the file is copied into the Hive warehouse; if it is loaded from HDFS (inpath), the file is moved into the warehouse instead of being copied.
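For example, assuming a table order_eg and an orders.csv that also sits in HDFS (the HDFS path here is illustrative):

load data local inpath '/home/cloudera/datasets/orders.csv' into table order_eg;   -- copied from the edge node
load data inpath '/user/cloudera/orders.csv' into table order_eg;                  -- moved within HDFS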
We have specified the columns and their structure in the Hive table creation. If the file that we load using inpath does not have the same structure, say it has fewer/more columns or a column type mismatch, it will still be loaded, and when we query through Hive, any mismatched column will show up as null instead of failing. This is because Hive applies the schema on read.
lets try that. Let's create table order_eg
create table practise.order_eg
(
order_id INT
)
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as textfile ;
lets load the data from edge node:
load data local inpath '/home/cloudera/datasets/orders.csv' into table order_eg;
select * from order_eg;
If we drop this table, the entire schema and the underlying files are deleted. This is a managed table.
drop table order_eg;
hdfs dfs -ls /user/hive/warehouse/
Let's create one more table: We will use some other location instead of the default
hive location.
create table practise.categories
(
category_id INT,
category_department_id INT,
category_name varchar(50)
)
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as textfile
location '/user/cloudera/datasets/category';
If the location does not exist, Hive will create it.
Load data into table:
load data local inpath '/home/cloudera/datasets/categories.csv' into table categories;
You can also create a table from an existing table:
create table category_eg as select * from categories;
--inserting into one table from another
insert into table category_eg select * from categories; #Appends
select count(*) from category_eg;
If you want to overwrite use:
insert overwrite table category_eg select * from categories #Overwrites
Hive Table Types
• Managed Table : When table dropped, backend directory associated with the table is
deleted as well. Use it for staging purpose
• External Table : When table dropped, backend directory associated with the table
would still exist. Use it for target system
--create an external table:
create external table practise.customers
(
customer_id INT,
customer_fname varchar(50),
customer_lname varchar(50),
customer_email varchar(50),
customer_password varchar(50),
customer_street varchar(255),
customer_city varchar(50),
customer_state varchar(50),
customer_zipcode varchar(50) )
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as textfile
location '/user/cloudera/datasets/customers/';
describe formatted customers;
--drop a table:
drop table customers;
hdfs dfs -ls /user/cloudera/datasets/customers/;
Hive Partitions
A partition is a way of dividing the data in a table into related parts using a partition column.
Types:
•Static Partition: partitions are created explicitly, as specified by the user
•Dynamic Partition: partitions are created dynamically from the data
Hive Static Partition:
Static Load Partition: here, we specify the partition into which the data needs to be loaded.
Static Insert Partition: here, we first create a non-partitioned table and then insert the data from this non-partitioned table into a partitioned table, naming the partition explicitly.
Download the files that are uploaded in the resources section for this lecture.
We will be using those files. First, let's create some directories:
mkdir /home/cloudera/country
mkdir /home/cloudera/country/India
mkdir /home/cloudera/country/America
mkdir /home/cloudera/country/countries
create external table normal_table
(
state varchar(255),
capital varchar(255),
language varchar(255)
)
row format delimited
fields terminated by ','
stored as textfile
location '/user/cloudera/normal_dir';
This table is without partition.
load the data:
load data local inpath '/home/cloudera/country/India' into table normal_table ;
load data local inpath '/home/cloudera/country/America' into table normal_table ;
hdfs dfs -ls /user/cloudera/normal_dir
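The partitioned table queried below can be built with a static load along these lines (a sketch: the table name, partition column and location are taken from how the table is referenced later, while the partition values are assumed):

create external table partitioned_table
(
state varchar(255),
capital varchar(255),
language varchar(255)
)
partitioned by (country varchar(100))
row format delimited
fields terminated by ','
stored as textfile
location '/user/cloudera/partitioned_dir';

load data local inpath '/home/cloudera/country/India' into table partitioned_table partition(country='IND');
load data local inpath '/home/cloudera/country/America' into table partitioned_table partition(country='USA');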
--querying the table:
select * from partitioned_table;
select * from partitioned_table where country='IND';
Static Insert Partition: inserting the data from a non-partitioned table into a partitioned table. We have a single file, countries, that contains both America and India data, and we want each country to end up in its own partition directory in HDFS. A static load won't work here because it is a single file. In this case we use static insert partition. First we create a non-partitioned table and load all the data into it; this table becomes our source. We then create the partitioned table and insert into it from the non-partitioned table, specifying the partition name. This way we can create the partitioned table with the required partitions. Since we are manually providing the partition, this is static, and since we are using insert instead of load, this is static insert partition. Let's see that in action.
First, let's create a non-partitioned table to load this data into.
create external table non_partitioned_table
(
state varchar(255),
capital varchar(255),
language varchar(255),
country varchar(255)
)
row format delimited
fields terminated by ','
stored as textfile
location '/user/cloudera/non_partitioned_dir1';
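To finish the static insert flow, a sketch of the remaining steps (the partition value supplied by hand is what makes this static; the value 'IND' matches the earlier query and the select list mirrors the table above):

load data local inpath '/home/cloudera/country/countries' into table non_partitioned_table;

insert into partitioned_table partition(country='IND')
select state,capital,language from non_partitioned_table where country='IND';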
Hive Dynamic Partition:
First, data is loaded into a non-partitioned table, and then the data from the non-partitioned table is inserted into the partitioned table dynamically.
This is very similar to static insert. The only difference is that in static insert we specify the partition the data needs to go to, while in dynamic partition the partitions are created automatically from the query result. Let's jump into an example to understand this better.
Let's create a normal table:
create external table source_country
(
state varchar(255),
capital varchar(255),
language varchar(255),
country varchar(255)
)
row format delimited
fields terminated by ','
stored as textfile
location '/user/cloudera/source_country';
create external table target_country
(
state varchar(255),
capital varchar(255),
language varchar(255)
)
partitioned by (country varchar(100))
row format delimited
fields terminated by ','
stored as textfile
location '/user/cloudera/target_country';
Let's insert data from the non-partitioned table into the partitioned table. In our source table, country is a regular column; in the target table, country is the partition column. Dynamic partitioning will group the source data by country and create a directory per group. While inserting from source to target, we need to tell Hive which column to use for partitioning; we do this by specifying it as the last column in the select statement.
insert into target_country partition(country)
select state,capital,language,country
from source_country;
Now this will give an error: by default, dynamic partitioning runs in strict mode, which does not allow all partitions to be dynamic. So set the property:
set hive.exec.dynamic.partition.mode=nonstrict;
Now execute the query again.
We can create sub-partitions as well. We have the non-partitioned table source_country, so let's use it. Now we will create a partitioned table with a sub-partition:
create external table target_country_sub
(
state varchar(255),
capital varchar(255),
language varchar(255)
)
partitioned by (country varchar(100),lang varchar(100))
row format delimited
fields terminated by ','
stored as textfile
location '/user/cloudera/target_country_sub';
insert into target_country_sub partition(country,lang)
select state,capital,language,country,language
from source_country;
hdfs dfs -ls /user/cloudera/target_country_sub
By default, each MapReduce job allows 100 dynamic partitions (i.e. 100 folders, counting both partitions and sub-partitions) to be created. If you want to change this, set the property:
set hive.exec.max.dynamic.partitions=500;
--create a hive external table
create external table customers
(
customer_id INT,
customer_fname varchar(50),
customer_lname varchar(50),
customer_email varchar(50),
customer_password varchar(50),
customer_street varchar(255),
customer_city varchar(50),
customer_state varchar(50),
customer_zipcode varchar(50)
)
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as parquet
location '/user/cloudera/customers_dir_order';
Now you can query this hive table.
Hive Buckets
Bucketing in hive is the concept of breaking data down into ranges, which are
known as buckets, to give extra structure to the data so it may be used for
more efficient queries.
For bucketing to happen, we need to enforce below properties:
set hive.exec.dynamic.partition.mode=nonstrict
set hive.enforce.bucketing=true
--create a non partitioned table
create external table orders
(
order_id int,
order_date timestamp,
order_customer_id int,
order_status varchar(50)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/cloudera/datasets/orders';
Now lets create a partitioned table. We will create buckets for this table.
create table orders_bucket
(
order_id int,
order_date timestamp,
order_customer_id int
)
partitioned by (order_status STRING)
clustered BY (order_id) INTO 5 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/cloudera/orders_bucket';
//insert from the non-partitioned table into the partitioned, bucketed table
insert into orders_bucket partition(order_status) select
order_id,order_date,order_customer_id,order_status from orders;
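Once the data is bucketed, individual buckets can be sampled directly; a small sketch of that:

select * from orders_bucket tablesample(bucket 1 out of 5 on order_id);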
create external table test_1
(
id int,
name string,
city string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/cloudera/test3';
select * from test_1;
This is good. Now lets insert two more records to mysql:
insert into test(Id,Name,city) values (105,'Robin','Mumbai');
insert into test(Id,Name,city) values (106,'Anjali','Cochin');
lets execute our sqoop job:
sqoop job --exec myJob1
Check the data:
hdfs dfs -ls /user/cloudera/test3
Query hive table:
select * from test_1;
This is working as expected. Now imagine that at the source they have dropped the column Name and are only sending Id and city.
alter table test drop column Name;
insert into test(Id,city) values (107,'London');
insert into test(Id,city) values (108,'Hyderabad');
Now lets run our sqoop job:
Check the data:
hdfs dfs -ls /user/cloudera/test3
Query hive table:
select * from test_1;
You will observe that the data is no longer in the proper order. It will not throw any error, but you will see a data mismatch here.
Now, to handle such schema changes, instead of using a text file we can store the data in an Avro file.
Let's redo this again. Let's use table : test_stg.
insert into test_stg(Id,Name,city) values (100,'John','New York');
insert into test_stg(Id,Name,city) values (101,'Pooja','Mumbai');
insert into test_stg(Id,Name,city) values (103,'Michael','London');
insert into test_stg(Id,Name,city) values (104,'Jessy','Cochin');
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as textfile ;
describe formatted demo1.customers;
load data local inpath '/home/cloudera/datasets/customers.csv' into table
customers;
select * from customers;
Now run this file as :
hive -f file1.hql
You will observe that all the commands within the file are executed one by
one.
Joins in Hive
Hive supports the standard SQL joins: inner join, left outer, right outer and full outer joins. In addition to these, Hive supports a few join strategies.
Lets first create a table to use the join concepts.
create external table order_items
(
order_item_id int,
order_item_order_id int,
order_item_product_id int,
order_item_quantity int,
order_item_subtotal double,
order_item_product_price double
)
row format delimited
fields terminated by ','
stored as textfile
location '/user/cloudera/datasets/order_items';
1) Map Join
In a map join, the smaller table is loaded into memory and the join completes in the map phase itself, so there is no shuffle. A hinted map join over these two tables looks roughly as below (the storage clause above and the select list here are assumed; the join condition is the original one):
select /*+ MAPJOIN(od) */ o.order_id, o.order_status, od.order_item_subtotal
from orders o join order_items od
on (o.order_id=od.order_item_order_id);
2) Bucket Map Join
For Bucket Map join to take place, it has to satisfy many criteria.
First, both the join tables must be bucketed and it must be bucketed on the
join key. Also the number of buckets between two tables must be multiple of
each other. For example if table1 has 2 buckets then table 2 must have
buckets in multiples of 2 i.e it should have either 2,4,6,8 .. .buckets
And we also need to setup below properties:
set hive.optimize.bucketmapjoin = true
set hive.enforce.bucketing = true;
eg:
SELECT
/*+ MAPJOIN(table2) */ table1.emp_id,
table1.emp_name,
table2.job_title
FROM table1 inner JOIN table2
ON table1.emp_id = table2.emp_id;
MSCK Repair
In all our examples so far, we have created a partitioned table and then loaded data into its partitions. What if we already have the data in HDFS and want to create a partitioned table on top of that data? We haven't tried this yet. We have created a non-partitioned external table on existing data and were able to query it from Hive, but we never tried it with a partitioned table. Let's try that.
We have already created a partitioned table in earlier classes. Let's make
use of one of them.
let's use partitioned_table;
describe formatted partitioned_table;
We have the data here : /user/cloudera/partitioned_dir
So lets create another table on top of this data:
create external table partitioned_table_existing
(
state varchar(255),
capital varchar(255),
language varchar(255)
)
partitioned by (country varchar(100))
row format delimited
fields terminated by ','
stored as textfile
location '/user/cloudera/partitioned_dir';
lets query the table:
select * from partitioned_table_existing; -- no data. Why ?
When we create a partitioned table and load data into it, the partitions are registered in the Hive metastore. If we create a partitioned table on top of existing data, the partitions are not automatically registered in the metastore, so we need to recover them.
For this we use:
msck repair table partitioned_table_existing
Now query the table. You should see the data
select * from partitioned_table_existing;
Performance Tuning in Hive
Hive Performance Tuning:
1. Partitions : Using partitions will minimize scanning on the entire data
which would improve the query performance
2. Bucketing :Use bucketing when you are going to query on particular
column frequently
3. Map joins, skew joins: use the appropriate join strategy based on the data.
4. Vectorization: by default, Hive processes 1 row at a time. If you enable vectorization, Hive processes a block of 1,024 rows at a time. The trade-off is that it occupies more memory. Enable it with set hive.vectorized.execution.enabled=true (see the set commands after this list).
5. Hive parallel execution: if you have query stages that are independent of each other, you can make them run in parallel by enabling Hive parallel execution.
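For reference, the session-level switches for points 4 and 5 look like this (property names as commonly used in Hive):

set hive.vectorized.execution.enabled=true;
set hive.exec.parallel=true;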
Hive vs SQL
1) Hive is schema on read, while SQL databases are schema on write
2) Hive is for analytics, while SQL databases are for transactional workloads
3) Hive is a data warehouse, while SQL is a database
4) SQL supports only structured data, while Hive supports structured and semi-structured data
Installing and setting up Spark and Scala -
Download links
Download the software from the links below and follow the instructions from the video session for setting up Scala and Spark.
1. Download JDK 1.8 version:
Paste the below link in your browser
https://www.oracle.com/java/technologies/javase/javase8-archive-
downloads.html
Now search for jdk-8u172-windows-x64.exe and click on it to download
2. Download Spark
Paste the below link in your browser to download.
https://archive.apache.org/dist/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.6.tgz
3. Download scala:
Paste the below link in your browser to download.
http://downloads.typesafe.com/scalaide-pack/4.7.0-vfinal-oxygen-212-
20170929/scala-SDK-4.7.0-vfinal-2.12-win32.win32.x86_64.zip
4) Winutils
Paste the below link in your browser to download.
https://github.com/steveloughran/winutils/raw/master/hadoop-
2.7.1/bin/winutils.exe
SCALA
Scala Collections :
Scala collections are containers that hold a sequenced, linear set of elements.
Scala collections can be mutable or immutable.
scala.collection.mutable contains all mutable collections. If you want to use the mutable collections, you must import this package.
scala.collection.immutable contains all immutable collections. Scala imports this package by default.
Here are the collection types:
Set: stores unique elements and does not maintain any order. Elements can be of different datatypes.
Eg: val games=Set("cricket","Football","Hockey")
Eg : var vector2 = Vector(5,2,6,3)
Queue: implements a data structure that allows inserting and retrieving elements in a first-in-first-out (FIFO) manner.
In Scala, an immutable Queue is implemented as a pair of lists: elements are enqueued onto one list and dequeued from the other.
Eg : var queue = Queue(1,5,6,2,3,9,5,2,5)
var arr = Array(1,2,3,4,5)
Collection Methods
Here are some of the collection Methods:
1. map: takes a function and applies that function to every element in the collection.
Eg: val mul=num.map(x=>x*2)
4. count: takes a predicate and returns the number of elements that satisfy it.
Eg: val cc=num.count(x=>x%2==0)
7. partition: groups the elements. You specify a condition; elements satisfying that condition go into one group and the remaining elements into another.
Eg: val part_even=num.partition(x=>x%2==0)
It creates two lists: one with the even elements and the other with the odd elements.
9. foldLeft, foldRight:
foldLeft and foldRight do what reduceLeft and reduceRight do. The only difference is that foldLeft and foldRight take an initial value.
Eg: val fold_left=name.foldLeft("robin")(_+_)
10. scanLeft, scanRight:
Same as fold. The basic difference is that fold gives only the final result, while scan also gives the intermediate results.
Eg: val scan_right=name.scanRight("ron")(_+_)
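Putting a few of these together in one runnable sketch (the val names num and name are illustrative):

val num = List(1, 2, 3, 4, 5)
val name = List("a", "b", "c")

val mul = num.map(x => x * 2)                    // List(2, 4, 6, 8, 10)
val cc = num.count(x => x % 2 == 0)              // 2
val part_even = num.partition(x => x % 2 == 0)   // (List(2, 4), List(1, 3, 5))
val fold_left = name.foldLeft("robin")(_ + _)    // "robinabc"
val scan_right = name.scanRight("ron")(_ + _)    // List("abcron", "bcron", "cron", "ron")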
SPARK
RDD Basics - Reading and Writing a File
Download the BigBasket file by clicking on the below link:
https://www.kaggle.com/datasets/surajjha101/bigbasket-entire-product-list-28k-
datapoints?select=BigBasket+Products.csv
Once downloaded and saved, follow the instructions in the video
https://github.com/logpai/loghub/blob/master/Hadoop/Hadoop_2k.log
sales.txt words.txt
https://www.kaggle.com/datasets/rohitsahoo/employee
Once the file is downloaded, follow the instructions provided in the video session
The spark XML jars are already uploaded to the Resource section in the video
lecture. The same can be downloaded using the below links:
commons-io-2.8.0.jar :
https://mvnrepository.com/artifact/commons-io/commons-io/2.8.0
txw2-2.3.3.jar :
https://mvnrepository.com/artifact/org.glassfish.jaxb/txw2/2.3.3
xmlschema-core-2.2.5.jar :
https://mvnrepository.com/artifact/org.apache.ws.xmlschema/xmlschema-
core/2.2.5
Read xml jar : https://mvnrepository.com/artifact/com.databricks/spark-
xml_2.11/0.11.0
https://www.kaggle.com/datasets/shivamb/bank-customer-segmentation
5. case when: acts like a CASE statement in SQL, or if/then/else in a programming language.
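A hedged Spark sketch of the idea (df, the column names and the thresholds are illustrative):

import org.apache.spark.sql.functions._

val labelled = df.withColumn("price_band",
  when(col("sale_price") > 500, "high")
    .when(col("sale_price") > 100, "medium")
    .otherwise("low"))
labelled.show(false)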
Working with Fixed Width File
The files required for this are uploaded in the Resources section in the Video
Lecture. Download the files and follow the instructions as per the video.
Below is the complete code for Fixed Width File scenario:
package sparkPractise
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row
object fixedWidth {
def dml(column1:Array[Row]):List[String]=
{
var final_dml:String=""
var last_dml:List[String]=Nil
for(i<-column1)
{
val data=i.mkString(",")
val columnName=data.split(",")(0)
val pos=data.split(",")(1)
val len=data.split(",")(2)
final_dml=s"substring(fixed_data,$pos,$len) as $columnName"
last_dml=last_dml :+ final_dml
//println(last_dml)
}
last_dml
}
def main(args:Array[String]):Unit={
val conf=new SparkConf().setAppName("para").setMaster("local[*]")
val sc=new SparkContext(conf)
sc.setLogLevel("Error")
val spark=SparkSession.builder().getOrCreate()
import spark.implicits._
val schemaFile=spark.read.format("json").load("file:///C:/data/input/dml.json")
val col_df=schemaFile.collect()
val dd=dml(col_df)
val inputFile=spark.read.format("csv").option("header", "false")
.load("file:///C:/data/input/input.txt")
.withColumnRenamed("_c0", "fixed_data")
println("**************Extracting the data********")
val df=inputFile.selectExpr(dd:_*)
df.show()
}
}
Parameterize using Config File
For parameterizing using the config file, we need to download a jar. You can download
this jar from the below link:
https://search.maven.org/artifact/com.typesafe/config/1.3.4/jar
The same can be downloaded from the resources in the video Lecture above.
Follow the instructions for adding the jar and developing the code as well as deploying
it.
The config files are added to the Resources in the above Lecture.
Here is the code for configFile1.scala
package sparkPractise
import com.typesafe.config.{Config,ConfigFactory}
object configFile1 {
def main(arg:Array[String]):Unit=
{
ConfigFactory.invalidateCaches()
println("**************Application.conf**********")
val config=ConfigFactory.load("Application.conf").getConfig("MyProject")
println(config.getString("spark.app-name"))
println(config.getString("spark.master"))
println(config.getString("mysql.username"))
println(config.getString("mysql.password"))
println("**************Application.properties**********")
val config1=ConfigFactory.load("Application.properties")
println(config1.getString("dev.input.base.dir"))
println(config1.getString("dev.output.base.dir"))
println(config1.getString("prod.input.base.dir"))
println(config1.getString("prod.output.base.dir"))
println(config1.getString("input"))
println(config1.getString("output"))
}
}
//dev
if(arg(0)=="dev")
{
val read_csv_df=spark.read.format("csv").option("header","true")
.load(config.getString("dev.input.base.dir"))
read_csv_df.write.format("csv").option("header","true")
.mode("overwrite")
.save(config.getString("dev.output.base.dir"))
}
//prod
if(arg(0)=="prod")
{
val read_csv_df=spark.read.format("csv").option("header","true")
.load(config.getString("prod.input.base.dir"))
read_csv_df.write.format("csv").option("header","true")
.mode("overwrite")
.save(config.getString("prod.output.base.dir"))
}
}
}
def main(args:Array[String]):Unit={
val conf=new SparkConf().setAppName("weburl").setMaster("local[*]")
val sc=new SparkContext(conf)
sc.setLogLevel("Error")
val spark=SparkSession.builder().getOrCreate()
val input_data=Source.fromURL("https://randomuser.me/api/0.8/?results=10").mkString
val rdd=sc.parallelize(List(input_data))
val df=spark.read.json(rdd)
println("*************Raw data************")
df.printSchema()
df.show(false)
println("*************Flatten Array************")
val flat_array_df=df.withColumn("results", explode(col("results")))
flat_array_df.printSchema()
flat_array_df.show(false)
println("*************Flatten Struct************")
val flat_df=flat_array_df.select(
"nationality", "results.user.cell","results.user.dob","results.user.gender",
"results.user.location.city","results.user.location.state","results.user.name.first",
"seed","version"
)
flat_df.printSchema()
flat_df.show(false)
}
}
def main(args:Array[String]):Unit=
{
val conf=new SparkConf().setAppName("weburl").setMaster("local[*]")
val sc=new SparkContext(conf)
sc.setLogLevel("Error")
val spark=SparkSession.builder().getOrCreate()
//read the web api response and convert the rdd to a df
val input_data=Source.fromURL("https://randomuser.me/api/0.8/?results=10").mkString
val rdd=sc.parallelize(List(input_data))
val df=spark.read.json(rdd)
//flatten array
println("****************Flatten Array***********************")
val web_api_exploded_df=df.withColumn("results", explode(col("results")))
web_api_exploded_df.printSchema()
web_api_exploded_df.show()
println("****************web_api_exploded_df.schema***********************")
// flattenStructSchema(web_api_exploded_df.schema)
// println(web_api_exploded_df.schema.fields)
// web_api_exploded_df.schema.fields.foreach(f=>println(f.name+","+f.dataType))
val web_api_flat=web_api_exploded_df.select(flattenStructSchema(web_api_exploded_df.schema):_*)
web_api_flat.printSchema()
web_api_flat.show(false)
}
}
Working with HBase
To work with Hbase, open cloudera and fire the below commands in order:
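A sketch of those shell steps, inferred from the catalog used in the code below (namespace hrdata, table employee, column family cf_emp; the sample rows are illustrative):

hbase shell
create_namespace 'hrdata'
create 'hrdata:employee', 'cf_emp'
put 'hrdata:employee', '1', 'cf_emp:eid', '100'
put 'hrdata:employee', '1', 'cf_emp:ename', 'John'
scan 'hrdata:employee'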
package sparkPractise
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.execution.datasources.hbase._
object spark_Hbase_Integration {
def main(args:Array[String]):Unit=
{
val conf=new SparkConf().setAppName("com").setMaster("local[*]")
val sc=new SparkContext(conf)
sc.setLogLevel("Error")
val spark=SparkSession.builder().getOrCreate()
import spark.implicits._
def catalog_1=s"""{
"table":{"namespace":"hrdata","name":"employee"},
"rowkey":"rowkey",
"columns":{
"rowid":{"cf":"rowkey","col":"rowkey","type":"string"},
"id":{"cf":"cf_emp","col":"eid","type":"string"},
"name":{"cf":"cf_emp","col":"ename","type":"string"}
}
}""".stripMargin
val df=spark.read.options(Map(HBaseTableCatalog.tableCatalog->catalog_1))
.format("org.apache.spark.sql.execution.datasources.hbase").load()
df.show(false)
}
}
package sparkPractise
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.hbase._
object hbase_write {
def main(args:Array[String]):Unit=
{
val conf=new SparkConf().setAppName("com").setMaster("local[*]")
val sc=new SparkContext(conf)
sc.setLogLevel("Error")
val spark=SparkSession.builder().getOrCreate()
import spark.implicits._
}""".stripMargin
val df=sel_df.write.options(
Map(HBaseTableCatalog.tableCatalog-
>catalog_write,HBaseTableCatalog.newTable->"4")
).format("org.apache.spark.sql.execution.datasources.hbase").save()
}
}
Cassandra setup and working with Cassandra
We need to install two pieces of software:
1. Datastax
2. Cassandra
Once the softwares are downloaded, install them as per the instructions from
the Video Lecture.
Cassandra Spark Integration
To integrate Spark with Cassandra, we need to download the below two jars:
https://mvnrepository.com/artifact/com.datastax.spark/spark-cassandra-
connector_2.11/2.3.1
https://mvnrepository.com/artifact/com.twitter/jsr166e/1.1.0
These jars are also uploaded to the resources section. So you can download
from here as well. Once downloaded, follow the instructions from the video
lecture.
package sparkPractise
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode
object cassandra_write {
def main(args:Array[String]):Unit=
{
val spark=SparkSession.builder().getOrCreate()
import spark.implicits._
val df=spark.read.format("csv").option("header",
"true").option("delimiter", "|").load("file:///C:/data/India_1.txt")
//df.show(false)
df.write.format("org.apache.spark.sql.cassandra")
.option("spark.cassandra.connection.host","localhost")
.option("spark.cassandra.connection.port","9042")
.option("keyspace","practise")
.option("table","country")
.mode("append")
.save()
}
}
package sparkPractise
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode
object cassandra_read {
def main(args:Array[String]):Unit=
{
val spark=SparkSession.builder().getOrCreate()
import spark.implicits._
val df=spark.read.format("org.apache.spark.sql.cassandra")
.options(Map("table"->"country","keyspace"->"practise"))
.load()
df.show()
}
}
Apache NIFI Installation
Download Apache NIFI from the below URL:
https://archive.apache.org/dist/nifi/1.6.0/nifi-1.6.0-bin.zip
To check whether NiFi is running or not, just open cmd and run the jps command. You should see the NiFi process listed.
kafka installation and topic creation
For running kafka services, we need to install Kafka and Zookeeper.
For downloading kafka, go to the kafka download page. Below is the kafka
download page:
https://kafka.apache.org/downloads
Follow the instructions from the video lecture to install and run kafka.
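Once Kafka and Zookeeper are running, the topic used in the code below can be created roughly like this (paths and flags depend on the Kafka version; newer versions take --bootstrap-server localhost:9092 instead of --zookeeper):

bin\windows\kafka-topics.bat --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic new_topic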
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming._
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
object spark_Kafka_Integration {
def main(args:Array[String]):Unit={
val conf=new SparkConf().setAppName("sparkKafka").setMaster("local[*]")
val sc=new SparkContext(conf)
val ssc=new StreamingContext(sc,Seconds(10))
sc.setLogLevel("Error")
val spark=SparkSession.builder().getOrCreate()
import spark.implicits._
//kafka params
val kparams = Map[String, Object]("bootstrap.servers" -> "localhost:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "new_consumer",
"auto.offset.reset" -> "earliest",
"enable.auto.commit" -> (true: java.lang.Boolean))
val topics=Array("new_topic")
//create the direct stream from kafka
val stream=KafkaUtils.createDirectStream[String,String](ssc,PreferConsistent,Subscribe[String,String](topics,kparams))
stream.foreachRDD(x=>
if(!x.isEmpty())
{
val df=x.map(rec=>rec.value()).toDF("value")
df.show()
}
)
//start streaming
ssc.start()
ssc.awaitTermination()
}
}
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
object sparkFileStreamingStructured {
def main(args:Array[String]):Unit={
val conf=new SparkConf().setAppName("fileStreaming").setMaster("local[*]")
val sc=new SparkContext(conf)
sc.setLogLevel("Error")
val spark=SparkSession.builder().getOrCreate()
import spark.implicits._
val dml=StructType(Array(
StructField("id",StringType,true),
StructField("name",StringType,true)
));
val df=spark.readStream.format("csv").schema(dml)
.load("file:///C:/data/streaming/src_data")
df.writeStream.format("console")
.option("checkpointLocation", "file:///C:/data/streaming/check_point")
.start().awaitTermination()
}
}
Spark Kafka Integration
package sparkPractise
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
object sparkwithKafka {
def main(args:Array[String]):Unit={
val spark=SparkSession.builder().getOrCreate()
import spark.implicits._
val readkafka=spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "read_topic")
.load().withColumn("value", expr("cast(value as string)"))
.selectExpr("concat(value,',Hello') as value")
//write to kafkatopic
readkafka.writeStream.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("topic", "write_topic").option("checkpointLocation",
"file:///C:/data/streaming/check_point")
.start().awaitTermination()
}
}