Questions For CCA175


Table of Contents
Flume
Problem Scenario 1 :
Problem Scenario 2 :

Spark
Problem Scenario 1 :
Problem Scenario 2:
Problem Scenario 3 :
Problem Scenario 4 :
Problem Scenario 5:
Problem Scenario 6 :
Problem Scenario 7 :
Problem Scenario 8 :
Problem Scenario 9:
Problem Scenario 10 :
Problem Scenario 11:
Problem Scenario 12 :
Problem Scenario 13 :
Problem Scenario 14 :
Problem Scenario 15 :
Problem Scenario 16 :
Problem Scenario 17 :

HDFS
Problem Scenario 1 :

Sqoop
Problem Scenario 1 :
Problem Scenario 2:
Problem Scenario 3 :
Problem Scenario 4:
Problem Scenario 5 :
Problem Scenario 6 :

Hive
Problem Scenario 1 :

Flume

Problem Scenario 1 :
 
You have been given below comma separated employee 
information. 
 
Data Set: 
name,salary,sex,age 
alok,100000,male,29 
jatin,105000,male,32 
yogesh,134000,male,39 
ragini,112000,female,35 
jyotsana,129000,female,39 
valmiki,123000,male,29 
 
Requirements: 
 
Use the netcat service on port 44444 and send the above data line by line with nc. Please do the 
following activities. 
 
1. Create a Flume conf file using the fastest channel, which writes data into the Hive warehouse 
directory, in a table called flumemaleemployee (create the Hive table as well for the given data). 
2. While importing, make sure only male employee data is stored. 
 
 
 
 
Answer: 
 
Step 1 : Create hive table for flumemaleemployee. 
 
CREATE TABLE flumemaleemployee ( name string, salary int, sex string, age int ) ROW FORMAT 
DELIMITED FIELDS TERMINATED BY ',';  
 
Step 2 : Create a Flume configuration file with the below configuration for source, sink and channel, 
and save it as flume4.conf.  
 
#Define source , sink, channel and agent. 
 
agent1.sources = source1  
agent1.sinks = sink1  
agent1.channels = channel1  
 
# Describe/configure source1  
agent1.sources.source1.type = netcat  
agent1.sources.source1.bind = 127.0.0.1  
agent1.sources.source1.port = 44444  
 
#Define interceptors  
agent1.sources.source1.interceptors=i1  
agent1.sources.source1.interceptors.i1.type=regex_filter  
agent1.sources.source1.interceptors.i1.regex=female  
agent1.sources.source1.interceptors.i1.excludeEvents=true  
 
# Describe sink1  
agent1.sinks.sink1.type = hdfs  
agent1.sinks.sink1.hdfs.path = /user/hive/warehouse/flumemaleemployee  
agent1.sinks.sink1.hdfs.writeFormat = Text  
agent1.sinks.sink1.hdfs.fileType = DataStream  
 
# Now we need to define channel1 property.  
 
agent1.channels.channel1.type = memory  
agent1.channels.channel1.capacity = 1000  
agent1.channels.channel1.transactionCapacity = 100  
 
# Bind the source and sink to the channel  
 
agent1.sources.source1.channels = channel1  
agent1.sinks.sink1.channel = channel1  
 
Step 3 : Run the below command, which will use this configuration file and append data in HDFS. 
Start the Flume service:  
 
flume-ng agent --conf /home/cloudera/flumeconf --conf-file 
/home/cloudera/flumeconf/flume4.conf --name agent1  
 
Step 4 : Open another terminal and use the netcat service, nc localhost 44444  
 
Step 5 : Enter data line by line.  
alok,100000,male,29  
jatin,105000,male,32  
yogesh,134000,male,39  
ragini,112000,female,35  
jyotsana,129000,female,39  
valmiki,123000,male,29  
 
Step 6 : Open hue and check the data is available in hive table or not.  
 
Step 7 : Stop flume service by pressing ctrl+c  
 
Step 8 : Calculate the average salary on the Hive table using the below query.  
 
You can use either the Hive command line tool or Hue. select avg(salary) from flumemaleemployee; 
 

Problem Scenario 2 :
 
You need to implement a near-real-time solution for collecting 
information as it is submitted in files with the below content. 
 
Data 
echo "IBM,100,20160104" >> /tmp/spooldir/bb/.bb.txt 
echo "IBM,103,20160105" >> /tmp/spooldir/bb/.bb.txt 
mv /tmp/spooldir/bb/.bb.txt /tmp/spooldir/bb/bb.txt 
After few mins 
echo "IBM,100.2,20160104" >> /tmp/spooldir/dr/.dr.txt 
echo "IBM,103.1,20160105" >> /tmp/spooldir/dr/.dr.txt 
mv /tmp/spooldir/dr/.dr.txt /tmp/spooldir/dr/dr.txt 
 
Requirements: 
You have been given the below directory location (if not available then create it): /tmp/spooldir . 
You have a financial subscription for getting stock prices from Bloomberg as well as 
Reuters, and using ftp you download new files every hour from their respective ftp sites into the 
directories /tmp/spooldir/bb and /tmp/spooldir/dr respectively. 
As soon as a file is committed in either directory it needs to be available in hdfs in the 
/tmp/flume/finance location, in a single directory. 
 
Write a flume configuration file named flume7.conf and use it to load data into hdfs with the 
following additional properties. 
1. Spool /tmp/spooldir/bb and /tmp/spooldir/dr 
2. File prefix in hdfs should be events 
3. File suffix should be .log 
4. If a file is not committed and is in use then it should have _ as prefix. 
5. Data should be written as text to hdfs 
 
Answer: 
 
Step 1 : Create the directories. 
mkdir -p /tmp/spooldir/bb /tmp/spooldir/dr 
 
Step 2 : Create the flume configuration file flume7.conf with the below configuration for sources, sink and channel. 
 
agent1.sources = source1 source2 
agent1.sinks = sink1 
agent1.channels = channel1 
 
agent1.sources.source1.channels = channel1 
agent1.sources.source2.channels = channel1 
agent1.sinks.sink1.channel = channel1 
 
agent1.sources.source1.type = spooldir 
agent1.sources.source1.spoolDir = /tmp/spooldir/bb 
agent1.sources.source2.type = spooldir 
agent1.sources.source2.spoolDir = /tmp/spooldir/dr 
 
agent1.sinks.sink1.type = hdfs 
agent1.sinks.sink1.hdfs.path = /tmp/flume/finance 
agent1.sinks.sink1.hdfs.filePrefix = events 
agent1.sinks.sink1.hdfs.fileSuffix = .log 
agent1.sinks.sink1.hdfs.inUsePrefix = _ 
agent1.sinks.sink1.hdfs.fileType = DataStream 
 
agent1.channels.channel1.type = file 
 
Step 3 : Run the below command, which will use this configuration file and append data in hdfs. 
Start the flume service: 
 
flume-ng agent --conf /home/cloudera/flumeconf --conf-file /home/cloudera/flumeconf/flume7.conf --name agent1 
 
Step 4 : Open another terminal and create the files in /tmp/spooldir/. 
 
echo "IBM,100,20160104" >> /tmp/spooldir/bb/.bb.txt 
echo "IBM,103,20160105" >> /tmp/spooldir/bb/.bb.txt 
mv /tmp/spooldir/bb/.bb.txt /tmp/spooldir/bb/bb.txt 
 
After a few minutes: 
 
echo "IBM,100.2,20160104" >> /tmp/spooldir/dr/.dr.txt 
echo "IBM,103.1,20160105" >> /tmp/spooldir/dr/.dr.txt 
mv /tmp/spooldir/dr/.dr.txt /tmp/spooldir/dr/dr.txt 
 

Problem Scenario 3 :
 
You have been given log generating service as below. 
Start_logs (It will generate continuous logs) 
Tail_logs (You can check , what logs are being generated) 
Stop_logs (It will stop the log service) 
Path where logs are generated using above service : /opt/gen_logs/logs/access.log 
Now write a flume configuration file named flume3.conf that dumps these 
logs into the HDFS file system in a directory called flume3/%Y/%m/%d/%H/%M 
(meaning a new directory should be created every minute). Please use an interceptor to 
provide timestamp information if the message header does not already have it. 
Also note that you have to preserve the existing timestamp if the message contains one. The Flume 
channel should have the following properties as well: it should be committed after every 100 
messages, it should use a non-durable/faster channel, and it should be able to hold a maximum of 1000 
events. 
 
Solution : 
 
Step 1 : Create a flume configuration file with the below configuration for source, sink and channel. 
 
# Define source, sink, channel and agent. 
agent1.sources = source1 
agent1.sinks = sink1 
agent1.channels = channel1 
 
# Describe/configure source1 
agent1.sources.source1.type = exec 
agent1.sources.source1.command = tail -F /opt/gen_logs/logs/access.log 
 
# Define interceptors 
agent1.sources.source1.interceptors = i1 
agent1.sources.source1.interceptors.i1.type = timestamp 
agent1.sources.source1.interceptors.i1.preserveExisting = true 
 
# Describe sink1 
agent1.sinks.sink1.type = hdfs 
agent1.sinks.sink1.hdfs.path = flume3/%Y/%m/%d/%H/%M 
agent1.sinks.sink1.hdfs.fileType = DataStream 
 
# Now we need to define channel1 properties. 
agent1.channels.channel1.type = memory 
agent1.channels.channel1.capacity = 1000 
agent1.channels.channel1.transactionCapacity = 100 
 
# Bind the source and sink to the channel 
agent1.sources.source1.channels = channel1 
agent1.sinks.sink1.channel = channel1 
 
Step 2 : Run the below command, which will use this configuration file and append data in hdfs. 
Start the log service: start_logs 
Start the flume service: 
 
flume-ng agent --conf /home/cloudera/flumeconf --conf-file /home/cloudera/flumeconf/flume3.conf -Dflume.root.logger=DEBUG,INFO,console --name agent1 
 
Wait for a few minutes and then stop the log service: stop_logs 

Spark

Problem Scenario 1 :
 
You have been given below code snippet. 
val a = sc.parallelize(1 to 9, 3) 
operation1 
Write a correct code snippet for operation1 which will produce the desired output, shown below. 
Array[(String, Seq[Int])] = Array((even,ArrayBuffer(2, 4, 6, 8)), (odd,ArrayBuffer(1, 3, 5, 7, 9))) 
 
 
Solution : a.groupBy(x => {if (x % 2 == 0) "even" else "odd" }).collect 
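
An equivalent sketch (an assumption on my part, reusing the same a already defined in the spark-shell session) that uses keyBy plus groupByKey instead of groupBy: 
 
// Key each number by its parity, then group the values per key. 
a.keyBy(x => if (x % 2 == 0) "even" else "odd").groupByKey.collect 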
 
Problem Scenario 2:
 
You have been given a file named spark6/user.csv. 
Data is given below: 
user.csv 
id,topic,hits 
Rahul,scala,120 
Nikita,spark,80 
Mithun,spark,1 
myself,cca175,180 
Now write Spark code in Scala which will remove the header and create an RDD of 
values as below, for all rows. Also, if the id is "myself" then filter out that row. 
Map(id -> Rahul, topic -> scala, hits -> 120) 
 
Solution : 
 
Step 1 : Create the file in hdfs (we will do it using Hue). However, you can first create it in the 
local filesystem and then upload it to hdfs. 
 
Step 2 : Load the user.csv file from hdfs. 
val csv = sc.textFile("spark6/user.csv") 
 
Step 3 : Split and clean the data. 
val headerAndRows = csv.map(line => line.split(",").map(_.trim)) 
 
Step 4 : Get the header row. 
val header = headerAndRows.first 
 
Step 5 : Filter out the header (we check whether the first value matches the first header name). 
val data = headerAndRows.filter(_(0) != header(0)) 
 
Step 6 : Zip each row with the header to build header/value maps. 
val maps = data.map(splits => header.zip(splits).toMap) 
 
Step 7 : Filter out the user "myself". 
val result = maps.filter(map => map("id") != "myself") 
 
Step 8 : Save the output as a text file. 
result.saveAsTextFile("spark6/result.txt") 

Problem Scenario 3 :
 
You have been given data in json format as below. 
{"first_name":"Ankit", "last_name":"Jain"} 
{"first_name":"Amir", "last_name":"Khan"} 
{"first_name":"Rajesh", "last_name":"Khanna"} 
{"first_name":"Priynka", "last_name":"Chopra"} 
{"first_name":"Kareena", "last_name":"Kapoor"} 
{"first_name":"Lokesh", "last_name":"Yadav"} 
Do the following activities. 
1. Create employee.json file locally. 
2. Load this file into hdfs. 
3. Register this data as a temp table in Spark using Python. 
4. Write a select query and print this data. 
5. Now save back this selected data in json format. 
 
Solution : 
 
Step 1 : Create employee.json locally. vi employee.json (press insert) and paste the content. 
 
Step 2 : Upload this file to hdfs (default location). hadoop fs -put employee.json 
 
val employee = sqlContext.read.json("/user/cloudera/employee.json") 
employee.write.parquet("employee.parquet") 
val parq_data = sqlContext.read.parquet("employee.parquet") 
parq_data.registerTempTable("employee") 
val allemployee = sqlContext.sql("SELECT * FROM employee") 
allemployee.show() 
import org.apache.spark.sql.SaveMode 
// Alternatively, a DataFrame can be saved as an ORC table, e.g. employee.write.format("orc").saveAsTable("product_orc_table") 
 
//Change the codec. 
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy") 
employee.write.mode(SaveMode.Overwrite).parquet("employee.parquet") 
 

Problem Scenario 4 :
 
You have been given a file as below. 
spark75/file1.txt 
The file contains some text, as given below: 
Apache Hadoop is an open-source software framework written in Java for distributed 
storage and distributed processing of very large data sets on computer clusters built from 
commodity hardware. All the modules in Hadoop are designed with a fundamental 
assumption that hardware failures are common and should be automatically handled by the 
framework 
The core of Apache Hadoop consists of a storage part known as Hadoop Distributed File 
System (HDFS) and a processing part called MapReduce. Hadoop splits files into large 
blocks and distributes them across nodes in a cluster. To process data, Hadoop transfers 
packaged code for nodes to process in parallel based on the data that needs to be 
processed. 
This approach takes advantage of data locality, with nodes manipulating the data they have 
access to to allow the dataset to be processed faster and more efficiently than it would be 
in a more conventional supercomputer architecture that relies on a parallel file system 
where computation and data are distributed via high-speed networking 
For a slightly more complicated task, let's look into splitting up sentences from our 
documents into word bigrams. A bigram is a pair of successive tokens in some sequence. 
We will look at building bigrams from the sequences of words in each sentence, and then 
try to find the most frequently occurring ones. 
The first problem is that values in each partition of our initial RDD describe lines from the 
file rather than sentences. Sentences may be split over multiple lines. The glom() RDD 
method is used to create a single entry for each document containing the list of all lines, we 
can then join the lines up, then resplit them into sentences using "." as the separator, using 
flatMap so that every object in our RDD is now a sentence. 
A bigram is a pair of successive tokens in some sequence. Please build bigrams from the 
sequences of words in each sentence. 
 
 
Step 1 : Create the file in hdfs (we will do it using Hue). However, you can first create it in the 
local filesystem and then upload it to hdfs. 
 
Step 2 : The first problem is that values in each partition of our initial RDD describe lines from the 
file rather than sentences. Sentences may be split over multiple lines. The glom() RDD method is 
used to create a single entry for each document containing the list of all lines; we can then join 
the lines up, then resplit them into sentences using "." as the separator, using flatMap so that 
every object in our RDD is now a sentence. 
 
sentences = sc.textFile("spark75/file1.txt") \
    .glom() \
    .map(lambda x: " ".join(x)) \
    .flatMap(lambda x: x.split(".")) 
 
Step 3 : Now that we have isolated each sentence we can split it into a list of words and extract the 
word bigrams from it. Our new RDD contains tuples containing the word bigram (itself a tuple 
containing the first and second word) as the first value and the number 1 as the second value. 
 
bigrams = sentences.map(lambda x: x.split()) \
    .flatMap(lambda x: [((x[i], x[i+1]), 1) for i in range(0, len(x)-1)]) 
 
Step 4 : Finally we can apply the same reduceByKey and sort steps that we used in the wordcount 
example, to count up the bigrams and sort them in order of descending frequency. In reduceByKey 
the key is not an individual word but a bigram. 
 
freq_bigrams = bigrams.reduceByKey(lambda x, y: x + y) \
    .map(lambda x: (x[1], x[0])) \
    .sortByKey(False) 
freq_bigrams.take(10) 
 

Problem Scenario 5:
 
You have been given below code snippet. 
val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3) 
val b = a.keyBy(_.length) 
val c = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3) 
val d = c.keyBy(_.length) 
operation1 
Write a correct code snippet for operation1 which will produce the desired output, shown below. 
Array[(Int, (Option[String], String))] = Array((6,(Some(salmon),salmon)), 
(6,(Some(salmon),rabbit)), (6,(Some(salmon),turkey)), (6,(Some(salmon),salmon)), 
(6,(Some(salmon),rabbit)), (6,(Some(salmon),turkey)), (3,(Some(dog),dog)), 
(3,(Some(dog),cat)), (3,(Some(dog),gnu)), (3,(Some(dog),bee)), (3,(Some(rat),dog)), 
(3,(Some(rat),cat)), (3,(Some(rat),gnu)), (3,(Some(rat),bee)), (4,(None,wolf)), 
(4,(None,bear))) 
 
 
Solution : b.rightOuterJoin(d).collect 
 
rightOuterJoin [Pair] : Performs a right outer join using two key-value RDDs. Please note that the 
keys must be generally comparable to make this work correctly. 
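
For contrast, a minimal sketch (an assumption on my part that the same b and d are still defined in the spark-shell session) of the left outer join, where the Option values appear on the right-hand side of each pair instead: 
 
// Left outer join keeps every key from b; keys missing in d pair with None. 
b.leftOuterJoin(d).collect 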
 
Problem Scenario 6 :
 
You have been given below code snippet. 
val a = sc.parallelize(1 to 100, 3) 
operation1 
Write a correct code snippet for operation1 which will produce the desired output, shown below. 
Array[Array[Int]] = Array(Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 
21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33), 
Array(34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 
56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66), 
Array(67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 
89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100)) 
 
Solution : a.glom.collect 
 
glom : Assembles an array that contains all elements of the partition and embeds it in an RDD. 
Each returned array contains the contents of one partition. 
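
A smaller sketch of the same idea (assuming a spark-shell session where sc is available), so the partition boundaries are easy to see: 
 
// Three partitions of 1 to 9 become three arrays. 
sc.parallelize(1 to 9, 3).glom.collect 
// Array(Array(1, 2, 3), Array(4, 5, 6), Array(7, 8, 9)) 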
 

Problem Scenario 7 :
 
You have been given the following two files. 
1. Content.txt: a huge text file containing space-separated words. 
2. Remove.txt: ignore/filter out all the words given in this file (comma separated). 
Write a Spark program which reads the Content.txt file and loads it as an RDD, removes all the 
words given in a broadcast variable (which is loaded from the words in Remove.txt), 
counts the occurrences of each word, and saves the result as a text file in HDFS. 
Content.txt 
Hello this is ABCTech.com 
This is TechABY.com 
Apache Spark Training 
This is Spark Learning Session 
Spark is faster than MapReduce 
Remove.txt 
Hello, is, this, the 
 
Step 1 : Create both files in hdfs in a directory called spark2 (we will do it using Hue). However, 
you can first create them in the local filesystem and then upload them to hdfs. 
 
Step 2 : Load the Content.txt file. 
val content = sc.textFile("spark2/Content.txt") //Load the text file 
 
Step 3 : Load the Remove.txt file. 
val remove = sc.textFile("spark2/Remove.txt") //Load the text file 
 
Step 4 : Create an RDD of the words to remove. Each word could have trailing spaces, so trim 
those whitespaces as well. We use flatMap, map and trim here. 
val removeRDD = remove.flatMap(x => x.split(",")).map(word => word.trim) //Create an RDD of words 
 
Step 5 : Broadcast the words which you want to ignore. 
val bRemove = sc.broadcast(removeRDD.collect().toList) // It should be a list of Strings 
 
Step 6 : Split the content RDD, so we have an RDD of words. 
val words = content.flatMap(line => line.split(" ")) 
 
Step 7 : Filter the RDD, so it keeps only the words which are not present in the broadcast variable. 
val filtered = words.filter{case (word) => !bRemove.value.contains(word)} 
 
Step 8 : Create a PairRDD of (word, 1) tuples. 
val pairRDD = filtered.map(word => (word, 1)) 
 
Step 9 : Now do the word count on the PairRDD. 
val wordCount = pairRDD.reduceByKey(_ + _) 
 
Step 10 : Save the output as a text file. 
wordCount.saveAsTextFile("spark2/result.txt") 
 
 

Problem Scenario 8 :
 
Write a Spark script using Python 
which reads a file "Content.txt" (on hdfs) with the following content. 
After that, split each row as (key, value), where the key is the first word in the line and the entire 
line is the value. 
Filter out the empty lines. 
And save these key/value pairs in "problem86" as a sequence file (on hdfs). 
Part 2 : Save as a sequence file where the key is null and the entire line is the value. Read back the 
stored sequence files. 
Content.txt 
Hello this is ABCTECH.com 
This is XYZTECH.com 
Apache Spark Training 
This is Spark Learning Session 
Spark is faster than MapReduce 
 
 
Solution : 
 
Step 1 : Import SparkContext and SparkConf. 
from pyspark import SparkContext, SparkConf 
 
Step 2 : Load data from hdfs. 
contentRDD = sc.textFile("Content.txt") 
 
Step 3 : Filter out empty lines. 
nonempty_lines = contentRDD.filter(lambda x: len(x) > 0) 
 
Step 4 : Split each line on the first space (remember: it is mandatory to convert it into a tuple). 
words = nonempty_lines.map(lambda x: tuple(x.split(' ', 1))) 
words.saveAsSequenceFile("problem86") 
 
Step 5 : Check the contents of the directory problem86. 
hdfs dfs -cat problem86/part* 
 
Step 6 : Create key/value pairs where the key is null. 
nonempty_lines.map(lambda line: (None, line)).saveAsSequenceFile("problem86_1") 
 
Step 7 : Read back the sequence file data using Spark. 
seqRDD = sc.sequenceFile("problem86_1") 
 
Step 8 : Print the content to validate it. 
for line in seqRDD.collect(): print(line) 
 
Problem Scenario 9:
 
You have been given 2 files , with the content as given Below 
(spark12/technology.txt) 
(spark12/salary.txt) 
(spark12/technology.txt) 
first,last,technology 
Amit,Jain,java 
Lokesh,kumar,unix 
Mithun,kale,spark 
Rajni,vekat,hadoop 
Rahul,Yadav,scala 
(spark12/salary.txt) 
first,last,salary 
Amit,Jain,100000 
Lokesh,kumar,95000 
Mithun,kale,150000 
Rajni,vekat,154000 
Rahul,Yadav,120000 
Write a Spark program which will join the data based on first and last name and save the 
joined results in the following format: first, last, technology, salary. 
 
 
Solution : 
 
Step 1 : Create the two files in hdfs using Hue. 
 
Step 2 : Load each file as an RDD. 
val technology = sc.textFile("spark12/technology.txt").map(e => e.split(",")) 
val salary = sc.textFile("spark12/salary.txt").map(e => e.split(",")) 
 
Step 3 : Now create key/value pairs from the data and join them. 
val joined = technology.map(e => ((e(0), e(1)), e(2))).join(salary.map(e => ((e(0), e(1)), e(2)))) 
 
Step 4 : Save the results in a text file as below. 
joined.repartition(1).saveAsTextFile("spark12/multiColumnJoined.txt") 
 

Problem Scenario 10 :
 
You have been given below code snippet. 
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle")) 
val b = a.map(x => (x.length, x)) 
operation1 
Write a correct code snippet for operation1 which will produce the desired output, shown below. 
Array[(Int, String)] = Array((4,lion), (7,panther), (3,dogcat), (5,tigereagle)) 
 
Solution : b.foldByKey("")(_ + _).collect 
 
foldByKey [Pair] : Very similar to fold, but performs the folding separately for each key of the RDD. 
This function is only available if the RDD consists of two-component tuples. 
 
Listing Variants 
def foldByKey(zeroValue: V)(func: (V, V) => V): RDD[(K, V)] 
def foldByKey(zeroValue: V, numPartitions: Int)(func: (V, V) => V): RDD[(K, V)] 
def foldByKey(zeroValue: V, partitioner: Partitioner)(func: (V, V) => V): RDD[(K, V)] 
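
A minimal numeric sketch of the same operator (assuming a spark-shell session where sc is available), folding with 0 and addition instead of string concatenation: 
 
// Sum the values per key; the zero value 0 is the neutral element for +. 
sc.parallelize(List(("a", 1), ("a", 2), ("b", 3))).foldByKey(0)(_ + _).collect 
// e.g. Array((a,3), (b,3)) 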
 

Problem Scenario 11:


 
You have been given below code snippet. 
val b = sc.parallelize(List(1,2,3,4,5,6,7,8,2,4,2,1,1,1,1,1)) 
Operation_xyz 
Write a correct code snippet for Operation_xyz which will produce the below output. 
scala.collection.Map[Int,Long] = Map(5 -> 1, 8 -> 1, 3 -> 1, 6 -> 1, 1 -> 6, 2 -> 3, 4 -> 2, 7 -> 1) 
 
 
Solution : b.countByValue 
 
countByValue : Returns a map that contains all unique values of the RDD and their respective 
occurrence counts. (Warning: this operation will finally aggregate the information in a single 
reducer.) 
 
Listing Variants 
def countByValue(): Map[T, Long] 
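
A tiny sketch (assuming a spark-shell session where sc is available): 
 
// Count how often each distinct value occurs. 
sc.parallelize(List("a", "b", "a")).countByValue 
// Map(a -> 2, b -> 1) 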
 

Problem Scenario 12 :
 
You have been given below code snippet. 
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "spider", "eagle"), 2) val b = 
a.keyBy(_.length) 
operation1 
Write a correct code snippet for operation1 which will produce the desired output, shown below. 
Array[(Int, Seq[String])] = Array((4,ArrayBuffer(lion)), (6,ArrayBuffer(spider)), 
(3,ArrayBuffer(dog, cat)), (5,ArrayBuffer(tiger, eagle))) 
 
Solution : b.groupByKey.collect 
 
groupByKey [Pair] : Very similar to groupBy, but instead of supplying a function, the key component 
of each pair will automatically be presented to the partitioner. 
 
Listing Variants 
def groupByKey(): RDD[(K, Iterable[V])] 
def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])] 
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] 
 
Problem Scenario 13 :
 
You have been given below code snippet. 
val a = sc.parallelize(1 to 10, 3) 
operation1 
b.collect 
Output 1 
Array[Int] = Array(2, 4, 6, 8, 10) 
operation2 
Output 2 
Array[Int] = Array(1, 2, 3) 
Write a correct code snippet for operation1 and operation2 which will produce desired 
output, shown above. 
 
Solution : 
operation1 : val b = a.filter(_ % 2 == 0) 
operation2 : a.filter(_ < 4).collect 
 
filter : Evaluates a boolean function for each data item of the RDD and puts the items for which the 
function returned true into the resulting RDD. When you provide a filter function, it must be able to 
handle all data items contained in the RDD. Scala provides so-called partial functions to deal with 
mixed data types. (Tip: partial functions are very useful if you have some data which may be bad and 
you do not want to handle, but for the good (matching) data you want to apply some kind of map 
function. The following article is good. It teaches you about partial functions in a very nice way and 
explains why case has to be used for partial functions: article) 
 
Examples for mixed data without partial functions: 
val b = sc.parallelize(1 to 8) 
b.filter(_ < 4).collect 
res15: Array[Int] = Array(1, 2, 3) 
val a = sc.parallelize(List("cat", "horse", 4.0, 3.5, 2, "dog")) 
a.filter(_ < 4).collect 
error: value < is not a member of Any 
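
A hedged sketch of the partial-function approach mentioned above (reusing the mixed-type list from the example), using RDD.collect with a partial function so only the matching items are kept: 
 
// Only numeric items below 4 match the partial function; the strings are simply skipped. 
val mixed = sc.parallelize(List("cat", "horse", 4.0, 3.5, 2, "dog")) 
mixed.collect { 
  case d: Double if d < 4 => d 
  case i: Int if i < 4 => i.toDouble 
}.collect 
// e.g. Array(3.5, 2.0) 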
 

Problem Scenario 14 :
 
You have been given below code snippet. 
val a = sc.parallelize(List("dog", "cat", "owl", "gnu", "ant"), 2) 
val b = sc.parallelize(1 to a.count.toInt, 2) 
val c = a.zip(b) 
operation1 
Write a correct code snippet for operation1 which will produce the desired output, shown below. 
Array[(String, Int)] = Array((owl,3), (gnu,4), (dog,1), (cat,2), (ant,5)) 
 
Solution : c.sortByKey(false).collect 
 
sortByKey [Ordered] : This function sorts the input RDD's data and stores it in a new RDD. The 
output RDD is a shuffled RDD because it stores data that is output by a reducer which has been 
shuffled. The implementation of this function is actually very clever. First, it uses a range partitioner 
to partition the data in ranges within the shuffled RDD. Then it sorts these ranges individually with 
mapPartitions using standard sort mechanisms. 
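
For comparison, a minimal sketch (assuming the same c from the problem is still defined) of the ascending sort: 
 
// Ascending (the default) sorts the string keys alphabetically. 
c.sortByKey(true).collect 
// Array((ant,5), (cat,2), (dog,1), (gnu,4), (owl,3)) 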
 

Problem Scenario 15 :
 
You have been given below code snippet. 
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2) 
val b = a.map(x => (x.length, x)) 
operation1 
Write a correct code snippet for operation1 which will produce the desired output, shown below. 
Array[(Int, String)] = Array((4,lion), (3,dogcat), (7,panther), (5,tigereagle)) 
 
Solution : b.reduceByKey(_ + _).collect 
 
reduceByKey [Pair] : This function provides the well-known reduce functionality in Spark. Please 
note that any function f you provide should be commutative in order to generate reproducible 
results. 
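
A hedged sketch of why this matters (assuming a spark-shell session where sc is available): with a non-commutative, non-associative function such as subtraction the result depends on partitioning and merge order, so it is not reproducible. 
 
// Addition is commutative and associative, so this always yields ("k", 10). 
sc.parallelize(List(("k", 5), ("k", 3), ("k", 2)), 3).reduceByKey(_ + _).collect 
// With subtraction the result depends on how partial results are merged, so avoid it here. 
sc.parallelize(List(("k", 5), ("k", 3), ("k", 2)), 3).reduceByKey(_ - _).collect 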
 

Problem Scenario 16 :
 
You have been given below patient data in csv format, 
patientID,name,dateOfBirth,lastVisitDate 
1001,Ah Teck,1991-12-31,2012-01-20 
1002,Kumar,2011-10-29,2012-09-20 
1003,Ali,2011-01-30,2012-10-21 
Accomplish the following activities. 
1. Find all the patients whose lastVisitDate is between the current time and '2012-09-15'. 
2. Find all the patients who were born in 2011. 
3. Find the age of all patients. 
4. List patients whose last visit was more than 60 days ago. 
5. Select patients 18 years old or younger. 
 
 
Solution : 
 
Step 1 : Put the data file into hdfs. 
hdfs dfs -mkdir sparksql3 
hdfs dfs -put patients.csv sparksql3/ 
 
Step 2 : Now in the spark shell. 
 
// SQLContext entry point for working with structured data 
val sqlContext = new org.apache.spark.sql.SQLContext(sc) 
// this is used to implicitly convert an RDD to a DataFrame. 
import sqlContext.implicits._ 
// Import Spark SQL data types and Row. 
import org.apache.spark.sql._ 
 
// load the data into a new RDD 
val patients = sc.textFile("sparksql3/patients.csv") 
// Return the first element in this RDD 
patients.first() 
 
// define the schema using a case class 
case class Patient(patientid: Integer, name: String, dateOfBirth: String, lastVisitDate: String) 
 
// create an RDD of Patient objects 
val patRDD = patients.map(_.split(",")).map(p => Patient(p(0).toInt, p(1), p(2), p(3))) 
patRDD.first() 
patRDD.count() 
 
// change the RDD of Patient objects to a DataFrame 
val patDF = patRDD.toDF() 
// register the DataFrame as a temp table 
patDF.registerTempTable("patients") 
 
// Select data from the table 
val results = sqlContext.sql("SELECT * FROM patients") 
// display the dataframe in a tabular format 
results.show() 
 
// Find all the patients whose lastVisitDate is between the current time and '2012-09-15' 
val results = sqlContext.sql("SELECT * FROM patients WHERE TO_DATE(CAST(UNIX_TIMESTAMP(lastVisitDate, 'yyyy-MM-dd') AS TIMESTAMP)) BETWEEN '2012-09-15' AND current_timestamp() ORDER BY lastVisitDate") 
results.show() 
 
// Find all the patients who were born in 2011 
val results = sqlContext.sql("SELECT * FROM patients WHERE YEAR(TO_DATE(CAST(UNIX_TIMESTAMP(dateOfBirth, 'yyyy-MM-dd') AS TIMESTAMP))) = 2011") 
results.show() 
 
// Find the age of all patients 
val results = sqlContext.sql("SELECT name, dateOfBirth, datediff(current_date(), TO_DATE(CAST(UNIX_TIMESTAMP(dateOfBirth, 'yyyy-MM-dd') AS TIMESTAMP)))/365 AS age FROM patients") 
results.show() 
 
// List patients whose last visit was more than 60 days ago 
val results = sqlContext.sql("SELECT name, lastVisitDate FROM patients WHERE datediff(current_date(), TO_DATE(CAST(UNIX_TIMESTAMP(lastVisitDate, 'yyyy-MM-dd') AS TIMESTAMP))) > 60") 
results.show() 
 
// Select patients 18 years old or younger (in MySQL this would use DATE_SUB(current_date(), INTERVAL 18 YEAR)) 
val results = sqlContext.sql("SELECT * FROM patients WHERE TO_DATE(CAST(UNIX_TIMESTAMP(dateOfBirth, 'yyyy-MM-dd') AS TIMESTAMP)) > DATE_SUB(current_date(), 18*365)") 
results.show() 
val results = sqlContext.sql("SELECT DATE_SUB(current_date(), 18*365) FROM patients") 
results.show() 
 
 

Problem Scenario 17 :
 
You have been given below Python code snippet, with intermediate 
output. 
We want to take a list of records about people and then we want to sum up their ages and 
count them. 
So for this example the type in the RDD will be a Dictionary in the format of {name: NAME, 
age:AGE, gender:GENDER}. 
The result type will be a tuple that looks like so (Sum of Ages, Count) 
people = [] 
people.append({'name':'Amit', 'age':45,'gender':'M'}) 
people.append({'name':'Ganga', 'age':43,'gender':'F'}) 
people.append({'name':'John', 'age':28,'gender':'M'}) 
people.append({'name':'Lolita', 'age':33,'gender':'F'}) 
people.append({'name':'Dont Know', 'age':18,'gender':'T'}) 
peopleRdd=sc.parallelize(people) //Create an RDD 
peopleRdd.aggregate((0,0), seqOp, combOp) //Output of the above line : (167, 5) 
Now define the two operations seqOp and combOp such that 
seqOp : sums the ages of all people as well as counts them, in each partition. 
combOp : combines the results from all partitions. 
 
Solution : 
seqOp = (lambda x, y: (x[0] + y['age'], x[1] + 1)) 
combOp = (lambda x, y: (x[0] + y[0], x[1] + y[1])) 
 

Problem Scenario 18 :
 
You have been given three files as below. 
spark3/sparkdir1/file1.txt 
spark3/sparkdir2/file2.txt 
spark3/sparkdir3/file3.txt 
Each file contain some text. 
spark3/sparkdir1/file1.txt 
Apache Hadoop is an open-source software framework written in Java for distributed 
storage and distributed processing of very large data sets on computer clusters built from 
commodity hardware. All the modules in Hadoop are designed with a fundamental 
assumption that hardware failures are common and should be automatically handled by the 
framework 
spark3/sparkdir2/file2.txt 
The core of Apache Hadoop consists of a storage part known as Hadoop Distributed File 
System (HDFS) and a processing part called MapReduce. Hadoop splits files into large 
blocks and distributes them across nodes in a cluster. To process data, Hadoop transfers 
packaged code for nodes to process in parallel based on the data that needs to be 
processed. 
spark3/sparkdir3/file3.txt 
This approach takes advantage of data locality, with nodes manipulating the data they have 
access to to allow the dataset to be processed faster and more efficiently than it would be 
in a more conventional supercomputer architecture that relies on a parallel file system 
where computation and data are distributed via high-speed networking 
Now write Spark code in Scala which will load all these three files from hdfs and do the 
word count, filtering out the following words. The result should be sorted by word count in 
reverse order. 
Filter words: ("a","the","an", "as", "a","with","this","these","is","are","in", "for", 
"to","and","The","of") 
Also please make sure you load all three files as a single RDD (all three files must be 
loaded using a single API call). 
You have also been given the following codec: 
import org.apache.hadoop.io.compress.GzipCodec 
Please use the above codec to compress the file while saving it in hdfs. 
 
Solution : 
 
Step 1 : Create all three files in hdfs (we will do it using Hue). However, you can first create them 
in the local filesystem and then upload them to hdfs. 
 
Step 2 : Load the content from all three files in a single API call. 
val content = sc.textFile("spark3/sparkdir1/file1.txt,spark3/sparkdir2/file2.txt,spark3/sparkdir3/file3.txt") //Load the text files 
 
Step 3 : Now split each line and create an RDD of words. 
val flatContent = content.flatMap(word => word.split(" ")) 
 
Step 4 : Remove the space after each word (trim it). 
val trimmedContent = flatContent.map(word => word.trim) 
 
Step 5 : Create an RDD of all the words that need to be removed. 
val removeRDD = sc.parallelize(List("a","the","an", "as", "a","with","this","these","is","are","in", "for", "to","and","The","of")) 
 
Step 6 : Filter the RDD, so it keeps only the content which is not present in removeRDD. 
val filtered = trimmedContent.subtract(removeRDD) 
 
Step 7 : Create a PairRDD of (word, 1) tuples. 
val pairRDD = filtered.map(word => (word, 1)) 
 
Step 8 : Now do the word count on the PairRDD. 
val wordCount = pairRDD.reduceByKey(_ + _) 
 
Step 9 : Now swap the PairRDD. 
val swapped = wordCount.map(item => item.swap) 
 
Step 10 : Now sort the content in reverse order. 
val sortedOutput = swapped.sortByKey(false) 
 
Step 11 : Save the output as a text file. 
sortedOutput.saveAsTextFile("spark3/result") 
 
Step 12 : Save compressed output. 
import org.apache.hadoop.io.compress.GzipCodec 
sortedOutput.saveAsTextFile("spark3/compressedresult", classOf[GzipCodec]) 
 

Problem Scenario 19 :
 
You have been given MySQL DB with following details. 
user=retail_dba 
password=cloudera 
database=retail_db 
table=retail_db.orders 
table=retail_db.order_items 
jdbc URL = jdbc:mysql://quickstart:3306/retail_db 
Columns of products table : (product_id | product_category_id | product_name | 
product_description | product_price | product_image ) 
Please accomplish following activities. 
1. Copy "retaildb.products" table to hdfs in a directory p93_products 
2. Filter out all the empty prices 
3. Sort all the products based on price in both ascending as well as descending order. 
4. Sort all the products based on price as well as product_id in descending order. 
5. Use the below functions to do data ordering or ranking and fetch top 10 elements top() 
takeOrdered() sortByKey() 
 
Solution : 
 
Step 1 : Import the single table. 
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=products --target-dir=p93_products -m 1 
Note : Please check you don't have a space before or after the '=' sign. Sqoop uses the MapReduce 
framework to copy data from the RDBMS to hdfs. 
 
Step 2 : Read the data from one of the partitions created using the above command. 
hadoop fs -cat p93_products/part-m-00000 
 
Step 3 : Load this directory as an RDD using Spark and Python (open a pyspark terminal and do the following). 
productsRDD = sc.textFile("p93_products") 
 
Step 4 : Filter empty prices, if they exist. 
#filter out empty price lines 
nonempty_lines = productsRDD.filter(lambda x: len(x.split(",")[4]) > 0) 
 
Step 5 : Now sort the data based on product_price in ascending order. 
sortedPriceProducts = nonempty_lines.map(lambda line: (float(line.split(",")[4]), line.split(",")[2])).sortByKey() 
for line in sortedPriceProducts.collect(): print(line) 
 
Step 6 : Now sort the data based on product_price in descending order. 
sortedPriceProducts = nonempty_lines.map(lambda line: (float(line.split(",")[4]), line.split(",")[2])).sortByKey(False) 
for line in sortedPriceProducts.collect(): print(line) 
 
Step 7 : Get the highest priced product's name. 
sortedPriceProducts = nonempty_lines.map(lambda line: (float(line.split(",")[4]), line.split(",")[2])).sortByKey(False).take(1) 
print(sortedPriceProducts) 
 
Step 8 : Now sort the data based on product_price as well as product_id in descending order. 
#Dont forget to cast the string 
#Tuple as key ((price,id),name) 
sortedPriceProducts = nonempty_lines.map(lambda line: ((float(line.split(",")[4]), int(line.split(",")[0])), line.split(",")[2])).sortByKey(False) 
print(sortedPriceProducts.take(10)) 
 
Step 9 : Now sort the data based on product_price as well as product_id in descending order, using the top() function. 
#Dont forget to cast the string 
#Tuple as key ((price,id),name) 
sortedPriceProducts = nonempty_lines.map(lambda line: ((float(line.split(",")[4]), int(line.split(",")[0])), line.split(",")[2])).top(10) 
print(sortedPriceProducts) 
 
Step 10 : Now sort the data based on product_price ascending and product_id ascending, using the takeOrdered() function. 
#Dont forget to cast the string 
#Tuple as key ((price,id),name) 
sortedPriceProducts = nonempty_lines.map(lambda line: ((float(line.split(",")[4]), int(line.split(",")[0])), line.split(",")[2])).takeOrdered(10, lambda tuple: (tuple[0][0], tuple[0][1])) 
 
Step 11 : Now sort the data based on product_price descending and product_id ascending, using the takeOrdered() function. 
#Dont forget to cast the string 
#Tuple as key ((price,id),name) 
#Using a minus(-) on the value helps you get descending ordering, but only for numeric values. 
sortedPriceProducts = nonempty_lines.map(lambda line: ((float(line.split(",")[4]), int(line.split(",")[0])), line.split(",")[2])).takeOrdered(10, lambda tuple: (-tuple[0][0], tuple[0][1])) 
 

Problem Scenario 20 :
 
You have to run your Spark application on yarn, with each executor's 
maximum heap size being 512MB, the number of processor cores to allocate on each 
executor being 1, and your main application requiring three values as input arguments: V1 
V2 V3. 
Please replace XXX, YYY, ZZZ 
./bin/spark-submit --class com.hadoopexam.MyTask --master yarn-cluster --num-executors 3 
--driver-memory 512m XXX YYY lib/hadoopexam.jar ZZZ 
 
Solution : 
XXX : --executor-memory 512m 
YYY : --executor-cores 1 
ZZZ : V1 V2 V3 
 
The full command therefore becomes: 
./bin/spark-submit --class com.hadoopexam.MyTask --master yarn-cluster --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 lib/hadoopexam.jar V1 V2 V3 
 
Notes : spark-submit on yarn options 
--archives : Comma-separated list of archives to be extracted into the working directory of each 
executor. The path must be globally visible inside your cluster; see Advanced Dependency Management. 
--executor-cores : Number of processor cores to allocate on each executor. Alternatively, you can use 
the spark.executor.cores property. 
--executor-memory : Maximum heap size to allocate to each executor. Alternatively, you can use the 
spark.executor.memory property. 
--num-executors : Total number of YARN containers to allocate for this application. Alternatively, you 
can use the spark.executor.instances property. 
--queue : YARN queue to submit to. For more information, see Assigning Applications and Queries to 
Resource Pools. Default: default. 
 

Problem Scenario 21 :
 
You have been given the below code snippet (calculating an average 
score), with intermediate output. 
type ScoreCollector = (Int, Double) 
type PersonScores = (String, (Int, Double)) 
val initialScores = Array(("Fred", 88.0), ("Fred", 95.0), ("Fred", 91.0), ("Wilma", 93.0), 
("Wilma", 95.0), ("Wilma", 98.0)) 
val wilmaAndFredScores = sc.parallelize(initialScores).cache() 
val scores = wilmaAndFredScores.combineByKey(createScoreCombiner, scoreCombiner, 
scoreMerger) 
val averagingFunction = (personScore: PersonScores) => { val (name, (numberScores, 
totalScore)) = personScore; (name, totalScore / numberScores) } 
val averageScores = scores.collectAsMap().map(averagingFunction) 
Expected output: averageScores: scala.collection.Map[String,Double] = Map(Fred -> 
91.33333333333333, Wilma -> 95.33333333333333) 
Define all three required functions, which are inputs for the combineByKey method 
(createScoreCombiner, scoreCombiner, scoreMerger), and help us produce the required 
results. 
 
 
Solution : 
val createScoreCombiner = (score: Double) => (1, score) 
val scoreCombiner = (collector: ScoreCollector, score: Double) => { 
  val (numberScores, totalScore) = collector 
  (numberScores + 1, totalScore + score) 
} 
val scoreMerger = (collector1: ScoreCollector, collector2: ScoreCollector) => { 
  val (numScores1, totalScore1) = collector1 
  val (numScores2, totalScore2) = collector2 
  (numScores1 + numScores2, totalScore1 + totalScore2) 
} 
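
An alternative sketch using aggregateByKey instead of combineByKey (assuming the same wilmaAndFredScores RDD from the problem); the zero value plays the role of createScoreCombiner: 
 
// (count, total) accumulator per key, then the same averaging step. 
val sums = wilmaAndFredScores.aggregateByKey((0, 0.0))( 
  (acc, score) => (acc._1 + 1, acc._2 + score),   // within a partition 
  (a, b) => (a._1 + b._1, a._2 + b._2))           // across partitions 
sums.collectAsMap().map { case (name, (n, total)) => (name, total / n) } 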
 
Problem Scenario 22 :
 
You have been given the below list in Scala (name, sex, cost) for each 
piece of work done. 
List( ("Deeapak" , "male", 4000), ("Deepak" , "male", 2000), ("Deepika" , "female", 
2000),("Deepak" , "female", 2000), ("Deepak" , "male", 1000) , ("Neeta" , "female", 2000)) 
Now write a Spark program to load this list as an RDD and do the sum of cost for each 
combination of name and sex (as the key). 
 
Solution : 
 
Step 1 : Create an RDD out of this list. 
val rdd = sc.parallelize(List( ("Deeapak" , "male", 4000), ("Deepak" , "male", 2000), ("Deepika" , "female", 2000), ("Deepak" , "female", 2000), ("Deepak" , "male", 1000), ("Neeta" , "female", 2000))) 
 
Step 2 : Convert this RDD into a pair RDD. 
val byKey = rdd.map({case (name,sex,cost) => (name,sex)->cost}) 
 
Step 3 : Now group by key. 
val byKeyGrouped = byKey.groupByKey 
 
Step 4 : Now sum the cost for each group. 
val result = byKeyGrouped.map{case ((id1,id2),values) => (id1,id2,values.sum)} 
 
Step 5 : Save the results. 
result.repartition(1).saveAsTextFile("spark12/result.txt") 
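
An alternative sketch that avoids materializing the groups, using reduceByKey on the same rdd (assuming the rdd from Step 1 above is defined): 
 
// Sum per (name, sex) key without building intermediate groups. 
val summed = rdd.map { case (name, sex, cost) => ((name, sex), cost) } 
  .reduceByKey(_ + _) 
  .map { case ((name, sex), total) => (name, sex, total) } 
summed.collect 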
 

Problem Scenario 23 :
 
You have been given below code snippet. 
val au1 = sc.parallelize(List (("a" , Array(1,2)), ("b" , Array(1,2)))) 
val au2 = sc.parallelize(List (("a" , Array(3)), ("b" , Array(2)))) 
Apply the Spark method which will generate the below output. 
Array[(String, Array[Int])] = Array((a,Array(1, 2)), (b,Array(1, 2)), (a,Array(3)), (b,Array(2))) 
 
Solution : au1.union(au2).collect 
 

Problem Scenario 24 :
 
You have been given below code snippet. 
val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"}, 3} 
val b = a.keyBy(_.length) 
val c = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3) 
val d = c.keyBy(_.length) 
operation1 
Write a correct code snippet for operation1 which will produce the desired output, shown below. 
Array[(Int, (String, String))] = Array((6,(salmon,salmon)), (6,(salmon,rabbit)), 
(6,(salmon,turkey)), (6,(salmon,salmon)), (6,(salmon,rabbit)), 
(6,(salmon,turkey)), (3,(dog,dog)), (3,(dog,cat)), (3,(dog,gnu)), (3,(dog,bee)), (3,(rat,dog)), 
(3,(rat,cat)), (3,(rat,gnu)), (3,(rat,bee))) 
 
Solution : b.join(d).collect 
 
join [Pair] : Performs an inner join using two key-value RDDs. Please note that the keys must be 
generally comparable to make this work. 
keyBy : Constructs two-component tuples (key-value pairs) by applying a function to each data item. 
The result of the function becomes the key, and the original data item becomes the value of the 
newly created tuples. 
 
 

HDFS

Problem Scenario 1 :
 
There is a parent organization called "ABC Group Inc", which has two child companies 
named Tech Inc and MPTech. 
Both companies' employee information is given in two separate text files as below. Please do 
the following activities with the employee details. 
Tech Inc.txt 
1,Alok,Hyderabad 
2,Krish,Hongkong 
3,Jyoti,Mumbai 
4,Atul,Banglore 
5,Ishan,Gurgaon 
MPTech.txt 
6,John,Newyork 
7,alp2004,California 
8,tellme,Mumbai 
9,Gagan21,Pune 
10,Mukesh,Chennai 
1. Which command will you use to check all the available command line options on HDFS, 
and how will you get help for an individual command? 
2. Create a new empty directory named Employee using the command line, and also create 
an empty file named Techinc.txt in it. 
3. Load both companies' employee data into the Employee directory (how to override an existing file 
in HDFS). 
4. Merge both employees' data into a single file called MergedEmployee.txt; the merged file 
should have a new line character at the end of each file's content. 
5. Upload the merged file to HDFS and change the file permissions on the HDFS merged file, so 
that the owner and group members can read and write, and other users can read the file. 
6. Write a command to export an individual file as well as an entire directory from HDFS to the 
local file system. 
 
Solution : 
 
Step 1 : Check all available commands: hdfs dfs 
Step 2 : Get help on an individual command: hdfs dfs -help get 
Step 3 : Create a directory in HDFS named Employee and create an empty file in it, e.g. Techinc.txt. 
hdfs dfs -mkdir Employee 
Now create an empty file in the Employee directory using Hue. 
Step 4 : Create a directory on the local file system and then create the two files with the data given 
in the problem. 
Step 5 : Now that we have an existing directory with content in it, override this existing Employee 
directory using the HDFS command line while copying these files from the local file system to HDFS. 
cd /home/cloudera/Desktop/ 
hdfs dfs -put -f Employee 
Step 6 : Check that all files in the directory were copied successfully: hdfs dfs -ls Employee 
Step 7 : Now merge all the files in the Employee directory: hdfs dfs -getmerge -nl Employee MergedEmployee.txt 
Step 8 : Check the content of the file: cat MergedEmployee.txt 
Step 9 : Copy the merged file into the Employee directory from the local file system to HDFS: 
hdfs dfs -put MergedEmployee.txt Employee/ 
Step 10 : Check whether the file was copied: hdfs dfs -ls Employee 
Step 11 : Change the permission of the merged file on HDFS: hdfs dfs -chmod 664 Employee/MergedEmployee.txt 
Step 12 : Get the file from HDFS to the local file system: hdfs dfs -get Employee Employee_hdfs 
 

Sqoop

Problem Scenario 1 :
 
You have been given following mysql database details as well as 
other info. 
user=retail_dba 
password=cloudera 
database=retail_db 
jdbc URL = jdbc:mysql://quickstart:3306/retail_db 
Please accomplish following. 
1. Import departments table in a directory. 
2. Again import the departments table into the same directory (the directory already exists, hence 
it should not override it but append the results). 
3. Also make sure your results' fields are terminated by '|' and lines are terminated by '\n'. 
 
Solution : 
 
Step 1 : Clean the hdfs file system; if these directories exist, clean them out. 
hadoop fs -rm -R departments 
hadoop fs -rm -R categories 
hadoop fs -rm -R products 
hadoop fs -rm -R orders 
hadoop fs -rm -R order_items 
hadoop fs -rm -R customers 
 
Step 2 : Now import the departments table as per the requirement. 
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table departments --target-dir=departments --fields-terminated-by '|' --lines-terminated-by '\n' -m 1 
 
Step 3 : Check the imported data. 
hdfs dfs -ls departments 
hdfs dfs -cat departments/part-m-00000 
 
Step 4 : Now again import the data; it needs to be appended. 
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table departments --target-dir departments --append --fields-terminated-by '|' --lines-terminated-by '\n' -m 1 
 
Step 5 : Again check the results. 
hdfs dfs -ls departments 
hdfs dfs -cat departments/part-m-00001 

Problem Scenario 2:
 
You have been given MySQL DB with following details. 
user=retail_dba 
password=cloudera 
database=retail_db 
table=retail_db.categories 
jdbc URL = jdbc:mysql://quickstart:3306/retail_db 
Please accomplish following activities. 
1. Import data from the categories table, where category_id=22 (data should be stored in 
categories_subset). 
2. Import data from the categories table, where category_id>22 (data should be stored in 
categories_subset_2). 
3. Import data from the categories table, where category_id is between 1 and 22 (data should be 
stored in categories_subset_3). 
4. While importing categories data, change the delimiter to '|' (data should be stored in 
categories_subset_5). 
5. Import data from the categories table and restrict the import to the category_name, category_id 
columns only, with delimiter as '|'. 
6. Add null values in the table using the below SQL statements: ALTER TABLE categories 
modify category_department_id int(11); INSERT INTO categories values 
(60, NULL, 'TESTING'); 
7. Import data from the categories table (into the categories_subset_17 directory) using the '|' 
delimiter and category_id between 1 and 61, and encode null values for both string and non-string 
columns. 
8. Import the entire schema retail_db into a directory categories_subset_all_tables. 
8. Import entire schema retail_db in a directory categories_subset_all_tables 
 
Solution : 
 
Step 1 : Import a single table (subset of the data). Note: the quote used around the column name is the backtick (`). 
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=categories --warehouse-dir=categories_subset --where "\`category_id\` = 22" -m 1 
 
Step 2 : Check the output partition. 
hdfs dfs -cat categories_subset/categories/part-m-00000 
 
Step 3 : Change the selection criteria (subset of the data). 
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=categories --warehouse-dir=categories_subset_2 --where "\`category_id\` > 22" -m 1 
 
Step 4 : Check the output partition. 
hdfs dfs -cat categories_subset_2/categories/part-m-00000 
 
Step 5 : Use a between clause (subset of the data). 
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=categories --warehouse-dir=categories_subset_3 --where "\`category_id\` between 1 and 22" -m 1 
 
Step 6 : Check the output partition. 
hdfs dfs -cat categories_subset_3/categories/part-m-00000 
 
Step 7 : Change the delimiter during import. 
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=categories --warehouse-dir=categories_subset_6 --where "\`category_id\` between 1 and 22" --fields-terminated-by='|' -m 1 
 
Step 8 : Check the output partition. 
hdfs dfs -cat categories_subset_6/categories/part-m-00000 
 
Step 9 : Select a subset of columns. 
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=categories --warehouse-dir=categories_subset_col --where "\`category_id\` between 1 and 22" --fields-terminated-by='|' --columns=category_name,category_id -m 1 
 
Step 10 : Check the output partition. 
hdfs dfs -cat categories_subset_col/categories/part-m-00000 
 
Step 11 : Insert a record with null values (using mysql). 
ALTER TABLE categories modify category_department_id int(11); 
INSERT INTO categories values (60, NULL, 'TESTING'); 
select * from categories; 
 
Step 12 : Encode null string and non-string columns. 
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=categories --warehouse-dir=categories_subset_17 --where "\`category_id\` between 1 and 61" --fields-terminated-by='|' --null-string='N' --null-non-string='N' -m 1 
 
Step 13 : View the content. 
hdfs dfs -cat categories_subset_17/categories/part-m-00000 
 
Step 14 : Import all the tables from the schema (this step will take a little time). 
sqoop import-all-tables --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --warehouse-dir=categories_subset_all_tables 
 
Step 15 : View the contents. 
hdfs dfs -ls categories_subset_all_tables 
 
Step 16 : Clean up, or revert back to the originals. 
delete from categories where category_id in (59, 60); 
ALTER TABLE categories modify category_department_id int(11) NOT NULL; 
ALTER TABLE categories modify category_name varchar(45) NOT NULL; 
desc categories; 
 

Problem Scenario 3 :
 
You have been given following mysql database details as well as 
other info. 
user=retail_dba 
password=cloudera 
database=retail_db 
jdbc URL = jdbc:mysql://quickstart:3306/retail_db 
Now accomplish following activities. 
1. Import the departments table from mysql to hdfs as a text file in the departments_text directory. 
2. Import the departments table from mysql to hdfs as a sequence file in the departments_sequence 
directory. 
3. Import the departments table from mysql to hdfs as an avro file in the departments_avro directory. 
4. Import the departments table from mysql to hdfs as a parquet file in the departments_parquet 
directory. 
 
Solution : 
 
Step 1 : Import the departments table from mysql to hdfs as a text file. 
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table departments --as-textfile --target-dir=departments_text 
Verify the imported data: hdfs dfs -cat departments_text/part* 
 
Step 2 : Import the departments table from mysql to hdfs as a sequence file. 
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table departments --as-sequencefile --target-dir=departments_sequence 
Verify the imported data: hdfs dfs -cat departments_sequence/part* 
 
Step 3 : Import the departments table from mysql to hdfs as an avro file. 
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table departments --as-avrodatafile --target-dir=departments_avro 
Verify the imported data: hdfs dfs -cat departments_avro/part* 
 
Step 4 : Import the departments table from mysql to hdfs as a parquet file. 
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table departments --as-parquetfile --target-dir=departments_parquet 
Verify the imported data: hdfs dfs -cat departments_parquet/part* 

Problem Scenario 4:
 
You have been given MySQL DB with following details. 
user=retail_dba 
password=cloudera 
database=retail_db 
table=retail_db.categories 
jdbc URL = jdbc:mysql://quickstart:3306/retail_db 
Please accomplish following activities. 
1. Connect to the MySQL DB and check the content of the tables. 
2. Copy the "retail_db.categories" table to hdfs, without specifying a directory name. 
3. Copy the "retail_db.categories" table to hdfs, into a directory named "categories_target". 
4. Copy the "retail_db.categories" table to hdfs, into a warehouse directory named 
"categories_warehouse". 
 
Solution : 
 
Step 1 : Connect to the existing MySQL database. 
mysql --user=retail_dba --password=cloudera retail_db 
 
Step 2 : Show all the available tables: show tables; 
 
Step 3 : View/count data from a table in MySQL: select count(1) from categories; 
 
Step 4 : Check the currently available data in the HDFS directory: hdfs dfs -ls 
 
Step 5 : Import a single table (without specifying a directory). 
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=categories 
Note : Please check you don't have a space before or after the '=' sign. Sqoop uses the MapReduce 
framework to copy data from the RDBMS to hdfs. 
 
Step 6 : Read the data from one of the partitions created using the above command. 
hdfs dfs -cat categories/part-m-00000 
 
Step 7 : Specify the target directory in the import command (we are using number of mappers = 1, 
you can change it accordingly). 
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=categories --target-dir=categories_target -m 1 
 
Step 8 : Check the content in one of the partition files. 
hdfs dfs -cat categories_target/part-m-00000 
 
Step 9 : Specify a parent (warehouse) directory so that you can copy more than one table into a 
specified target directory. 
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=categories --warehouse-dir=categories_warehouse -m 1 
 

Problem Scenario 5 :
 
You have been given following mysql database details as well as 
other info. 
user=retail_dba 
password=cloudera 
database=retail_db 
jdbc URL = jdbc:mysql://quickstart:3306/retail_db 
Please accomplish following activities. 
1. In the mysql departments table please insert the following record: Insert into departments 
values(9999, '"Data Science"'); 
2. Now there is a downstream system which will process dumps of this file. However, the 
system is designed in a way that it can process files only if the fields are enclosed in (') single 
quotes, the separator of the fields is (-), and lines are terminated by : (colon). 
3. If the data itself contains the " (double quote), then it should be escaped by \. 
4. Please import the departments table into a directory called departments_enclosedby; the 
file should be processable by the downstream system. 
 
Solution :
Step 1 : Connect to the mysql database and insert the record.
mysql --user=retail_dba --password=cloudera
show databases; use retail_db; show tables;
Insert into departments values(9999, '"Data Science"');
select * from departments;
Step 2 : Import the data as per the requirement.
sqoop import \
--connect jdbc:mysql://quickstart:3306/retail_db \ --username=retail_dba \ --password=cloudera \
--table departments \ --target-dir /user/cloudera/departments_enclosedby \
--enclosed-by \' --escaped-by \\ --fields-terminated-by '-' --lines-terminated-by ':'
Step 3 : Check the result.
hdfs dfs -cat /user/cloudera/departments_enclosedby/part*
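Shell quoting around the delimiter options is the usual stumbling block here. One way to write the full command with explicit quotes is sketched below (assuming the bash shell on the quickstart VM; quoting rules differ in other shells):

sqoop import \
  --connect jdbc:mysql://quickstart:3306/retail_db \
  --username retail_dba \
  --password cloudera \
  --table departments \
  --target-dir /user/cloudera/departments_enclosedby \
  --enclosed-by "'" \
  --escaped-by '\' \
  --fields-terminated-by '-' \
  --lines-terminated-by ':'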
 
 

Problem Scenario 6 :
 
You have been given following mysql database details as well as 
other info. 
user=retail_dba 
password=cloudera 
database=retail_db 
jdbc URL = jdbc:mysql://quickstart:3306/retail_db 
Please accomplish following. 
1. Create a table in retail_db with the following definition.
CREATE table departments_new (department_id int(11), department_name varchar(45),
created_date TIMESTAMP DEFAULT NOW());
2. Now insert records from the departments table into departments_new.
3. Now import data from the departments_new table to hdfs.
4. Insert the following 5 records in the departments_new table.
Insert into departments_new values(110, "Civil" , null);
Insert into departments_new values(111, "Mechanical" , null);
Insert into departments_new values(112, "Automobile" , null);
Insert into departments_new values(113, "Pharma" , null);
Insert into departments_new values(114, "Social Engineering" , null);
5. Now do the incremental import based on created_date column. 
 
Solution :
Step 1 : Log in to the mysql db.
mysql --user=retail_dba --password=cloudera
show databases; use retail_db; show tables;
Step 2 : Create a table as given in the problem statement.
CREATE table departments_new (department_id int(11), department_name varchar(45), created_date TIMESTAMP DEFAULT NOW());
show tables;
Step 3 : Insert records from the departments table into departments_new.
insert into departments_new select a.*, null from departments a;
Step 4 : Import data from the departments_new table to hdfs.
sqoop import \
--connect jdbc:mysql://quickstart:3306/retail_db \ --username=retail_dba \ --password=cloudera \
--table departments_new \ --target-dir /user/cloudera/departments_new \ --split-by department_id
Step 5 : Check the imported data.
hdfs dfs -cat /user/cloudera/departments_new/part*
Step 6 : Insert the following 5 records in the departments_new table.
Insert into departments_new values(110, "Civil" , null);
Insert into departments_new values(111, "Mechanical" , null);
Insert into departments_new values(112, "Automobile" , null);
Insert into departments_new values(113, "Pharma" , null);
Insert into departments_new values(114, "Social Engineering" , null);
commit;
Step 7 : Import the incremental data based on the created_date column.
sqoop import \
--connect jdbc:mysql://quickstart:3306/retail_db \ --username=retail_dba \ --password=cloudera \
--table departments_new \ --target-dir /user/cloudera/departments_new \ --append \
--check-column created_date \ --incremental lastmodified \ --split-by department_id \
--last-value "2016-01-30 12:07:37.0"
Step 8 : Check the imported values.
hdfs dfs -cat /user/cloudera/departments_new/part*
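For repeated incremental loads, a saved sqoop job can track the last imported value automatically instead of passing --last-value by hand each run. A sketch of that approach is below (the job name dept_new_incr is just an example; first-run behaviour around the initial last value is worth verifying on your setup):

# create a reusable incremental-import job; sqoop remembers the last value between runs
sqoop job --create dept_new_incr -- import \
  --connect jdbc:mysql://quickstart:3306/retail_db \
  --username retail_dba --password cloudera \
  --table departments_new \
  --target-dir /user/cloudera/departments_new \
  --append \
  --check-column created_date \
  --incremental lastmodified \
  --split-by department_id
# run the job whenever new rows arrive
sqoop job --exec dept_new_incr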
 

Problem Scenario 7 :
 
You have been given MySQL DB with following details. 
user=retail_dba 
password=cloudera 
database=retail_db 
table=retail_db.products 
jdbc URL = jdbc:mysql://quickstart:3306/retail_db 
Columns of products table : (product_id | product_category_id | product_name | 
product_description | product_price | product_image ) 
Please accomplish following activities. 
1. Copy the "retail_db.products" table to hdfs in a directory p93_products.
2. Now sort the products data by product price per category; use the product_category_id
column to group by category.
Solution :
Step 1 : Import the single table.
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=products --target-dir=p93_products
Note : make sure there is no space before or after the '=' sign. Sqoop uses the MapReduce framework to copy data from the RDBMS to HDFS.
Step 2 : Read the data from one of the partition files created by the above command.
hadoop fs -cat p93_products/part-m-00000
Step 3 : Load this directory as an RDD using Spark and Python (open a pyspark terminal and do the following).
productsRDD = sc.textFile("p93_products")
Step 4 : Filter out empty prices, if any exist.
# filter out lines with an empty price
nonempty_lines = productsRDD.filter(lambda x: len(x.split(",")[4]) > 0)
Step 5 : Create a data set like (categoryId, (id, name, price)).
mappedRDD = nonempty_lines.map(lambda line: (line.split(",")[1], (line.split(",")[0], line.split(",")[2], float(line.split(",")[4]))))
for line in mappedRDD.collect(): print(line)
Step 6 : Now group all records by categoryId, which is the key of mappedRDD; this produces output like (categoryId, iterable of all lines for that key/categoryId).
groupByCategoryId = mappedRDD.groupByKey()
for line in groupByCategoryId.collect(): print(line)
Step 7 : Now sort the data in each category by price in ascending order.
# sorted is a function to sort an iterable; we can also specify the key on which to sort -- in this case the price.
groupByCategoryId.map(lambda tuple: sorted(tuple[1], key=lambda tupleValue: tupleValue[2])).take(5)
Step 8 : Now sort the data in each category by price in descending order.
groupByCategoryId.map(lambda tuple: sorted(tuple[1], key=lambda tupleValue: tupleValue[2], reverse=True)).take(5)
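A slightly cleaner variant of steps 7 and 8 keeps the category id attached to its sorted product list by using mapValues instead of map; a minimal sketch in pyspark, reusing the RDDs built above:

# sort the (id, name, price) tuples inside each category, keeping the category key
sorted_by_cat = groupByCategoryId.mapValues(lambda vals: sorted(vals, key=lambda v: v[2]))
for cat, prods in sorted_by_cat.take(5):
    print(cat, prods)
# descending order: pass reverse=True to sorted
sorted_desc = groupByCategoryId.mapValues(lambda vals: sorted(vals, key=lambda v: v[2], reverse=True))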
 
Problem Scenario 8 :
 
You have been given following mysql database details as well as 
other info. 
user=retail_dba 
password=cloudera 
database=retail_db 
jdbc URL = jdbc:mysql://quickstart:3306/retail_db 
Please accomplish following. 
1. Create a database named hadoopexam and then create a table named departments in
it, with the following fields: department_id int, department_name string.
e.g. the location should be
hdfs://quickstart.cloudera:8020/user/hive/warehouse/hadoopexam.db/departments
2. Please import data from retail_db.departments into the existing hive table
hadoopexam.departments created above.
3. Please import data into a non-existing table, i.e. create the hive table
hadoopexam.departments_new while importing.
 
Solution :
Step 1 : Go to the hive interface and create the database.
hive
create database hadoopexam;
Step 2 : Use the database created in the above step.
use hadoopexam; show tables;
Step 3 : Create the table in it.
create table departments (department_id int, department_name string);
show tables; desc departments; desc formatted departments;
Step 4 : Check that the following directory does not exist, otherwise the import will give an error.
hdfs dfs -ls /user/cloudera/departments
If the directory already exists, make sure it is not needed and then delete it. This is the staging directory where Sqoop stores the intermediate data before pushing it into the hive table.
hadoop fs -rm -R departments
Step 5 : Now import data into the existing table.
sqoop import \
--connect jdbc:mysql://quickstart:3306/retail_db \ --username=retail_dba \ --password=cloudera \
--table departments \ --hive-home /user/hive/warehouse \ --hive-import \ --hive-overwrite \
--hive-table hadoopexam.departments
Step 6 : Check whether the data has been loaded or not.
hive
use hadoopexam; show tables; select * from departments; desc formatted departments;
Step 7 : Import data into a non-existing table in hive, creating the table while importing.
sqoop import \
--connect jdbc:mysql://quickstart:3306/retail_db \ --username=retail_dba \ --password=cloudera \
--table departments \ --hive-home /user/hive/warehouse \ --hive-import \ --hive-overwrite \
--hive-table hadoopexam.departments_new \ --create-hive-table
Step 8 : Check whether the data has been loaded or not.
hive
use hadoopexam; show tables; select * from departments_new; desc formatted departments_new;
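The checks in steps 6 and 8 can also be run without opening an interactive hive session; a minimal sketch:

hive -e "use hadoopexam; show tables; select count(*) from departments; select count(*) from departments_new;"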
 
Problem Scenario 9 :
 
You have been given following mysql database details as well as 
other info. 
user=retail_dba 
password=cloudera 
database=retail_db 
jdbc URL = jdbc:mysql://quickstart:3306/retail_db 
Please accomplish following. 
1. Create a table in retail_db with the following definition.
CREATE table departments_export (department_id int(11), department_name varchar(45),
created_date TIMESTAMP DEFAULT NOW());
2. Now export the data from the following hdfs directory into the departments_export table:
/user/cloudera/departments_new
 
Solution :
Step 1 : Log in to the mysql db.
mysql --user=retail_dba --password=cloudera
show databases; use retail_db; show tables;
Step 2 : Create a table as given in the problem statement.
CREATE table departments_export (department_id int(11), department_name varchar(45), created_date TIMESTAMP DEFAULT NOW());
show tables;
Step 3 : Export data from /user/cloudera/departments_new to the new table departments_export.
sqoop export \
--connect jdbc:mysql://quickstart:3306/retail_db \ --username retail_dba \ --password cloudera \
--table departments_export \ --export-dir /user/cloudera/departments_new \ --batch
Step 4 : Now check that the export was done correctly.
mysql --user=retail_dba --password=cloudera
show databases; use retail_db; show tables;
select * from departments_export;
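The final check in step 4 can also be done in a single command from the shell; a minimal sketch, assuming the quickstart VM MySQL instance:

mysql --user=retail_dba --password=cloudera retail_db \
  -e "select count(*) from departments_export; select * from departments_export limit 5;"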
 
 

Hive

Problem Scenario 1 :
 
You have been given the following product.csv file.
product.csv 
productID,productCode,name,quantity,price 
1001,PEN,Pen Red,5000,1.23 
1002,PEN,Pen Blue,8000,1.25 
1003,PEN,Pen Black,2000,1.25 
1004,PEC,Pencil 2B,10000,0.48 
1005,PEC,Pencil 2H,8000,0.49 
1006,PEC,Pencil HB,0,9999.99 
Now accomplish following activities. 
1. Create a Hive ORC table using SparkSql 
2. Load this data in Hive table. 
3. Create a Hive parquet table using SparkSQL and load data in it. 
 
Solution :
Step 1 : Create this file in HDFS under the following directory (without the header):
/user/cloudera/he/exam/task1/product.csv
Step 2 : Now, using spark-shell, read the file as an RDD.
// load the data into a new RDD
val products = sc.textFile("/user/cloudera/he/exam/task1/product.csv")
// Return the first element in this RDD
products.first()
Step 3 : Now define the schema using a case class.
case class Product(productid: Integer, code: String, name: String, quantity: Integer, price: Float)
Step 4 : Create an RDD of Product objects.
val prdRDD = products.map(_.split(",")).map(p => Product(p(0).toInt, p(1), p(2), p(3).toInt, p(4).toFloat))
prdRDD.first()
prdRDD.count()
Step 5 : Now create a data frame.
val prdDF = prdRDD.toDF()
Step 6 : Now store the data in the hive warehouse directory as ORC (the hive table itself is created in the next step).
import org.apache.spark.sql.SaveMode
prdDF.write.mode(SaveMode.Overwrite).format("orc").saveAsTable("product_orc_table")
Step 7 : Now create a table over the data stored in the warehouse directory, with the help of hive.
hive
show tables;
CREATE EXTERNAL TABLE products (productid int, code string, name string, quantity int, price float) STORED AS orc LOCATION '/user/hive/warehouse/product_orc_table';
Step 8 : Now create a parquet table.
import org.apache.spark.sql.SaveMode
prdDF.write.mode(SaveMode.Overwrite).format("parquet").saveAsTable("product_parquet_table")
Step 9 : Now create a table over this data.
CREATE EXTERNAL TABLE products_parquet (productid int, code string, name string, quantity int, price float) STORED AS parquet LOCATION '/user/hive/warehouse/product_parquet_table';
Step 10 : Check that the data has been loaded.
Select * from products;
Select * from products_parquet;
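Before creating the external tables in hive, the files written by saveAsTable can also be checked directly from spark-shell; a minimal sketch, assuming sqlContext is a HiveContext (the default in the quickstart spark-shell):

// read the warehouse directories back and show a few rows
sqlContext.read.orc("/user/hive/warehouse/product_orc_table").show(3)
sqlContext.read.parquet("/user/hive/warehouse/product_parquet_table").show(3)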
 

Problem Scenario 2 :
 
In Continuation of previous question, please accomplish following 
activities. 
1. Select all the records with quantity >= 5000 and name starting with 'Pen'.
2. Select all the records with quantity >= 5000, price less than 1.24 and name starting with
'Pen'.
3. Select all the records which do not have (quantity >= 5000 and name starting with 'Pen').
4. Select all the products whose name is 'Pen Red' or 'Pen Black'.
5. Select all the products which have price BETWEEN 1.0 AND 2.0 AND quantity
BETWEEN 1000 AND 2000.
 
Solution :
Step 1 : Select all the records with quantity >= 5000 and name starting with 'Pen'.
val results = sqlContext.sql("""SELECT * FROM products WHERE quantity >= 5000 AND name LIKE 'Pen %'""")
results.show()
Step 2 : Select all the records with quantity >= 5000, price less than 1.24 and name starting with 'Pen'.
val results = sqlContext.sql("""SELECT * FROM products WHERE quantity >= 5000 AND price < 1.24 AND name LIKE 'Pen %'""")
results.show()
Step 3 : Select all the records which do not have (quantity >= 5000 and name starting with 'Pen').
val results = sqlContext.sql("""SELECT * FROM products WHERE NOT (quantity >= 5000 AND name LIKE 'Pen %')""")
results.show()
Step 4 : Select all the products whose name is 'Pen Red' or 'Pen Black'.
val results = sqlContext.sql("""SELECT * FROM products WHERE name IN ('Pen Red', 'Pen Black')""")
results.show()
Step 5 : Select all the products which have price BETWEEN 1.0 AND 2.0 AND quantity BETWEEN 1000 AND 2000.
val results = sqlContext.sql("""SELECT * FROM products WHERE (price BETWEEN 1.0 AND 2.0) AND (quantity BETWEEN 1000 AND 2000)""")
results.show()
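The same filters can be expressed with the DataFrame API instead of raw SQL, which some find easier to compose; a minimal sketch for query 1, assuming the products table registered above:

// query 1 via the DataFrame API
val df = sqlContext.table("products")
df.filter("quantity >= 5000 AND name LIKE 'Pen %'").show()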
 
 
 
 
 
 
 
 
 
 
 
 
 
 
