Big Data Spark Manual (MR-22)
(CS)
Course File
Prepared by
G.Varahagiri
Assistant Professor
BIGDATA: SPARK
EXPERIMENT-1:
(ii) Knowing the difference between single-node clusters and multi-node clusters
(iv) Installing and accessing environments such as Hive and Sqoop.
EXPERIMENT-3:
(v) Checking the contents of the file (vi) Copying and moving files
Map-Reduce
EXPERIMENT-5:
EXPERIMENT-6:
EXPERIMENT-7:
(ii) Loading data to external Hive tables from SQL tables (or) structured CSV files using Sqoop
(iii) Performing operations like filtering and updating
Create SQL tables for employees: an Employee table with id and designation, and a Salary table (salary, dept id).
Create external tables in Hive with a similar schema to the above tables, move the data to Hive using Sqoop, and load the contents into the tables. Filter into a new table and write a UDF to encrypt the table with the AES algorithm.
EXPERIMENT-9:
(i) PySpark definition (Apache PySpark) and the difference between PySpark, Scala, and pandas
(i) get(filename)
EXPERIMENT-10:
PySpark - RDD
EXPERIMENT-11:
(ii) To remove the words which are not necessary for analyzing this text.
(iii) groupBy
(iv) What if we want to calculate how many times each word is coming in corpus?
(ii) Using SparkConf, create a SparkSession to read details from a CSV into a DataFrame, and later move that CSV to another location
There are four main layers in a Big Data architecture:
1. Data Ingestion
This layer is responsible for collecting and storing data from various sources. In Big Data, data ingestion is the process of extracting data from various sources and loading it into a data repository. Data ingestion is a key component of a Big Data architecture because it determines how data will be ingested, transformed, and stored.
2. Data Processing
Data processing is the second layer, responsible for collecting, cleaning, and preparing the data for analysis. This layer is critical for ensuring that the data is of high quality and ready to be used.
3. Data Storage
Data storage is the third layer, responsible for storing the data in a format that can be easily
accessed and analyzed. This layer is essential for ensuring that the data is accessible and
available to the other layers.
4. Data Visualization
Data visualization is the fourth layer and is responsible for creating visualizations of the data
that humans can easily understand. This layer is important for making the data accessible.
When we explain traditional and big data analytics architecture reference models, we must
remember that the architecture process plays an important role in Big Data.
1. Connectors and Adapters
Connectors and adapters can quickly connect to any storage system, protocol, or network, and work with any data format.
2. Data Governance
From the time data is ingested through processing, analysis, storage, and deletion, there
are protections for privacy and security.
3. Managing Systems
A few processes are essential to the architecture of Big Data. First, data must be collected
from various sources. This data must then be processed to ensure its quality and accuracy.
After this, the data must be stored securely and reliably. Finally, the data must be made
accessible to those who need it.
Designing a Big Data Hadoop reference architecture, while complex, follows the same general procedure:
What do you hope to achieve with your Big Data architecture? Do you want to improve
decision-making, better understand your customers, or find new revenue opportunities?
Once you know what you want to accomplish, you can start planning your architecture.
What data do you have, and where does it come from? You'll need to think about both
structured and unstructured data and internal and external sources.
Many different Big Data technologies are available, so it's important to select the ones that
best meet your needs.
As your data grows, your Big Data solution architecture will need to be able to scale to
accommodate it. This means considering things like data replication and partitioning.
Make sure you have a plan to protect your data, both at rest and in motion. This includes encrypting sensitive information and using secure authentication methods.
1. Managing Data Growth
As data grows, it becomes more difficult to manage and process. This can lead to delays in decision-making and reduced efficiency.
2. Ensuring Data Quality
With so much data, it can be difficult to ensure that it is all accurate and high-quality.
This can lead to bad decisions being made based on incorrect data.
3. Meeting Performance Expectations
With AWS Big Data architecture comes big expectations. Users expect systems to be able to handle large amounts of data quickly and efficiently. This can be a challenge for architects who must design systems that can meet these expectations.
4. Security and Privacy
With so much data being stored, there is a greater risk of it being hacked or leaked. This can jeopardize the security and privacy of those who are using the system.
5. Cost
(ii) Knowing the difference between single-node clusters and multi-node clusters
(iii) Accessing the WEB-UI and the port number
(iv) Installing and accessing environments such as Hive and Sqoop.
HADOOP INSTALLATION STEPS
Hadoop is a framework written in Java for running applications on large clusters of commodity hardware and incorporates features similar to those of the Google File System (GFS) and of the MapReduce computing paradigm. Hadoop's HDFS is a highly fault-tolerant distributed file system and, like Hadoop in general, is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that have large datasets.
Hadoop can be installed in 3 modes: Local (Standalone), Pseudo-Distributed, and Fully Distributed.
System Settings
User Accounts
unlock
$ sudo su
$ exit
STEP 1: Install JAVA
# Update the source list
$ java -version
java-1.7.0-openjdk-i386 (i386 means 32-bit OS). You can remove the last characters.
STEP 2: Download and extract hadoop-1.2.1.tar.gz
Enter into the hadoop-1.2.1 main folder – conf folder
Configuring the ~/.bashrc file
$ gedit ~/.bashrc
Conf/core-site.xml
<configuration> <property> <name>fs.default.name</name>
<value>hdfs://localhost:8020</value> </property>
</configuration>
Conf/mapred-site.xml
<configuration> <property>
<name>mapred.job.tracker</name>
<value>localhost:8021</value> </property>
</configuration>
Conf/hdfs-site.xml
<configuration>
</configuration>
Conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk
STEP 6: Format the name node
ivcse@hadoop:~$ hadoop namenode -format
STEP 7: Start the Processes
hduser@ubuntu:~$ start-dfs.sh
hduser@ubuntu:~$ start-mapred.sh
hduser@ubuntu:~$ start-all.sh
hduser@ubuntu:~$ jps
hduser@ubuntu:~$ stop-all.sh
Usage: hadoop fs
mkdir
Usage: hadoop fs -mkdir [-p] <paths>
Takes path URIs as argument and creates directories.
Options:
The -p option behavior is much like Unix mkdir -p, creating parent directories along the path.
Example:
hadoop fs -mkdir hdfs://nn1.example.com/user/hadoop/dir hdfs://nn2.example.com/user/hadoop/dir
Options:
lsr
Usage: hadoop fs -lsr <args>
cat
Usage: hadoop fs -cat URI [URI ...]
get
Usage: hadoop fs -get <src> <localdst>
put
Usage: hadoop fs -put <localsrc> ... <dst>
Copy single src, or multiple srcs, from the local file system to the destination file system. Also reads input from stdin and writes to destination file system.
copyFromLocal
Usage: hadoop fs -copyFromLocal <localsrc> URI
Similar to put command, except that the source is restricted to a local file reference.
copyToLocal
Usage: hadoop fs -copyToLocal URI <localdst>
Similar to get command, except that the destination is restricted to a local file reference.
count
Usage: hadoop fs -count <paths>
Count the number of directories, files and bytes under the paths that match the specified file pattern. The output columns with -count are: DIR_COUNT, FILE_COUNT, CONTENT_SIZE, PATHNAME
cp
Usage: hadoop fs -cp [-f] [-p] URI [URI ...] <dest>
Example:
hadoop fs -cp /user/hadoop/file1 /user/hadoop/file2
mv
Usage: hadoop fs -mv URI [URI ...] <dest>
Moves files from source to destination. This command allows multiple sources as well, in which case the destination needs to be a directory. Moving files across file systems is not permitted.
Example:
hadoop fs -mv hdfs://nn.example.com/file3 hdfs://nn.example.com/dir1
rm
Usage: hadoop fs -rm [-f] [-r|-R] URI [URI ...]
The -R option deletes the directory and any content under it recursively.
rmdir
Usage: hadoop fs -rmdir [--ignore-fail-on-non-empty] URI [URI ...]
--ignore-fail-on-non-empty: When using wildcards, do not fail if a directory still contains files.
Example:
hadoop fs -rmdir /user/hadoop/emptydir
rmr
Usage: hadoop fs -rmr URI [URI ...]
df
Usage: hadoop fs -df [-h] URI [URI ...]
The -h option will format file sizes in a “human-readable” fashion (e.g. 64.0m instead of 67108864).
du
Usage: hadoop fs -du URI [URI ...]
Displays sizes of files and directories contained in the given directory or the length of a file in case it's just a file.
Example:
hadoop fs -du /user/hadoop/dir1 /user/hadoop/file1
help
Usage: hadoop fs -help
setrep
Usage: hadoop fs -setrep [-R] [-w] <numReplicas> <path>
Changes the replication factor of a file. If path is a directory then the command recursively changes the replication factor of all files under the directory tree rooted at path.
Options:
The -w flag requests that the command wait for the replication to complete. This can potentially take a very long time.
tail
Usage: hadoop fs -tail URI
checksum
Usage: hadoop fs -checksum URI
Example:
hadoop fs -checksum file:///etc/hosts
chgrp
Usage: hadoop fs -chgrp [-R] GROUP URI [URI ...]
Change group association of files. The user must be the owner of files, or else a super-user. Additional information is in the Permissions Guide.
Options:
The -R option will make the change recursively through the directory structure.
chmod
Usage: hadoop fs -chmod [-R] <MODE[,MODE]... | OCTALMODE> URI [URI ...]
Change the permissions of files. With -R, make the change recursively through the directory structure. The user must be the owner of the file, or else a super-user. Additional information is in the Permissions Guide.
Options:
The -R option will make the change recursively through the directory structure.
chown
Usage: hadoop fs -chown [-R] [OWNER][:[GROUP]] URI [URI ...]
Change the owner of files. The user must be a super-user. Additional information is in the Permissions Guide.
Options:
The -R option will make the change recursively through the directory structure.
1. Scalability
2. Flexibility
WORDCOUNT MAPPER:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
public class WordMapper extends MapReduceBase implements
Mapper<LongWritable,
Text, Text, IntWritable> {
// Map function
public void map(LongWritable key, Text value, OutputCollector<Text,
IntWritable> output, Reporter rep) throws IOException
{
String line = value.toString();
// Splitting the line on spaces
for (String word : line.split(" "))
{
if (word.length() > 0)
{
output.collect(new Text(word), new IntWritable(1));
}
}
}
}
WORDCOUNT REDUCER:
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
public class WordReducer extends MapReduceBase implements Reducer<Text,
IntWritable, Text, IntWritable> {
// Reduce function
public void reduce(Text key, Iterator<IntWritable> value,
OutputCollector<Text, IntWritable> output,
Reporter rep) throws IOException
{
int count = 0;
// Counting the frequency of each word
while (value.hasNext())
{
IntWritable i = value.next();
count += i.get();
}
output.collect(key, new IntWritable(count));
}
}
Hive Commands
Hive supports Data Definition Language (DDL), Data Manipulation Language (DML), and user-defined functions.
Hive DDL Commands
create database
drop database
create table
drop table
alter table
create index
create views
Hive DML Commands
Select
Where
Group By
Order By
Load Data
Join:
o Inner Join
o Left Outer Join
o Right Outer Join
o Full Outer Join
WORDCOUNT DRIVER CODE
import java.io.IOException;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class WordCount extends Configured implements Tool {
    // run() configures the JobConf and submits the job
    public int run(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("Usage: WordCount <input path> <output path>");
            return -1;
        }
        JobConf conf = new JobConf(WordCount.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        conf.setMapperClass(WordMapper.class);
        conf.setReducerClass(WordReducer.class);
        conf.setMapOutputKeyClass(Text.class);
        conf.setMapOutputValueClass(IntWritable.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        JobClient.runJob(conf);
        return 0;
    }
    // Main Method
    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new WordCount(), args);
        System.out.println(exitCode);
    }
}
WORDCOUNT MAPPER
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
public class WordMapper extends MapReduceBase implements Mapper<LongWritable,
Text, Text, IntWritable> {
    // Map function
    public void map(LongWritable key, Text value, OutputCollector<Text,
    IntWritable> output, Reporter rep) throws IOException
    {
        String line = value.toString();
        // Splitting the line on spaces
        for (String word : line.split(" "))
        {
            if (word.length() > 0)
            {
                output.collect(new Text(word), new IntWritable(1));
            }
        }
    }
}
WORDCOUNT REDUCER
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
public class WordReducer extends MapReduceBase implements Reducer<Text, IntWritable,
Text, IntWritable> {
    // Reduce function
    public void reduce(Text key, Iterator<IntWritable> value,
    OutputCollector<Text, IntWritable> output,
    Reporter rep) throws IOException
    {
        int count = 0;
        // Counting the frequency of each word
        while (value.hasNext())
        {
            IntWritable i = value.next();
            count += i.get();
        }
        output.collect(key, new IntWritable(count));
    }
}
EXECUTION STEPS AND OUTPUT
[cloudera@quickstart ~]$ ls
hi
hello
hi students
welecome to VITW
welcme to CSE
welecome to SPARK lab
lab
[cloudera@quickstart ~]$ cat /home/cloudera/inputfile.txt
hi
hello
hi students
welecome to VITW
welcme to CSE
welecome to SPARK lab
lab
[cloudera@quickstart ~]$ hdfs dfs -ls /
Found 7 items
drwxr-xr-x - cloudera supergroup 0 2023-02-16 20:55 /Tejaswi
[cloudera@quickstart ~]$ hdfs dfs -mkdir /input
[cloudera@quickstart ~]$ hdfs dfs -ls /
Found 8 items
drwxr-xr-x - cloudera supergroup 0 2023-02-16 20:55 /Tejaswi
drwxr-xr-x - cloudera supergroup 0 2023-02-23 21:14 /input
[cloudera@quickstart ~]$ hdfs dfs -put /home/cloudera/inputfile.txt /input
java.lang.InterruptedException
at java.lang.Thread.join(Thread.java:1281)
at java.lang.Thread.join(Thread.java:1355)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)
[cloudera@quickstart ~]$ hdfs dfs -put /home/cloudera/inputfile.txt /input/
put:
[cloudera@quickstart ~]$ cat > /home/cloudera/Processfile.txt
hi
hello
hi students
hello students
[cloudera@quickstart ~]$ hdfs dfs -put /home/cloudera/Processfile.txt /input/
[cloudera@quickstart ~]$ hdfs dfs -cat /input
cat: `/input': Is a directory
[cloudera@quickstart ~]$ hdfs dfs -cat /input/Processfile.txt
hi
hello
hi students
hello students
23/02/23 21:20:58 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
23/02/23 21:20:59 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
23/02/23 21:20:59 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
23/02/23 21:20:59 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1677210793042_0001
23/02/23 21:21:00 INFO impl.YarnClientImpl: Submitted application application_1677210793042_0001
23/02/23 21:21:00 INFO mapreduce.Job: Running job: job_1677210793042_0001
23/02/23 21:21:08 INFO mapreduce.Job: map 0% reduce 0%
23/02/23 21:21:19 INFO mapreduce.Job: map 100% reduce 0%
23/02/23 21:21:26 INFO mapreduce.Job: map 100% reduce 100%
23/02/23 21:21:26 INFO mapreduce.Job: Job job_1677210793042_0001 completed successfully
File System Counters
FILE: Number of bytes read=78
FILE: Number of large read operations=0
HDFS: Number of read operations=9
HDFS: Number of large read operations=0
Job Counters
Launched reduce tasks=1
Total time spent by all reduce tasks (ms)=3826
Total vcore-milliseconds taken by all reduce tasks=3826
Total megabyte-milliseconds taken by all map tasks=16248832
Total megabyte-milliseconds taken by all reduce tasks=3917824
Map-Reduce Framework
Map output records=6
Map output materialized bytes=84
Combine input records=0
Spilled Records=12
Shuffled Maps=2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=175
CPU time spent (ms)=1230
Physical memory (bytes) snapshot=559370240
Total committed heap usage (bytes)=392372224
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=54
File Output Format Counters
Bytes Written=24
[cloudera@quickstart ~]$ hdfs dfs -ls /output
Found 2 items
-rw-r--r-- 1 cloudera supergroup 0 2023-02-23 21:21 /output/_SUCCESS
[cloudera@quickstart ~]$ hdfs dfs -cat /output/part-00000
hello 2
hi 2
students 2
[cloudera@quickstart ~]$
EXPERIMENT-5:
Implementing Matrix-Multiplication with Hadoop Map-Reduce
MATRIX DRIVER CODE
package com.lendap.hadoop;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class MatrixMultiply {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: MatrixMultiply <in_dir> <out_dir>");
            System.exit(2);
        }
        Configuration conf = new Configuration();
        // M is an m-by-n matrix; N is an n-by-p matrix.
        conf.set("m", "1000"); // set these to match the dimensions of the input matrices
        conf.set("n", "100");
        conf.set("p", "1000");
        Job job = new Job(conf, "MatrixMultiply");
        job.setJarByClass(MatrixMultiply.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}
MATRIX MAPPER CODE
package com.lendap.hadoop;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
public class Map
        extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        int m = Integer.parseInt(conf.get("m"));
        int p = Integer.parseInt(conf.get("p"));
        String line = value.toString();
        // Each input line is (M, i, j, Mij) or (N, j, k, Njk)
        String[] indicesAndValue = line.split(",");
        Text outputKey = new Text();
        Text outputValue = new Text();
        if (indicesAndValue[0].equals("M")) {
            for (int k = 0; k < p; k++) {
                outputKey.set(indicesAndValue[1] + "," + k);
                // outputKey.set(i,k);
                outputValue.set(indicesAndValue[0] + "," + indicesAndValue[2]
                        + "," + indicesAndValue[3]);
                // outputValue.set(M,j,Mij);
                context.write(outputKey, outputValue);
            }
        } else {
            // (N, j, k, Njk);
            for (int i = 0; i < m; i++) {
                outputKey.set(i + "," + indicesAndValue[2]);
                outputValue.set("N," + indicesAndValue[1] + ","
                        + indicesAndValue[3]);
                context.write(outputKey, outputValue);
            }
        }
    }
}
MATRIX REDUCER CODE
package com.lendap.hadoop;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
import java.util.HashMap;
public class Reduce
        extends org.apache.hadoop.mapreduce.Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String[] value;
        // key = (i,k)
        // Values = [(M/N,j,V/W),..]
        HashMap<Integer, Float> hashA = new HashMap<Integer, Float>();
        HashMap<Integer, Float> hashB = new HashMap<Integer, Float>();
        for (Text val : values) {
            value = val.toString().split(",");
            if (value[0].equals("M")) {
                hashA.put(Integer.parseInt(value[1]), Float.parseFloat(value[2]));
            } else {
                hashB.put(Integer.parseInt(value[1]), Float.parseFloat(value[2]));
            }
        }
        int n = Integer.parseInt(context.getConfiguration().get("n"));
        float result = 0.0f;
        float m_ij;
        float n_jk;
        for (int j = 0; j < n; j++) {
            m_ij = hashA.containsKey(j) ? hashA.get(j) : 0.0f;
            n_jk = hashB.containsKey(j) ? hashB.get(j) : 0.0f;
            result += m_ij * n_jk;
        }
        if (result != 0.0f) {
            context.write(null,
                    new Text(key.toString() + "," + Float.toString(result)));
        }
    }
}
Output:
Before writing the code, let's first create the matrices and put them in HDFS.
Create two files M1, M2 and put the matrix values.
(separate columns with spaces and rows with a line break)
For this example I am taking the matrices as:
M1:
1 2 3
4 5 6
M2:
7 8
9 10
11 12
Put the above files to HDFS at location /user/cloudera/matrices/
hdfs dfs -mkdir /user/cloudera/matrices
hdfs dfs -put /path/to/M1 /user/cloudera/matrices/
hdfs dfs -put /path/to/M2 /user/cloudera/matrices/
The above command should output the resultant matrix.
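As a quick sanity check, here is a small standalone Python sketch (not part of the Hadoop job) that multiplies the two example matrices, so the expected resultant matrix is known before it is compared with the job output:

# Plain-Python check of the expected result for the example matrices above
M1 = [[1, 2, 3],
      [4, 5, 6]]
M2 = [[7, 8],
      [9, 10],
      [11, 12]]

# result[i][k] = sum over j of M1[i][j] * M2[j][k]
result = [[sum(M1[i][j] * M2[j][k] for j in range(len(M2)))
           for k in range(len(M2[0]))]
          for i in range(len(M1))]

print(result)   # [[58, 64], [139, 154]]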
(ii) Loading data
Employee table with id, designation; Salary table (salary, dept id)
Hive Commands
Hive supports Data Definition Language (DDL), Data Manipulation Language (DML) and user-defined functions.
Hive DDL Commands
create database
drop database
create table
drop table
alter table
create index
create views
Hive DML Commands
Select
Where
Group By
Order By
Load Data
Join:
o Inner Join
o Left Outer Join
o Right Outer Join
o Full Outer Join
Hive DDL Commands
Create Database Statement
hive> create database cse;
OK
Time taken: 0.129 seconds
hive> create database bigdata;
OK
Time taken: 0.051 seconds
hive> show databases;
OK
bigdata
cse
default
Time taken: 0.013 seconds, Fetched: 3 row(s)
hive> drop database cse;
OK
Time taken: 0.134 seconds
hive> show databases;
OK
bigdata
default
Time taken: 0.155 seconds, Fetched: 2 row(s)
Create a table
hive> create table employee(name string, id int, sal float, designation string) row format delimited fields terminated by ',' lines terminated by '\n' stored as textfile;
OK
Time taken: 2.261 seconds
hive> show tables;
OK
employee
Time taken: 0.046 seconds, Fetched: 1 row(s)
hive> desc employee;
OK
name string
id int
sal float
designation string
Time taken: 0.096 seconds, Fetched: 4 row(s)
[cloudera@quickstart ~]$ hdfs dfs -mkdir /bigdata
[cloudera@quickstart ~]$ hdfs dfs -ls /
Found 7 items
[cloudera@quickstart ~]$ hdfs dfs -put /home/cloudera/Desktop/sample.txt /bigdata
[cloudera@quickstart ~]$ hdfs dfs -ls /bigdata
Found 1 items
LOAD DATA TO HIVE
Loading data to table default.employee
Table default.employee stats: [numFiles=1, totalSize=59]
OK
Time taken: 0.803 seconds
Altering and Dropping Tables
hive> create table student(id int, name string);
Time taken: 0.483 seconds
hive> show tables;
OK
employee
student
Time taken: 0.085 seconds, Fetched: 2 row(s)
hive> alter table student rename to cse;
OK
Time taken: 0.118 seconds
hive> show tables;
OK
cse
employee
Time taken: 0.024 seconds, Fetched: 2 row(s)
hive> select * from employee;
OK
lucky 1 123.0 asst.prof
veda 2 456.0 child
abhi 3 789.0 business
Time taken: 0.933 seconds, Fetched: 3 row(s)
Relational Operators
hive> select avg(sal) from employee;
Query ID = cloudera_20230414221212_7d3c1250-d45b-4441-88e7-9127d8f389c2
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
2023-04-14 22:13:02,273 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.11 sec
job_1681530689884_0001
MapReduce Jobs Launched:
Total MapReduce CPU Time Spent: 2 seconds 110 msec
OK
456.0
Total MapReduce CPU Time Spent: 1 seconds 740 msec
OK
789999.0
Total MapReduce CPU Time Spent: 1 seconds 710 msec
OK
4623578.0
hive> SELECT * FROM employee WHERE Salary >= 800000;
OK
hpriya 345 4623578.0 manager
hddsj256433522323.0 svbahdg
Time taken: 0.763 seconds, Fetched: 2 row(s)
(i) PySpark Definition (Apache PySpark) and difference between PySpark, Scala, pandas
What is PySpark?
Apache Spark is written in the Scala programming language. PySpark has been released in order to support the collaboration of Apache Spark and Python; it actually is a Python API for Spark. In addition, PySpark helps you interface with Resilient Distributed Datasets (RDDs) in Apache Spark using the Python programming language. This has been achieved by taking advantage of the Py4j library.
PySpark Logo
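A minimal sketch of driving Spark from Python through the PySpark API (the application name and sample rows are illustrative assumptions, not part of the original experiment):

from pyspark.sql import SparkSession

# Start a local Spark session; the heavy lifting still runs on the JVM Spark engine,
# while Py4j bridges the Python driver to it.
spark = SparkSession.builder.master("local[*]").appName("pyspark-intro").getOrCreate()

# A small DataFrame created from Python objects
df = spark.createDataFrame([(1, "spark"), (2, "hadoop")], ["id", "name"])
df.show()

# The RDD API is also reachable from Python
print(df.rdd.map(lambda row: row.name).collect())

spark.stop()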
(ii) PySpark files and class methods
(i) get(filename)
(ii) getrootdirectory()
PySpark provides the facility to upload your files using sc.addFile. We can also get the path of the working directory using SparkFiles.get. Moreover, to resolve the paths to the files added through SparkContext.addFile(), the following types of class method are available in SparkFiles, such as:
o get(filename)
o getrootdirectory()
Let's learn about the class methods in detail.
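Before looking at the class source below, here is a minimal usage sketch of these two class methods (the file name and local path are assumptions):

from pyspark import SparkContext, SparkFiles

sc = SparkContext("local", "SparkFiles app")

# Distribute a local file to every node of the job (assumed path)
sc.addFile("/home/cloudera/sample.txt")

# Absolute path of the distributed copy, resolved on this node
print(SparkFiles.get("sample.txt"))

# Root directory that holds all files added through SparkContext.addFile()
print(SparkFiles.getRootDirectory())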
o get(filename)

import os

class SparkFiles(object):
    """
    Resolves paths to files added through
    L{SparkContext.addFile()<pyspark.context.SparkContext.addFile>}.

    SparkFiles contains only classmethods; users should not create
    SparkFiles instances.
    """

    _root_directory = None
    _is_running_on_worker = False
    _sc = None

    def __init__(self):
        raise NotImplementedError("Do not construct SparkFiles objects")

    @classmethod
    def get(cls, filename):
        """
        Get the absolute path of a file added through C{SparkContext.addFile()}.
        """
        path = os.path.join(SparkFiles.getRootDirectory(), filename)
        return os.path.abspath(path)

o getrootdirectory()

    @classmethod
    def getRootDirectory(cls):
        """
        Get the root directory that contains files added through
        C{SparkContext.addFile()}.
        """
        if cls._is_running_on_worker:
            return cls._root_directory
        else:
            return cls._sc._jvm.org.apache.spark.SparkFiles.getRootDirectory()
PySpark - RDDs
(i) What is an RDD?
(ii) Ways to create an RDD
(i) parallelized collections
(ii) external dataset
(iii) existing RDDs
(iv) Spark RDD operations
(Count, foreach(), Collect, join, Cache())
Ways to create an RDD
There are the following ways to create an RDD in Spark: 1. Using a parallelized collection, 2. From an existing Apache Spark RDD, and 3. From external datasets.
To create an RDD using parallelize()
The parallelize() method of the spark context is used to create a Resilient Distributed Dataset (RDD) from an iterable or a collection.
Syntax
sparkContext.parallelize(iterable, numSlices)
Parameters
iterable: This is an iterable or a collection from which an RDD has to be created.
numSlices: This is an optional parameter that indicates the number of slices to cut the RDD into. The number of slices can be manually provided by setting this parameter. Otherwise, Spark will set this to the default parallelism that is inferred from the cluster.
This method returns an RDD.
Code example
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('educative-answers').config("spark.some.config.option", "some-value").getOrCreate()

collection = [("James", "Smith", "USA", "CA"),
("Michael", "Rose", "USA", "NY"),
("Robert", "Williams", "USA", "CA"),
("Maria", "Jones", "USA", "FL")
]

sc = spark.sparkContext

# RDD with the default number of slices
rdd = sc.parallelize(collection)
rdd_elements = rdd.collect()
print("RDD with default slices -", rdd_elements)
print("Number of partitions -", rdd.getNumPartitions())
print("-" * 8)

# RDD with an explicit number of slices
numSlices = 8
rdd = sc.parallelize(collection, numSlices)
rdd_elements = rdd.collect()
print("RDD with default slices -", rdd_elements)
print("Number of partitions -", rdd.getNumPartitions())
output:
RDD with default slices - [('James', 'Smith', 'USA', 'CA'), ('Michael', 'Rose', 'USA', 'NY'), ('Robert', 'Williams', 'USA', 'CA'), ('Maria', 'Jones', 'USA', 'FL')]
Number of partitions - 4
--------
RDD with default slices - [('James', 'Smith', 'USA', 'CA'), ('Michael', 'Rose', 'USA', 'NY'), ('Robert', 'Williams', 'USA', 'CA'), ('Maria', 'Jones', 'USA', 'FL')]
Number of partitions - 8
From external datasets (Referencing a dataset in an external storage system)
If a storage source is supported by Hadoop, including our local file system, Spark can create RDDs from it. Apache Spark supports sequence files, text files, and any other Hadoop input format.
We can create text file RDDs with the spark context's textFile method. This method uses the URL for the file (either a local path on the machine, or a database, or a hdfs://, s3n://, etc. URL). It reads the whole file as a collection of lines.
We can copy the file to the worker nodes. We can also use a network-mounted shared file system.
csv(String path)
json(String path)
textFile(String path)
csv(String path)
Loads a CSV file, which returns Dataset<Row> as a result.
Example:
import org.apache.spark.sql.SparkSession

object DataFormat {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("ExtDataEx1").master("local").getOrCreate()
    val dataRDD = spark.read.csv("path/of/csv/file").rdd
  }
}
json(String path)
Loads a JSON file (one object per line), which returns Dataset<Row> as a result.
Example:
val dataRDD = spark.read.json("path/of/json/file").rdd
textFile(String path)
Loads a text file, which returns a Dataset of String as a result.
Example:
val dataRDD = spark.read.textFile("path/of/text/file").rdd
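For the PySpark experiments, the same three readers can be used from Python. A minimal sketch (the paths are placeholders, as in the Scala examples above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ExtDataEx1").master("local").getOrCreate()
sc = spark.sparkContext

# RDD of lines from a plain text file (local path or hdfs:// / s3n:// URL)
lines_rdd = sc.textFile("path/of/text/file")

# RDDs obtained through the DataFrame readers, mirroring the Scala examples
csv_rdd = spark.read.csv("path/of/csv/file").rdd
json_rdd = spark.read.json("path/of/json/file").rdd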
From existing Apache Spark RDDs
The process of creating another dataset from the existing ones means transformation. As a result, a transformation always produces a new RDD. As RDDs are immutable, no changes take place in the original once it is created. This property maintains consistency over the cluster.
Example:
val words = spark.sparkContext.parallelize(Seq("sun", "rises", "in", "the", "east", "and", "sets", "in", "the", "west"))
val wordPair = words.map(w => (w.charAt(0), w))
wordPair.foreach(println)
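The same transformation written with the PySpark RDD API (a sketch; it assumes a SparkSession named spark, as created earlier):

words = spark.sparkContext.parallelize(
    ["sun", "rises", "in", "the", "east", "and", "sets", "in", "the", "west"])

# map() derives a new RDD from the existing one; the original RDD is unchanged
word_pair = words.map(lambda w: (w[0], w))

print(word_pair.collect())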
Spark RDD operations
To apply operations on these RDDs, there are two ways −
Transformation
Action
A few operations on words
count()
Number of elements in the RDD is returned.
count.py
from pyspark import SparkContext
sc = SparkContext("local", "count app")
words = sc.parallelize (
["scala",
"java",
"hadoop",
"spark",
"akka",
"spark vs hadoop",
"pyspark",
"pyspark and spark"]
)
counts = words.count()
print("Number of elements in RDD -> %i" % (counts))
count.py
Command − The command for count() is −
$SPARK_HOME/bin/spark-submit count.py
Output − The output for the above command is −
Number of elements in RDD → 8
collect()
All the elements in the RDD are returned.
collect.py
from pyspark import SparkContext
sc = SparkContext("local", "Collect app")
words = sc.parallelize (
["scala",
"java",
"hadoop",
"spark",
"akka",
"spark vs hadoop",
"pyspark",
"pyspark and spark"]
)
coll = words.collect()
print("Elements in RDD -> %s" % (coll))
collect.py
Command − The command for collect() is −
$SPARK_HOME/bin/spark-submit collect.py
Output − The output for the above command is −
Elements in RDD -> [
'scala',
'java',
'hadoop',
'spark',
'akka',
'spark vs hadoop',
'pyspark',
'pyspark and spark'
]
foreach(f)
Applies the function f to each element of the RDD. In the following example, we call a print function in foreach, which prints all the elements in the RDD.
foreach.py
from pyspark import SparkContext
sc = SparkContext("local", "ForEach app")
words = sc.parallelize (
["scala",
"java",
"hadoop",
"spark",
"akka",
"spark vs hadoop",
"pyspark",
"pyspark and spark"]
)
def f(x): print(x)
fore = words.foreach(f)
foreach.py
Command − The command for foreach(f) is −
$SPARK_HOME/bin/spark-submit foreach.py
Output − The output for the above command is −
scala
java
hadoop
spark
akka
spark vs hadoop
pyspark
pyspark and spark
join(other, numPartitions = None)
It returns an RDD with pairs of elements with the matching keys and all the values for that particular key. In the following example, there are two pairs of elements in two different RDDs. After joining these two RDDs, we get an RDD with elements having matching keys and their values.
join.py
from pyspark import SparkContext
sc = SparkContext("local", "Join app")
x = sc.parallelize([("spark", 1), ("hadoop", 4)])
y = sc.parallelize([("spark", 2), ("hadoop", 5)])
joined = x.join(y)
final = joined.collect()
print("Join RDD -> %s" % (final))
join.py
Command − The command for join(other, numPartitions = None) is −
$SPARK_HOME/bin/spark-submit join.py
Output − The output for the above command is −
Join RDD -> [('spark', (1, 2)), ('hadoop', (4, 5))]
cache()
Persists the RDD with the default storage level (MEMORY_ONLY). You can also check whether the RDD is cached or not.
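A minimal cache.py sketch consistent with the command and output shown below (the word list is an assumption, mirroring the earlier examples):

cache.py
from pyspark import SparkContext
sc = SparkContext("local", "Cache app")
words = sc.parallelize (
["scala",
"java",
"hadoop",
"spark",
"akka",
"spark vs hadoop",
"pyspark",
"pyspark and spark"]
)
# Mark the RDD to be kept in memory the first time it is computed
words.cache()
caching = words.is_cached
print("Words got cached -> %s" % (caching))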
Command − The command for cache() is −
$SPARK_HOME/bin/spark-submit cache.py
Output − The output for the above command is −
Words got cached -> True