Big Data Spark Manual (MR-22)

Department of Computer Science and Engineering (CS)

Course File

III B.Tech II Semester

Subject: Big Data Spark Lab

Prepared by

G. Varahagiri
Assistant Professor
BIG DATA: SPARK
EXPERIMENT-1:

Study of Big Data Analytics and Hadoop Architecture

(i) Know the concept of big data architecture

(ii) Know the concept of Hadoop architecture

EXPERIMENT-2: Loading Data Set into HDFS for Spark Analysis

Installation of Hadoop and cluster management

(i) Installing Hadoop single node cluster in ubuntu environment

(ii) Knowing the difference between single-node clusters and multi-node clusters

(iii) Accessing the Web UI and the port number

(iv) Installing and accessing environments such as Hive and Sqoop.

EXPERIMENT-3:

File management tasks & Basic Linux commands

(i) Creating a directory in HDFS

(ii) Moving forth and back to directories

(iii) Listing directory contents

(iv) Uploading and downloading a file in HDFS

(v) Checking the contents of the file

(vi) Copying and moving files

(vii) Copying and moving files between local to HDFS environment

(viii) Removing files and paths


(ix) Displaying few lines of a file

(x) Display the aggregate length of a file

(xi) Checking the permissions of a file



(xii) Zipping and unzipping files, with and without permissions, and pasting them to a location

(xiii) Copy, Paste commands


EXPERIMENT-4:

Map-reducing

(i) Definition of Map-reduce

(ii) Its stages and terminologies

(iii) Word-count program to understand map-reduce (Mapper phase, Reducer phase, Driver code)

EXPERIMENT-5:

Implementing Matrix-Multiplication with Hadoop Map-reduce

EXPERIMENT-6:

Compute Average Salary and Total Salary by Gender for an Enterprise.

EXPERIMENT-7:

(i) Creating Hive tables (external and internal)

(ii) Loading data into external Hive tables from SQL tables (or) structured CSV files using Sqoop

(iii) Performing operations like filtrations and updates

(iv) Performing joins (inner, outer, etc.)

(v) Writing user-defined functions on Hive tables


EXPERIMENT-8:

Create SQL tables of employees: an Employee table with (id, designation) and a Salary table with (salary, dept id).
Create external tables in Hive with a similar schema to the above tables, move the data to Hive using Sqoop and load
the contents into the tables, filter into a new table, and write a UDF to encrypt the table with the AES
algorithm, then decrypt it with the key to show the contents.
EXPERIMENT-9:

(i) PySpark definition (Apache PySpark) and the difference between PySpark, Scala, and pandas

(ii) PySpark files and class methods

(i) get(filename)

(ii) getrootdirectory()

EXPERIMENT-10:

PySpark RDDs

(i) What are RDDs?

(ii) Ways to create an RDD

(i) parallelized collections

(ii) external dataset

(iii) existing RDDs

(iv) Spark RDD operations (count(), foreach(), collect(), join(), cache())

EXPERIMENT-11:

Perform PySpark transformations

(i) map and flatMap

(ii) Removing the words that are not necessary to analyse the text

(iii) groupBy

(iv) Calculating how many times each word occurs in the corpus

(v) Performing a task (say, counting the words 'spark' and 'apache' in rdd3) separately on each partition and getting the output of the task performed in these partitions

(vi) Unions of RDDs

(vii) Joining two pair RDDs based upon their keys



EXPERIMENT-12:

PySpark SparkConf: attributes and applications

(i) What is PySpark SparkConf()?

(ii) Using SparkConf, create a SparkSession to read details from a CSV into a DataFrame and later move that CSV to another location



EXPERIMENT-1:

(i) Know the concept of big data architecture


The term "Big Data architecture" refers to the systems and software used to manage Big
Data. A Big Data architecture must be able to handle the scale, complexity, and variety of
Big Data. It must also be able to support the needs of different users, who may want to
access and analyze the data differently.

A Big Data architecture has four main layers:

1. Data Ingestion

This layer is responsible for collecting and storing data from various sources. In Big Data,
data ingestion is the process of extracting data from various sources and loading it into a data
repository. Data ingestion is a key component of a Big Data architecture because it
determines how data will be ingested, transformed, and stored.

2. Data Processing

Data processing is the second layer, responsible for collecting, cleaning, and preparing the data
for analysis. This layer is critical for ensuring that the data is of high quality and ready to be used.

3. Data Storage

Data storage is the third layer, responsible for storing the data in a format that can be easily
accessed and analyzed. This layer is essential for ensuring that the data is accessible and
available to the other layers.

4. Data Visualization

Data visualization is the fourth layer and is responsible for creating visualizations of the data
that humans can easily understand. This layer is important for making the data accessible.



Big Data Architecture Processes

When we discuss traditional and Big Data analytics reference architectures, we must
remember that the architecture process plays an important role in Big Data.

1. Connecting to Data Sources

Connectors and adapters can quickly connect to any storage system, protocol, or network,
and work with any data format.

2. Data Governance

From the time data is ingested through processing, analysis, storage, and deletion, there
are protections for privacy and security.

3. Managing Systems

Contemporary Big Data systems, such as those built on the Lambda architecture, are often deployed on large-scale
distributed clusters, which are highly scalable and require constant monitoring via
centralized management interfaces.

4. Protecting Quality of Service

The Quality-of-Service framework supports the definition of data quality, ingestion
frequency, compliance guidelines, and data sizes.

A few processes are essential to the architecture of Big Data. First, data must be collected
from various sources. This data must then be processed to ensure its quality and accuracy.
After this, the data must be stored securely and reliably. Finally, the data must be made
accessible to those who need it.

How to Build a Big Data Architecture?

Designing a Big Data reference architecture, while complex, follows
the same general procedure:

1. Define Your Objectives

What do you hope to achieve with your Big Data architecture? Do you want to improve
decision-making, better understand your customers, or find new revenue opportunities?
Once you know what you want to accomplish, you can start planning your architecture.

2. Consider Your Data Sources

What data do you have, and where does it come from? You'll need to think about both
structured and unstructured data and internal and external sources.



3. Choose the Right Tools

Many different Big Data technologies are available, so it's important to select the ones that
best meet your needs.

4. Plan for Scalability

As your data grows, your Big Data solution architecture will need to be able to scale to
accommodate it. This means considering things like data replication and partitioning.

5. Keep Security in Mind

Make sure you have a plan to protect your data, both at rest and in motion. This includes
encrypting sensitive information and using secure authentication methods.

6. Test and Monitor

Once your architecture in Big Data is in place, it is important to test it to ensure it is
working as expected. You should also monitor your system on an ongoing basis to
identify any potential issues.

The Challenges of Big Data Architecture

There are many challenges to Big Data analytics architecture, including:

1. Managing Data Growth

As data grows, it becomes more difficult to manage and process. This can lead to delays
in decision-making and reduced efficiency.
2. Ensuring Data Quality

With so much data, it can be difficult to ensure that it is all accurate and high-quality.
This can lead to bad decisions being made based on incorrect data.

3. Meeting Performance Expectations

With Big Data architecture come big expectations. Users expect systems to be
able to handle large amounts of data quickly and efficiently. This can be a challenge for
architects who must design systems that can meet these expectations.

4. Security and Privacy

With so much data being stored, there is a greater risk of it being hacked or leaked. This
can jeopardize the security and privacy of those who are using the system.

5. Cost



Big Data solution architectures can be expensive to set up and maintain. This can be a
challenge for organizations that want to use a Big Data architecture but do not
have the budget for it.

(ii) Know the concept of Hadoop architecture

HDFS (Hadoop Distributed File System) is utilized for storage in a Hadoop cluster. It is mainly
designed for working on commodity hardware devices (inexpensive devices), working
on a distributed file system design. HDFS is designed in such a way that it believes more
in storing the data in large chunks of blocks rather than storing small data blocks.
HDFS in Hadoop provides fault tolerance and high availability to the storage layer and
the other devices present in that Hadoop cluster. Data storage nodes in HDFS:
 Name Node (Master)
 Data Node (Slave)
Name Node: The Name Node works as a Master in a Hadoop cluster that guides the Data
Nodes (Slaves). The Name Node is mainly used for storing the metadata, i.e. the data about the
data. Metadata can be the transaction logs that keep track of the user's activity in a Hadoop
cluster.
Metadata can also be the name of the file, its size, and the information about the
location (block number, block ids) of the Data Nodes, which the Name Node stores to find the closest
Data Node for faster communication. The Name Node instructs the Data Nodes with
operations like delete, create, replicate, etc.
Data Node: Data Nodes work as Slaves. Data Nodes are mainly utilized for storing the
data in a Hadoop cluster; the number of Data Nodes can be from 1 to 500 or even more than
that. The more Data Nodes, the more data the Hadoop cluster will be able to store. So
it is advised that the Data Node should have a high storage capacity to store a large number of file blocks.
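The metadata the Name Node keeps for a file (its blocks, replication factor, and the Data Nodes holding each replica) can be inspected with a small script. The following is a minimal sketch, assuming Python 3 and a file already uploaded at the illustrative path /bigdata/sample.txt:

check_metadata.py

import subprocess

# 'hadoop fsck' asks the Name Node for the metadata it keeps about a file:
# block ids, replication factor, and the Data Nodes holding each replica.
report = subprocess.run(
    ["hadoop", "fsck", "/bigdata/sample.txt", "-files", "-blocks", "-locations"],
    capture_output=True, text=True)
print(report.stdout)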



EXPERIMENT-2: Loading Data Set into HDFS for Spark Analysis

Installation of Hadoop and cluster management

(i) Installing Hadoop single node cluster in ubuntu environment

(ii) Knowing the difference between single-node clusters and multi-node clusters

(iii) Accessing the Web UI and the port number

(iv) Installing and accessing environments such as Hive and Sqoop.

HADOOP INSTALLATION STEPS
Hadoop is a framework written in Java for running applications on large clusters of commodity
hardware; it incorporates features similar to those of the Google File System (GFS) and of the
MapReduce computing paradigm. Hadoop's HDFS is a highly fault-tolerant distributed file system
and, like Hadoop in general, is designed to be deployed on low-cost hardware. It provides high-throughput
access to application data and is suitable for applications that have large datasets.
Hadoop can be installed in 3 modes: Local (Standalone), Pseudo-distributed, and Fully Distributed.

INSTALLING hadoop-1.2.1 in Ubuntu 14.04 (Pseudo Mode)


Adding a dedicated Hadoop system user
We will use a dedicated Hadoop user account for running Hadoop. While that is not required, it is
recommended because it helps to separate the Hadoop installation from other software applications and
user accounts running on the same machine.
Settings

System Settings

User Accounts

unlock

click the + button; choose the user type as Administrator or Standard

You can set a password here: click on "Account Disabled"


click password
confirm password
click on change

From Console , open terminal


# To create a group & user

$ sudo addgroup hadoop

$ sudo adduser --ingroup hadoop hduser

# To change the password

$ sudo su

Enter login password

$ sudo passwd <new username>

Enter password & confirm password

$ exit

# to come out of the terminal

STEP 1: Install JAVA
# Update the source list

$ sudo apt-get update

# Check whether Java is installed or not

$ java -version

# Install OpenJDK 7

$ sudo apt-get install openjdk-7-jdk

The full JDK will be placed in /usr/lib/jvm/java-1.7.0-openjdk-i386 (i386 means 32-bit OS).
You can remove the last characters and copy it to java-1.7.0-openjdk.

ivcse@hadoop:~$ cd /usr/lib/jvm/
STEP 2: Install SSH

$ sudo apt-get install ssh
$ ssh localhost

# To enable passwordless login
$ ssh-keygen -t dsa -P "" -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

Switch to the new user.

STEP 3: Download the Apache tarball

From Apache: in Google, type "Apache Mirrors" and select the first site



(apache.org). All Apache software will be displayed; select the hadoop folder, which displays 3 types:

chukwa\, common\, core\  -- select core\

Select the stable version for hadoop-1.2.1, i.e., stable1

Select hadoop-1.2.1.tar.gz

Create a new folder (say ABC) in the home directory.

Copy hadoop-1.2.1.tar.gz into ABC from the Downloads folder and extract it.

STEP 4: CONFIGURATION (5 files)

Enter into the hadoop-1.2.1 main folder, then into the conf folder.

Important files: hadoop-env.sh, core-site.xml, hdfs-site.xml, mapred-site.xml

Configuring the ~/.bashrc file

$ gedit ~/.bashrc

export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk
export HADOOP_HOME=/home/username/folder/hadoop-1.2.1   (e.g. /home/ivcse/ABC/hadoop-1.2.1)
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH
conf/core-site.xml

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:8020</value>
</property>
</configuration>

conf/mapred-site.xml

<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:8021</value>
</property>
</configuration>

conf/hdfs-site.xml

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>

conf/hadoop-env.sh

export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk

STEP 5: Format the name node

ivcse@hadoop:~$ hadoop namenode -format

STEP 6: Start the processes

hduser@ubuntu:~$ start-dfs.sh
hduser@ubuntu:~$ start-mapred.sh
hduser@ubuntu:~$ start-all.sh
hduser@ubuntu:~$ jps
hduser@ubuntu:~$ stop-all.sh
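Once the daemons are up, the NameNode and JobTracker expose web UIs ("Accessing the Web UI and the port number"). Below is a minimal reachability check, assuming the Hadoop 1.x default ports 50070 (NameNode) and 50030 (JobTracker); adjust the ports if your configuration differs:

check_webui.py

from urllib.request import urlopen

# default Hadoop 1.x web UI ports are assumed here
for name, port in [("NameNode", 50070), ("JobTracker", 50030)]:
    url = "http://localhost:%d/" % port
    try:
        with urlopen(url, timeout=5) as resp:
            print(name, "web UI reachable at", url, "status", resp.status)
    except OSError as err:
        print(name, "web UI not reachable at", url, ":", err)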



EXPERIMENT-3:

File management tasks & Basic Linux commands

 Print hadoop version

Usage: hadoop version

 fs: The FileSystem (FS) shell includes various shell-like commands that
directly interact with the Hadoop Distributed File System (HDFS) as well as other file systems that Hadoop supports.

Usage: hadoop fs

 mkdir: Create a directory in HDFS at given path(s).

Usage: hadoop fs -mkdir [-p] <paths>

Takes path URIs as arguments and creates directories.
Options:
The -p option behavior is much like Unix mkdir -p, creating parent directories along the path.

Example:

hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2

hadoop fs -mkdir hdfs://nn1.example.com/user/hadoop/dir hdfs://nn2.example.com/user/hadoop/dir

 ls: List the contents of a directory


Usage: hadoop fs -ls [-d] [-h] [-R] <args>

Options:

-d: Directories are listed as plain files.

-h: Format file sizes in a human-readable fashion (e.g. 64.0m instead of 67108864).

-R: Recursively list subdirectories encountered.

Example:

hadoop fs -ls /user/hadoop/file1

 lsr

Usage: hadoop fs -lsr <args>

Recursive version of ls.

Note: This command is deprecated. Instead use hadoop fs -ls -R

 cat

Usage: hadoop fs -cat URI [URI ...]

Copies source paths to stdout.

Example:

hadoop fs -cat hdfs://nn1.example.com/file1 hdfs://nn2.example.com/file2

hadoop fs -cat file:///file3 /user/hadoop/file4

 get
Usage: hadoop fs -get <src> <localdst>

Copy files to the local file system.

Example:

hadoop fs -get /user/hadoop/file localfile

hadoop fs -get hdfs://nn.example.com/user/hadoop/file localfile

 put

Usage: hadoop fs -put <localsrc> ... <dst>

Copy a single src, or multiple srcs, from the local file system to the destination file system. Also reads input
from stdin and writes to the destination file system.

hadoop fs -put localfile /user/hadoop/hadoopfile

hadoop fs -put localfile1 localfile2 /user/hadoop/hadoopdir

hadoop fs -put localfile hdfs://nn.example.com/hadoop/hadoopfile

hadoop fs -put - hdfs://nn.example.com/hadoop/hadoopfile   (reads the input from stdin)

 copyFromLocal

Usage: hadoop fs -copyFromLocal <localsrc> URI

Similar to the put command, except that the source is restricted to a local file reference.

Options:

The -f option will overwrite the destination if it already exists.

 copyToLocal

Usage: hadoop fs -copyToLocal URI <localdst>

Similar to the get command, except that the destination is restricted to a local file reference.

 count

Usage: hadoop fs -count <paths>

Counts the number of directories, files, and bytes under the paths that match the specified file pattern.
The output columns with -count are: DIR_COUNT, FILE_COUNT, CONTENT_SIZE, PATHNAME

 cp

Usage: hadoop fs -cp [-f] [-p] URI [URI ...] <dest>

Copy files from source to destination (both files should be in HDFS).
This command allows multiple sources as well, in which case the destination must be a directory.

Options:
The -f option will overwrite the destination if it already exists. The -p option will
preserve file attributes.

Example:

hadoop fs -cp /user/hadoop/file1 /user/hadoop/file2

hadoop fs -cp /user/hadoop/file1 /user/hadoop/file2 /user/hadoop/dir

 mv

Usage: hadoop fs -mv URI [URI ...] <dest>

Moves files from source to destination (both should be in HDFS). This command allows multiple
sources as well, in which case the destination needs to be a directory. Moving files across file systems is not
permitted.

Example:

hadoop fs -mv /user/hadoop/file1 /user/hadoop/file2

hadoop fs -mv hdfs://nn.example.com/file1 hdfs://nn.example.com/file2 hdfs://nn.example.com/file3 hdfs://nn.example.com/dir1

 rm

Usage: hadoop fs -rm [-f] [-r|-R] URI [URI ...]

Delete files specified as args.

Options:

The -f option will not display a diagnostic message or modify the exit status to reflect an error if
the file does not exist.

The -R option deletes the directory and any content under it recursively.

The -r option is equivalent to -R.

Example:

hadoop fs -rm hdfs://nn.example.com/file /user/hadoop/emptydir

 rmdir

Usage: hadoop fs -rmdir [--ignore-fail-on-non-empty] URI [URI ...]

Delete a directory.

Options:

--ignore-fail-on-non-empty: When using wildcards, do not fail if a directory still contains files.

Example:

hadoop fs -rmdir /user/hadoop/emptydir

 rmr

Usage: hadoop fs -rmr URI [URI ...]

Recursive version of delete.

Note: This command is deprecated. Instead use hadoop fs -rm -r

 df

Usage: hadoop fs -df [-h] URI [URI ...]

Displays free space.

Options:

The -h option will format file sizes in a "human-readable" fashion (e.g. 64.0m instead of 67108864)



Example:

hadoop fs -df /user/hadoop/dir1

 du

Usage: hadoop fs -du URI [URI ...]

Displays sizes of files and directories contained in the given directory, or the length of a file in case
it is just a file.

Example:

hadoop fs -du /user/hadoop/dir1 /user/hadoop/file1 hdfs://nn.example.com/user/hadoop/dir1

 help

Usage: hadoop fs -help

 setrep

Usage: hadoop fs -setrep [-R] [-w] <numReplicas> <path>

Changes the replication factor of a file. If path is a directory, then the command recursively
changes the replication factor of all files under the directory tree rooted at path.

Options:

The -w flag requests that the command wait for the replication to complete. This can potentially
take a very long time.

The -R flag is accepted for backwards compatibility. It has no effect.

Example:

hadoop fs -setrep -w 3 /user/hadoop/dir1

 tail

Usage: hadoop fs -tail URI

Displays the last kilobyte of the file to stdout.

Example:

hadoop fs -tail pathname

 checksum

Usage: hadoop fs -checksum URI

Returns the checksum information of a file.

Example:

hadoop fs -checksum hdfs://nn1.example.com/file1

hadoop fs -checksum file:///etc/hosts



 chgrp

Usage: hadoop fs -chgrp [-R] GROUP URI [URI ...]

Change group association of files. The user must be the owner of the files, or else a super-user.
Additional information is in the Permissions Guide.

Options

The -R option will make the change recursively through the directory structure.

 chmod

Usage: hadoop fs -chmod [-R] <MODE[,MODE]... | OCTALMODE> URI [URI ...]

Change the permissions of files. With -R, make the change recursively through the directory
structure. The user must be the owner of the file, or else a super-user. Additional information is in
the Permissions Guide.

Options

The -R option will make the change recursively through the directory structure.

 chown

Usage: hadoop fs -chown [-R] [OWNER][:[GROUP]] URI [URI ...]

Change the owner of files. The user must be a super-user. Additional information is in the
Permissions Guide.

Options

The -R option will make the change recursively through the directory structure.



EXPERIMENT-4:
Map-reducing
(i) Definition of Map-reduce
(ii) Its stages and terminologies
(iii) Word-count program to understand map-reduce
(Mapper phase, Reducer phase, Driver code)
What is MapReduce?
MapReduce is a data processing tool which is used to process data in parallel in a distributed
form. It was developed in 2004, on the basis of the paper titled "MapReduce: Simplified Data
Processing on Large Clusters," published by Google.
MapReduce is a paradigm which has two phases, the mapper phase and the reducer phase. In the
Mapper, the input is given in the form of a key-value pair.
The output of the Mapper is fed to the Reducer as input. The Reducer runs only after the Mapper is over.
The Reducer also takes input in key-value format, and the output of the Reducer is the final output.
Let's discuss the MapReduce phases to get a better understanding of its architecture. The MapReduce
task is mainly divided into 2 phases, i.e. the Map phase and the Reduce phase.
1. Map: As the name suggests, its main use is to map the input data into key-value pairs. The input to the
map may be a key-value pair where the key can be the id of some kind of address and the value is the
actual value that it keeps.
The Map() function will be executed in its memory repository on each of these input key-value pairs
and generates the intermediate key-value pairs, which work as input for the Reducer or Reduce()
function.
2. Reduce: The intermediate key-value pairs that work as input for the Reducer are shuffled, sorted, and
sent to the Reduce() function. The Reducer aggregates or groups the data based on its key-value pair as per
the reducer algorithm written by the developer.
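Before looking at the Hadoop Java code, the two phases can be simulated in plain Python to make the data flow concrete. This is only a model of the paradigm, not Hadoop code; the sample lines are illustrative:

wordcount_phases.py

from collections import defaultdict

lines = ["hi hello", "hi students", "hello students"]   # sample input records

# Map phase: each input record is turned into intermediate (key, value) pairs
mapped = []
for line in lines:
    for word in line.split():
        mapped.append((word, 1))

# Shuffle/sort phase: group all values belonging to the same key
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce phase: aggregate the grouped values per key
for key in sorted(grouped):
    print(key, sum(grouped[key]))    # prints: hello 2, hi 2, students 2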

Advantages of Map Reduce

1. Scalability

2. Flexibility

3. Security and authentication

4. Faster processing of data

5. Very simple programming model

6. Availability and resilient nature



WORDCOUNT DRIVER:
import java.io.IOException;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class WordCount extends Configured implements Tool {
public int run(String args[]) throws IOException
{
if (args.length < 2)
{
System.out.println("Please give valid inputs");
return -1;
}
JobConf conf = new JobConf(WordCount.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
conf.setMapperClass(WordMapper.class);
conf.setReducerClass(WordReducer.class);
conf.setMapOutputKeyClass(Text.class);
conf.setMapOutputValueClass(IntWritable.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
JobClient.runJob(conf);
return 0;
}
// Main Method



public static void main(String args[]) throws Exception
{
int exitCode = ToolRunner.run(new WordCount(), args);
System.out.println(exitCode);
}
}

 WORDCOUNT MAPPER:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
public class WordMapper extends MapReduceBase implements
Mapper<LongWritable,
Text, Text, IntWritable> {
// Map function
public void map(LongWritable key, Text value, OutputCollector<Text,
IntWritable> output, Reporter rep) throws IOException
{
String line = value.toString ();
// Splitting the line on spaces
for (String word : line.split(" "))
{
if (word.length() > 0)
{
output.collect(new Text(word), new IntWritable(1));
}
}
}
}



 WORDCOUNT REDUCER

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
public class WordReducer extends MapReduceBase implements Reducer<Text,
IntWritable, Text, IntWritable> {
// Reduce function
public void reduce(Text key, Iterator<IntWritable> value,
OutputCollector<Text, IntWritable> output,
Reporter rep) throws IOException
{
int count = 0;
// Counting the frequency of each words
while (value.hasNext())
{
IntWritable i = value.next();
count += i.get();
}
output.collect(key, new IntWritable(count));

}
}

 EXECUTION STEPS AND OUTPUT


[cloudera@quickstart ~]$ ls
cloudera-manager eclipse Music test.txt
cm_api.py enterprise-deployment.json parcels Videos
Desktop express-deployment.json Pictures WordCount.jar
Documents Kerberos Public workspace
Downloads lib Templates
[cloudera@quickstart ~]$ cat > /home/cloudera/inputfile.txt
hi
hello
hi students
Welecome to VITW
welcome to CSE



welecome to SPARK lab
lab
[cloudera@quickstart ~]$ cat /home/cloud era/inputfile.txt
hi
hello
hi students
welecome to VITW
Welcme to CSE
welecome to SPARK lab
lab
[cloudera@quickstart ~]$ hdfs dfs -ls /
Found 7 items
drwxr-xr-x - cloudera supergroup 0 2023-02-16 20:55 /Tejaswi
drwxrwxrwx - hdfs supergroup 0 2017-10-23 10:29 /benchmarks
drwxr-xr-x - hbase supergroup 0 2023-02-16 20:21 /hbase
drwxr-xr-x - solr solr 0 2017-10-23 10:32 /solr
drwxrwxrwt - hdfs supergroup 0 2023-02-16 20:21 /tmp
drwxr-xr-x - hdfs supergroup 0 2017-10-23 10:31 /user
drwxr-xr-x - hdfs supergroup 0 2017-10-23 10:31 /var
[cloudera@quickstart ~]$ hdfd dfs -mkdir /input
bash: hdfd: command not found
[cloudera@quickstart ~]$ hdfs dfs -mkdir /input
[cloudera@quickstart ~]$ hdfs dfs -ls /
Found 8 items
drwxr-xr-x - cloudera supergroup 0 2023-02-16 20:55 /Tejaswi
drwxrwxrwx - hdfs supergroup 0 2017-10-23 10:29 /benchmarks
drwxr-xr-x - hbase supergroup 0 2023-02-16 20:21 /hbase
drwxr-xr-x - cloudera supergroup 0 2023-02-23 21:14 /input
drwxr-xr-x - solr solr 0 2017-10-23 10:32 /solr
drwxrwxrwt - hdfs supergroup 0 2023-02-16 20:21 /tmp
drwxr-xr-x - hdfs supergroup 0 2017-10-23 10:31 /user
drwxr-xr-x - hdfs supergroup 0 2017-10-23 10:31 /var
[cloudera@quickstart ~]$ hdfs dfs -put /home/cloudera/inputfile.txt /input



23/02/23 21:15:41 WARN hdfs.DFSClient: Caught exception
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1281)
at java.lang.Thread.join(Thread.java:1355)
at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder (DFSOutputStream.ja
va:967)
at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock (DFSOutputStream.java:705
)
at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run (DFSOutputStream.java:894)
[cloudera@quickstart ~]$ hdfs dfs -put /home/cloudera/inputfile.txt /input/
put: `/input/inputfile.txt': File exists
[cloudera@quickstart ~]$ cat > /home/cloudera/Processfile.txt
hi
hello
hi students
hello students
[cloudera@quickstart ~]$ hdfs dfs -put /home/cloudera/Processfile.txt /input/
[cloudera@quickstart ~]$ hdfs dfs -cat /home/cloudera/Processfile.txt /input/
cat: `/home/cloudera/Processfile.txt': No such file or directory
cat: `/input': Is a directory
[cloudera@quickstart ~]$ hdfs dfs -cat /input/Processfile.txt
hi
hello
hi students
hello students



[cloudera@quickstart ~]$ hadoop jar /home/cloudera/WordCount.jar WordCount

/input/Processfile.txt /output

23/02/23 21:20:58 INFO client.RMProxy: Connecting to Resource Manager at /0.0.0.0:8032

23/02/23 21:20:59 INFO client.RMProxy: Connecting to Resource Manager at /0.0.0.0:8032

23/02/23 21:20:59 WARN mapreduce.JobResourceUploader: Hadoop command-line option

parsing not performed. Implement the Tool interface and execute your application with

Tool Runner to remedy this.

23/02/23 21:20:59 INFO mapred.FileInputFormat: Total input paths to process : 1

23/02/23 21:20:59 INFO mapreduce.JobSubmitter: number of splits:2

23/02/23 21:20:59 INFO mapreduce.JobSubmitter: Submitting tokens for job:

job_1677210793042_0001

23/02/23 21:21:00 INFO impl.YarnClientImpl: Submitted application

application_1677210793042_0001

23/02/23 21:21:00 INFO mapreduce.Job: The url to track the job:

http://quickstart.cloudera:8088/proxy/application_1677210793042_0001/

23/02/23 21:21:00 INFO mapreduce.Job: Running job: job_1677210793042_0001

23/02/23 21:21:08 INFO mapreduce.Job: Job job_1677210793042_0001 running in uber mode :

false

23/02/23 21:21:08 INFO mapreduce.Job: map 0% reduce 0%

23/02/23 21:21:19 INFO mapreduce.Job: map 100% reduce 0%

23/02/23 21:21:26 INFO mapreduce.Job: map 100% reduce 100%

23/02/23 21:21:26 INFO mapreduce.Job: Job job_1677210793042_0001 completed successfully

23/02/23 21:21:26 INFO mapreduce.Job: Counters: 49



File System Counters

FILE: Number of bytes read=78

FILE: Number of bytes written=430742

FILE: Number of read operations=0

FILE: Number of large read operations=0

FILE: Number of write operations=0


HDFS: Number of bytes read=264
HDFS: Number of bytes written=24
HDFS: Number of read operations=9
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=2
Launched reduce tasks=1
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=15868
Total time spent by all reduces in occupied slots (ms)=3826
Total time spent by all map tasks (ms)=15868
Total time spent by all reduce tasks (ms)=3826
Total vcore-milliseconds taken by all map tasks=15868
Total vcore-milliseconds taken by all reduce tasks=3826
Total megabyte-milliseconds taken by all map tasks=16248832
Total megabyte-milliseconds taken by all reduce tasks=3917824
Map-Reduce Framework
Map input records=4
Map output records=6
Map output bytes=60
Map output materialized bytes=84
Input split bytes=210
Combine input records=0



Combine output records=0
Reduce input groups=3
Reduce shuffle bytes=84
Reduce input records=6
Reduce output records=3
Spilled Records=12
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=175
CPU time spent (ms)=1230
Physical memory (bytes) snapshot=559370240
Virtual memory (bytes) snapshot=4519432192
Total committed heap usage (bytes)=392372224
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=54
File Output Format Counters
Bytes Written=240
[cloudera@quickstart ~]$ hdfs dfs -ls /output
Found 2 items
-rw-r--r-- 1 cloudera supergroup 0 2023-02-23 21:21 /output/_SUCCESS
-rw-r--r-- 1 cloudera supergroup 24 2023-02-23 21:21 /output/part-00000
[cloudera@quickstart ~]$ hdfs dfs -cat /output/part-r-00000
cat: `/output/part-r-00000': No such file or directory
[cloudera@quickstart ~]$ hdfs dfs -cat /output/part-00000
hello 2



hi 2
students 2
[cloudera@quickstart ~]$
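For comparison with the Java program above, the same word count can be written in a few lines of PySpark. This is a minimal sketch, assuming the file /input/Processfile.txt created above is still present in HDFS:

wordcount_pyspark.py

from pyspark import SparkContext

sc = SparkContext("local", "WordCount app")

# read the same HDFS file used in the Java run above (path assumed to exist)
lines = sc.textFile("/input/Processfile.txt")

counts = (lines.flatMap(lambda line: line.split())   # map: one record per word
               .map(lambda word: (word, 1))          # emit (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))     # reduce: sum counts per word

for word, count in counts.collect():
    print(word, count)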



EXPERIMENT-5:
Implementing Matrix-Multiplication with Hadoop Map-reduce
MATRIX DRIVER CODE
package com.lendap.hadoop;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class MatrixMultiply {
public static void main(String[] args) throws Exception {
if (args.length != 2) {
System.err.println ("Usage: MatrixMultiply <in_dir> <out_dir>");
System.exit (2);
}
Configuration conf = new Configuration ();
// M is an m-by-n matrix; N is an n-by-p matrix.
conf.set("m", "1000");
conf.set("n", "100");
conf.set("p", "1000");
@SuppressWarnings("deprecation")
Job job = new Job(conf, "MatrixMultiply");
job.setJarByClass(MatrixMultiply.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);



job.setInputFormatClass (TextInputFormat.class);
job.setOutputFormatClass (TextOutputFormat.class);
FileInputFormat.addInputPath (job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion (true);
}
}
MATRIX MAPPER CODE
package com.lendap.hadoop;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
public class Map
extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, Text> {
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
Configuration conf = context.getConfiguration();
int m = Integer.parseInt(conf.get("m"));
int p = Integer.parseInt(conf.get("p"));
String line = value.toString();
// each input line is (M, i, j, Mij) or (N, j, k, Njk)
String[] indicesAndValue = line.split(",");
Text outputKey = new Text();
Text outputValue = new Text();
if (indicesAndValue[0].equals("M")) {
for (int k = 0; k < p; k++) {
outputKey.set(indicesAndValue[1] + "," + k);
// outputKey.set(i,k);
outputValue.set(indicesAndValue[0] + "," + indicesAndValue[2]
+ "," + indicesAndValue[3]);
// outputValue.set(M,j,Mij);
context.write(outputKey, outputValue);
}
} else {
// (N, j, k, Njk);
for (int i = 0; i < m; i++) {
outputKey.set(i + "," + indicesAndValue[2]);
outputValue.set("N," + indicesAndValue[1] + ","
+ indicesAndValue[3]);
context.write(outputKey, outputValue);
}
}
}
}
MATRIX REDUCER CODE
package com.lendap.hadoop;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
import java.util.HashMap;
public class Reduce
extends org.apache.hadoop.mapreduce.Reducer<Text, Text, Text, Text> {
@Override
public void reduce(Text key, Iterable<Text> values, Context context)



throws IOException, InterruptedException {
String[] value;
//key=(i,k),
//Values = [(M/N,j,V/W),..]
HashMap<Integer, Float> hashA = new HashMap<Integer, Float>();
HashMap<Integer, Float> hashB = new HashMap<Integer, Float>();
for (Text val : values) {
value = val.toString().split(",");
if (value[0].equals("M")) {
hashA.put(Integer.parseInt(value[1]), Float.parseFloat(value[2]));
} else {
hashB.put(Integer.parseInt(value[1]), Float.parseFloat(value[2]));
}
}
int n = Integer.parseInt(context.getConfiguration().get("n"));
float result = 0.0f;
float m_ij;
float n_jk;
for (int j = 0; j < n; j++) {
m_ij = hashA.containsKey(j) ? hashA.get(j) : 0.0f;
n_jk = hashB.containsKey(j) ? hashB.get(j) : 0.0f;
result += m_ij * n_jk;
}
if (result != 0.0f) {
context.write(null,
new Text(key.toString() + "," + Float.toString(result)));
}
}
}
Output:
Before writing the code, let's first create the matrices and put them in HDFS.
Create two files M1, M2 and put the matrix values in them (separate
columns with spaces and rows with a line break).
For this example I am taking the matrices as:

12378
4 5 6 9 10
11 12
 Put the above files into HDFS at the location /user/cloudera/matrices/
hdfs dfs -mkdir /user/cloudera/matrices
hdfs dfs -put /path/to/M1 /user/cloudera/matrices/
hdfs dfs -put /path/to/M2 /user/cloudera/matrices/
Hadoop does its mapping and reducing work. After the successful completion of
the above process, view the output by:
hdfs dfs -cat /user/cloudera/mat_output/*
The above command should output the resultant matrix.
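As a sanity check on whichever small matrices you choose, the expected product can be computed in plain Python and compared with the job's output. This is a minimal sketch; the 2x2 matrices M and N below are illustrative values only, not the ones from the run above:

matrix_check.py

# reference product for comparing against the MapReduce output
# M and N are illustrative; substitute the matrices you put in HDFS
M = [[1, 2],
     [3, 4]]
N = [[5, 6],
     [7, 8]]

rows, inner, cols = len(M), len(N), len(N[0])
for i in range(rows):
    for k in range(cols):
        # same (i,k) cell the reducer emits: sum over j of M[i][j] * N[j][k]
        value = sum(M[i][j] * N[j][k] for j in range(inner))
        print("%d,%d,%s" % (i, k, float(value)))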



EXPERIMENT-7 & 8:
(i) Create SQL tables of employees: an Employee table with (id, designation) and a Salary table with (salary, dept id)
(ii) Loading data

Create SQL tables of employees:
An Employee table with (id, designation)
A Salary table with (salary, dept id)

Hive Commands
Hive supports Data Definition Language (DDL), Data Manipulation Language (DML), and
user-defined functions.
Hive DDL Commands
create database
drop database
create table
drop table
alter table
create index
create views
Hive DML Commands
Select
Where
Group By
Order By
Load Data
Join:
o Inner Join
o Left Outer Join
o Right Outer Join
o Full Outer Join



Hive DDL Commands
Create Database Statement
 hive>create database cse;
OK
Time taken: 0.129 seconds
 hive> create database bigdata;
OK
Time taken: 0.051 seconds
 hive> show databases;
OK
bigdata
cse
default
Time taken: 0.013 seconds, Fetched: 3 row(s)
Drop database
hive> drop database cse;
OK
Time taken: 0.134 seconds
hive> show databases;
OK
bigdata
default
Time taken: 0.155 seconds, Fetched: 2 row(s)
Create a table
hive> create table employee(name string, id int, sal float, designation string) row format
delimited fields terminated by ',' lines terminated by '\n' stored as textfile;
OK
Time taken: 2.261 seconds
hive> show tables;
OK



employee
Time taken: 0.046 seconds, Fetched: 1 row(s)
hive> desc employee;
OK
name string
id int
sal float
designation string
Time taken: 0.096 seconds, Fetched: 4 row(s)
Load data from local file system to HDFS then load that data to HIVE
 [cloudera@quickstart ~]$ hdfs dfs -mkdir /bigdata
 [cloudera@quickstart ~]$ hdfs dfs -ls /
Found 7 items
drwxrwxrwx - hdfs supergroup 0 2017-10-23 10:29 /benchmarks
drwxr-xr-x - cloudera supergroup 0 2023-04-14 21:41 /bigdata
drwxr-xr-x - hbase supergroup 0 2023-04-14 20:52 /hbase
drwxr-xr-x - solr solr 0 2017-10-23 10:32 /solr
drwxrwxrwt - hdfs supergroup 0 2023-03-03 21:03 /tmp
drwxr-xr-x - hdfs supergroup 0 2017-10-23 10:31 /user
drwxr-xr-x - hdfs supergroup 0 2017-10-23 10:31 /var
 [cloudera@quickstart ~]$ hdfs dfs -put /home/cloudera/Desktop/sample.txt
/bigdata
 [cloudera@quickstart ~]$ hdfs dfs -ls /bigdata
Found 1 items
-rw-r--r-- 1 cloudera supergroup 59 2023-04-14 21:42 /bigdata/sample.txt
LOAD DATA TO HIVE
hive> load data inpath '/bigdata/sample.txt' into table employee;
Loading data to table default.employee
Table default.employee stats: [numFiles=1, totalSize=59]



OK
Time taken: 0.803 seconds
Altering and Dropping Tables
 HIVE>create table student(id int,name string);
OK
Time taken: 0.483 seconds
 Hive>show tables;
OK
employee
student
Time taken: 0.085 seconds, Fetched: 2 row(s)
 hive> alter table student rename to cse;
OK
Time taken: 0.118 seconds
 hive> show tables;
OK
cse
employee
Time taken: 0.024 seconds, Fetched: 2 row(s)
 hive> select *from employee;
OK
lucky 1 123.0 asst.prof
veda 2 456.0 child
abhi 3 789.0 business
Time taken: 0.933 seconds, Fetched: 3 row(s)



1. hive> ALTER TABLE cse ADD COLUMNS (col INT);
2. hive> ALTER TABLE HIVE_TABLE ADD COLUMNS (col1 INT COMMENT 'a comment');
3. hive> ALTER TABLE HIVE_TABLE REPLACE COLUMNS (col2 INT, weight STRING, baz INT COMMENT 'baz replaces new_col1');
Relational Operators
 hive> select avg(sal) from employee;
Query ID = cloudera_20230414221212_7d3c1250-d45b-4441-88e7-9127d8f389c2
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1681530689884_0001, Tracking URL =
http://quickstart.cloudera:8088/proxy/application_1681530689884_0001/
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1681530689884_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2023-04-14 22:12:43,864 Stage-1 map = 0%, reduce = 0%
2023-04-14 22:12:51,667 Stage-1 map = 100%, reduce = 0%, Cumulative CPU
0.98 sec
2023-04-14 22:13:02,273 Stage-1 map = 100%, reduce = 100%, Cumulative CPU
2.11 sec
Map Reduce Total cumulative CPU time: 2 seconds 110 msec
Ended Job = job_1681530689884_0001
Map Reduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 2.11 sec HDFS Read: 8237



HDFS Write: 6 SUCCESS
Total Map Reduce CPU Time Spent: 2 seconds 110 msec
OK
456.0
Time taken: 31.393 seconds, Fetched: 1 row(s)
hive> select min(sal) from employee;
SUCCESS
Total Map Reduce CPU Time Spent: 1 seconds 740 msec
OK
789999.0
Time taken: 19.602 seconds, Fetched: 1 row(s)
hive> select max(sal) from employee;
SUCCESS
Total Map Reduce CPU Time Spent: 1 seconds 710 msec
OK
4623578.0
Time taken: 19.673 seconds, Fetched: 1 row(s)
 hive> SELECT * FROM employee WHERE sal >= 800000;
OK
hpriya 345 4623578.0 manager
hddsj 256433522323.0 svbahdg
Time taken: 0.763 seconds, Fetched: 2 row(s)
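The same queries can also be issued from PySpark against the Hive metastore. This is a minimal sketch, assuming Spark is configured with Hive support and that the employee table above exists in the default database:

hive_from_pyspark.py

from pyspark.sql import SparkSession

# enableHiveSupport() lets Spark SQL read tables registered in the Hive metastore
spark = SparkSession.builder \
    .appName("hive-from-pyspark") \
    .enableHiveSupport() \
    .getOrCreate()

# same aggregates as the Hive queries above
spark.sql("SELECT avg(sal), min(sal), max(sal) FROM employee").show()

# same filter as the Hive WHERE query above
spark.sql("SELECT * FROM employee WHERE sal >= 800000").show()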



EXPERIMENT-9:
(i) PySpark definition (Apache PySpark) and the difference between PySpark, Scala, and pandas
What is PySpark?
Apache Spark is written in the Scala programming language. PySpark has been released in order to support the
collaboration of Apache Spark and Python; it actually is a Python API for Spark. In addition, PySpark
helps you interface with Resilient Distributed Datasets (RDDs) in Apache Spark and the Python programming
language. This has been achieved by taking advantage of the Py4j library.
(ii) PySpark files and class methods
(i) get(filename)
(ii) getrootdirectory()
PySpark provides the facility to upload your files using sc.addFile. We can also get the path of the
working directory using SparkFiles.get. Moreover, to resolve the paths to files added
through SparkContext.addFile(), the following class methods are available in
SparkFiles:
o get(filename)
o getrootdirectory()
Let's learn about the class methods in detail.
Note: SparkFiles contains only class methods that can be used as required. Users should not create
SparkFiles instances.
Class Methods of PySpark SparkFiles
o get(filename)
The get(filename) specifies the path of the file which is added
through SparkContext.addFile().
import os
class SparkFiles(object):
"""
Resolves paths to files added through
L{SparkContext.addFile()<pyspark.context.SparkContext.addFile>}.
SparkFiles consist of only class methods; users should not create SparkFiles
instances.
"""
_root_directory = None
_is_running_on_worker = False
_sc = None
def __init__(self):
raise NotImplementedError("Do not construct SparkFiles objects")
@classmethod
def get(cls, filename):
"""
Get the absolute path of a file added through C{SparkContext.addFile()}.
"""
path = os.path.join(SparkFiles.getRootDirectory(), filename)
return os.path.abspath(path)
@classmethod
def getRootDirectory(cls):
o getrootdirectory()
This class method specifies the path to the root directory, which contains the files
added through SparkContext.addFile().
import os
class SparkFiles(object):
"""
Resolves paths to files added through
L{SparkContext.addFile()<pyspark.context.SparkContext.addFile>}.
SparkFiles contains only classmethods; users should not create SparkFiles
instances.
"""
_root_directory = None
_is_running_on_worker = False
_sc = None
def __init__(self):
raise NotImplementedError("Do not construct SparkFiles objects")
@classmethod
def get(cls, filename):
...
@classmethod
def getRootDirectory(cls):
"""
Get the root directory that contains files added through
C{SparkContext.addFile()}.
"""
if cls._is_running_on_worker:
return cls._root_directory
else:
# This will have to change if we support multiple SparkContexts:
return cls._sc._jvm.spark.SparkFiles.getRootDirectory()
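A short usage sketch of both class methods follows; the file path and name (readme.txt) are illustrative placeholders:

sparkfiles_demo.py

from pyspark import SparkContext, SparkFiles

sc = SparkContext("local", "SparkFiles app")

# ship a local file to every node in the cluster (path is illustrative)
sc.addFile("/home/cloudera/readme.txt")

# resolve where the shipped copy lives on this node
print("root directory ->", SparkFiles.getRootDirectory())
print("absolute path  ->", SparkFiles.get("readme.txt"))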
EXPERIMENT-10:
PySpark RDDs
(i) What are RDDs?
(ii) Ways to create an RDD
(i) parallelized collections
(ii) external dataset
(iii) existing RDDs
(iv) Spark RDD operations
(count(), foreach(), collect(), join(), cache())
 PySpark is the Python API for Apache Spark, an open source, distributed computing framework
and set of libraries for real-time, large-scale data processing. If you're already familiar with Python
and libraries such as Pandas, then PySpark is a good language to learn to create more scalable
analyses and pipelines.
 PySpark has been released in order to support the collaboration of Apache Spark
and Python; it actually is a Python API for Spark. In addition, PySpark helps you
interface with Resilient Distributed Datasets (RDDs) in Apache Spark and Python
programming language.
 PySpark is a commonly used tool to build ETL pipelines for large datasets.
 It supports all basic data transformation features like sorting, mapping, joins, etc.
 PySpark supports most of Spark's features such as Spark SQL, DataFrame, Streaming, MLlib
(Machine Learning) and Spark Core. {Spark SQL is a Spark module for structured data
processing. It provides a programming abstraction called DataFrame and can also act as
distributed SQL query engine.}
 PySpark can create distributed datasets from any storage source supported by Hadoop,
including our local file system, HDFS, Cassandra, HBase, Amazon S3, etc.
 What are RDDs?
RDD stands for Resilient Distributed Dataset; these are the elements that run and operate on multiple
nodes to do parallel processing on a cluster. RDDs are immutable elements, which means once
you create an RDD you cannot change it. RDDs are fault tolerant as well, hence in case of any
failure, they recover automatically. You can apply multiple operations on these RDDs to achieve a
certain task.
 Ways to create an RDD
There are the following ways to create an RDD in Spark: 1. using a parallelized
collection, 2. from an existing Apache Spark RDD, and 3. from external datasets.
 To create an RDD using parallelize()
The parallelize() method of the Spark context is used to create a Resilient Distributed
Dataset (RDD) from an iterable or a collection.
Syntax
sparkContext.parallelize(iterable, numSlices)
Parameters
 iterable: This is an iterable or a collection from which an RDD has to be created.
 numSlices: This is an optional parameter that indicates the number of slices to cut the
RDD into. The number of slices can be manually provided by setting this parameter.
Otherwise, Spark will set this to the default parallelism that is inferred from the
cluster.
This method returns an RDD
Code example
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('educative-answers').config("spark.some.config.option",
"some-value").getOrCreate()
collection = [("James","Smith","USA","CA"),
("Michael","Rose","USA","NY"),
("Robert","Williams","USA","CA"),
("Maria","Jones","USA","FL")
]
sc = spark.sparkContext
rdd = sc.parallelize(collection)
rdd_elements = rdd.collect()
print("RDD with default slices - ", rdd_elements)
print("Number of partitions - ", rdd.getNumPartitions())
print("-" * 8)
numSlices = 8
rdd = sc.parallelize(collection, numSlices)
rdd_elements = rdd.collect()
print("RDD with default slices - ", rdd_elements)
print("Number of partitions - ", rdd.getNumPartitions())
output:
RDD with default slices - [('James', 'Smith', 'USA', 'CA'), ('Michael', 'Rose', 'USA', 'NY'), ('Robert',
'Williams', 'USA', 'CA'), ('Maria', 'Jones', 'USA', 'FL')] Number of partitions - 4 --------
RDD with default slices - [('James', 'Smith', 'USA', 'CA'), ('Michael', 'Rose', 'USA', 'NY'), ('Robert',
'Williams', 'USA', 'CA'), ('Maria', 'Jones', 'USA', 'FL')] Number of partitions – 8
 From external datasets (Referencing a dataset in external storage system)
From any storage source supported by Hadoop, including our local file system, Spark can create
RDDs. Apache Spark supports sequence files, text files, and any other Hadoop input
format.
We can create text file RDDs by SparkContext's textFile method. This method takes the URL of
the file (either a local path on the machine, or an hdfs://, s3n://, etc. URL) and reads
the file as a collection of lines.
Always be careful that the path of the local system and worker node should always be similar.
The file should be available at the same place in the local file system and worker node.
We can copy the file to the worker nodes. We can also use a network-mounted shared file
system.
To load a dataset from an external storage system, we can use data frame reader interface.
External storage system such as file systems, key-value stores. It supports many file formats like:
 CSV (String path)
 Json (String path)
 Textfile (String path)
CSV (String Path)
which returns dataset<Row> as a result.
Example:
import org.apache.spark.sql.SparkSession
object DataFormat {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder.appName("ExtDataEx1").master("local").getOrCreate()
val dataRDD = spark.read.csv("path/of/csv/file").rdd
}
}
json (String Path)
JSON file (one object per line ) which returns Dataset<Row> as a result.
Example:
val dataRDD = spark.read.json("path/of/json/file").rdd
textfile (String Path)
a text file which returns Dataset of a string as a result.
Example:
val dataRDD = spark.read.textFile("path/of/text/file").rdd
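The Scala snippets above have direct PySpark equivalents. This is a minimal sketch; the file paths are placeholders that must point at real files:

external_rdd.py

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ExtDataEx1").master("local").getOrCreate()
sc = spark.sparkContext

# from a text file: one RDD element per line (path is a placeholder)
text_rdd = sc.textFile("path/of/text/file")

# from CSV / JSON via the DataFrame reader, then down to an RDD of Rows
csv_rdd = spark.read.csv("path/of/csv/file").rdd
json_rdd = spark.read.json("path/of/json/file").rdd

print(text_rdd.take(2), csv_rdd.take(2), json_rdd.take(2))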
 From existing Apache Spark RDDs
The process of creating another dataset from existing ones is called a transformation.
As a result, a transformation always produces a new RDD. As RDDs are immutable, no
changes take place in them once created. This property maintains consistency over the
cluster.
Example:
val words = spark.sparkContext.parallelize(Seq("sun", "rises", "in", "the", "east", "and", "sets", "in", "the",
"west"))
val wordPair = words.map(w => (w.charAt(0), w))
wordPair.foreach(println)
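The same transformation written in PySpark, for comparison:

existing_rdd.py

from pyspark import SparkContext

sc = SparkContext("local", "Existing RDD app")

words = sc.parallelize(["sun", "rises", "in", "the", "east", "and", "sets", "in", "the", "west"])

# map() is a transformation: it builds a new RDD of (first letter, word) pairs
word_pair = words.map(lambda w: (w[0], w))

print(word_pair.collect())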
 RDD operations
To apply operations on these RDDs, there are two ways −
 Transformation
 Action
Transformation − These are the operations which are applied on an RDD to create a new
RDD. filter, groupBy and map are examples of transformations.
Action − These are the operations that are applied on an RDD, which instruct Spark to
perform computation and send the result back to the driver.
To apply any operation in PySpark, we need to create a PySpark RDD first. The
following code block has the detail of a PySpark RDD Class −
class pyspark.RDD (
jrdd,
ctx,
jrdd_deserializer = AutoBatchedSerializer(PickleSerializer())
)
A few operations on words
count()
Number of elements in the RDD is returned.
count.py
from pyspark import SparkContext
sc = SparkContext("local", "count app")
words = sc.parallelize (
["scala",
"java",
"hadoop",
"spark",
"akka",
"spark vs hadoop",
"pyspark",
"pyspark and spark"]
)
counts = words.count()
print "Number of elements in RDD -> %i" % (counts)
count.py
Command − The command for count() is −
$SPARK_HOME/bin/spark-submit count.py
Output − The output for the above command is −
Number of elements in RDD → 8
collect()
All the elements in the RDD are returned.
collect.py
from pyspark import SparkContext
sc = SparkContext("local", "Collect app")
words = sc.parallelize (
["scala",
"java",
"hadoop",
"spark",
"akka",
"spark vs hadoop",
"pyspark",
"pyspark and spark"]
)
coll = words.collect()
print "Elements in RDD -> %s" % (coll)
collect.py
Command − The command for collect() is −
$SPARK_HOME/bin/spark-submit collect.py
Output − The output for the above command is −
Elements in RDD -> [
'scala',
'java',
'hadoop',
'spark',
'akka',
'spark vs hadoop',
'pyspark',
'pyspark and spark'
]
foreach(f)
Applies the function f to each element of the RDD. In the
following example, we call a print function in foreach, which prints all the elements in the RDD.
foreach.py
from pyspark import SparkContext
sc = SparkContext("local", "ForEach app")
words = sc.parallelize (
["scala",
"java",
"hadoop",
"spark",
"akka",
"spark vs hadoop",
"pyspark",
"pyspark and spark"]
)
def f(x): print(x)
fore = words.foreach(f)
foreach.py
Command − The command for foreach(f) is −
$SPARK_HOME/bin/spark-submit foreach.py
Output − The output for the above command is −
scala
java
hadoop
spark
akka
spark vs hadoop
pyspark
pyspark and spark
join(other, numPartitions = None)
It returns RDD with a pair of elements with the matching keys and all the values for that
particular key. In the following example, there are two pair of elements in two different RDDs.
After joining these two RDDs, we get an RDD with elements having matching keys and their
values.
join.py
from pyspark import SparkContext
sc = SparkContext("local", "Join app")
x = sc.parallelize([("spark", 1), ("hadoop", 4)])
y = sc.parallelize([("spark", 2), ("hadoop", 5)])
joined = x.join(y)
final = joined.collect()
print "Join RDD -> %s" % (final)
join.py
Downloaded by varahagiri geddam
Command − The command for join(other, numPartitions = None) is −
$SPARK_HOME/bin/spark-submit join.py
Output − The output for the above command is −
Join RDD -> [
('spark', (1, 2)),
('hadoop', (4, 5))
]
cache()
Persist this RDD with the default storage level (MEMORY_ONLY). You can also check if the
RDD is cached or not.
Downloaded by varahagiri geddam ([email protected])
lOMoARcPSD|22754375
ROLLNO:
65
cache.py
from pyspark import SparkContext
sc = SparkContext("local", "Cache app")
words = sc.parallelize (
["scala",
"java",
"hadoop",
"spark",
"akka",
"spark vs hadoop",
"pyspark",
"pyspark and spark"]
)
words.cache()
caching = words.persist().is_cached
print "Words got chached> %s" % (caching)
cache.py
Command − The command for cache() is −
$SPARK_HOME/bin/spark-submit cache.py
Output − The output for the above program is −
Downloaded by varahagiri geddam
Words got cached -> True 22
 WORDCOUNT DRIVER

import java.io.IOException;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCount extends Configured implements Tool {

    // Driver code: configures the job and submits it to the cluster
    public int run(String args[]) throws IOException {
        if (args.length < 2) {
            System.out.println("Please give valid inputs");
            return -1;
        }
        JobConf conf = new JobConf(WordCount.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        conf.setMapperClass(WordMapper.class);
        conf.setReducerClass(WordReducer.class);
        conf.setMapOutputKeyClass(Text.class);
        conf.setMapOutputValueClass(IntWritable.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        JobClient.runJob(conf);
        return 0;
    }

    // Main Method
    public static void main(String args[]) throws Exception {
        int exitCode = ToolRunner.run(new WordCount(), args);
        System.out.println(exitCode);
    }
}
 WORDCOUNT MAPPER

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {

    // Map function: emits (word, 1) for every word in the input line
    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter rep) throws IOException {
        String line = value.toString();
        // Splitting the line on spaces
        for (String word : line.split(" ")) {
            if (word.length() > 0) {
                output.collect(new Text(word), new IntWritable(1));
            }
        }
    }
}
 WORDCOUNT REDUCER

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {

    // Reduce function: sums the counts emitted for each word
    public void reduce(Text key, Iterator<IntWritable> value, OutputCollector<Text, IntWritable> output, Reporter rep) throws IOException {
        int count = 0;
        // Counting the frequency of each word
        while (value.hasNext()) {
            IntWritable i = value.next();
            count += i.get();
        }
        output.collect(key, new IntWritable(count));
    }
}
EXECUTION STEPS AND OUTPUT

[cloudera@quickstart ~]$ ls
cloudera-manager  eclipse                     Music     test.txt
cm_api.py         enterprise-deployment.json  parcels   Videos
Desktop           express-deployment.json     Pictures  WordCount.jar
Documents         kerberos                    Public    workspace
Downloads         lib                         Templates

[cloudera@quickstart ~]$ cat > /home/cloudera/inputfile.txt
hi
hello
hi students
welecome to VITW
welcme to CSE
welecome to SPARK lab
lab

[cloudera@quickstart ~]$ cat /home/cloudera/inputfile.txt
hi
hello
hi students
welecome to VITW
welcme to CSE
welecome to SPARK lab
lab
[cloudera@quickstart ~]$ hdfs dfs -ls /
Found 7 items
drwxr-xr-x   - cloudera supergroup   0 2023-02-16 20:55 /Tejaswi
drwxrwxrwx   - hdfs     supergroup   0 2017-10-23 10:29 /benchmarks
drwxr-xr-x   - hbase    supergroup   0 2023-02-16 20:21 /hbase
drwxr-xr-x   - solr     solr         0 2017-10-23 10:32 /solr
drwxrwxrwt   - hdfs     supergroup   0 2023-02-16 20:21 /tmp
drwxr-xr-x   - hdfs     supergroup   0 2017-10-23 10:31 /user
drwxr-xr-x   - hdfs     supergroup   0 2017-10-23 10:31 /var

[cloudera@quickstart ~]$ hdfd dfs -mkdir /input
bash: hdfd: command not found
[cloudera@quickstart ~]$ hdfs dfs -mkdir /input
[cloudera@quickstart ~]$ hdfs dfs -ls /
Found 8 items
drwxr-xr-x   - cloudera supergroup   0 2023-02-16 20:55 /Tejaswi
drwxrwxrwx   - hdfs     supergroup   0 2017-10-23 10:29 /benchmarks
drwxr-xr-x   - hbase    supergroup   0 2023-02-16 20:21 /hbase
drwxr-xr-x   - cloudera supergroup   0 2023-02-23 21:14 /input
drwxr-xr-x   - solr     solr         0 2017-10-23 10:32 /solr
drwxrwxrwt   - hdfs     supergroup   0 2023-02-16 20:21 /tmp
drwxr-xr-x   - hdfs     supergroup   0 2017-10-23 10:31 /user
drwxr-xr-x   - hdfs     supergroup   0 2017-10-23 10:31 /var

[cloudera@quickstart ~]$ hdfs dfs -put /home/cloudera/inputfile.txt /input
23/02/23 21:15:41 WARN hdfs.DFSClient: Caught exception
java.lang.InterruptedException
        at java.lang.Object.wait(Native Method)
        at java.lang.Thread.join(Thread.java:1281)
        at java.lang.Thread.join(Thread.java:1355)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.closeResponder(DFSOutputStream.java:967)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.endBlock(DFSOutputStream.java:705)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:894)

[cloudera@quickstart ~]$ hdfs dfs -put /home/cloudera/inputfile.txt /input/
put: `/input/inputfile.txt': File exists

[cloudera@quickstart ~]$ cat > /home/cloudera/Processfile.txt
hi
hello
hi students
hello students

[cloudera@quickstart ~]$ hdfs dfs -put /home/cloudera/Processfile.txt /input/
[cloudera@quickstart ~]$ hdfs dfs -cat /home/cloudera/Processfile.txt /input/
cat: `/home/cloudera/Processfile.txt': No such file or directory
cat: `/input': Is a directory
[cloudera@quickstart ~]$ hdfs dfs -cat /input/Processfile.txt
hi
hello
hi students
hello students
[cloudera@quickstart ~]$ hadoop jar /home/cloudera/WordCount.jar WordCount /input/Processfile.txt /output
23/02/23 21:20:58 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
23/02/23 21:20:59 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
23/02/23 21:20:59 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
23/02/23 21:20:59 INFO mapred.FileInputFormat: Total input paths to process : 1
23/02/23 21:20:59 INFO mapreduce.JobSubmitter: number of splits:2
23/02/23 21:20:59 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1677210793042_0001
23/02/23 21:21:00 INFO impl.YarnClientImpl: Submitted application application_1677210793042_0001
23/02/23 21:21:00 INFO mapreduce.Job: The url to track the job:
http://quickstart.cloudera:8088/proxy/application_1677210793042_0001/
23/02/23 21:21:00 INFO mapreduce.Job: Running job: job_1677210793042_0001
23/02/23 21:21:08 INFO mapreduce.Job: Job job_1677210793042_0001 running in uber mode : false
23/02/23 21:21:08 INFO mapreduce.Job: map 0% reduce 0%
23/02/23 21:21:19 INFO mapreduce.Job: map 100% reduce 0%
23/02/23 21:21:26 INFO mapreduce.Job: map 100% reduce 100%
23/02/23 21:21:26 INFO mapreduce.Job: Job job_1677210793042_0001 completed successfully
23/02/23 21:21:26 INFO mapreduce.Job: Counters: 49
   File System Counters
      FILE: Number of bytes read=78
      FILE: Number of bytes written=430742
      FILE: Number of read operations=0
      FILE: Number of large read operations=0
      FILE: Number of write operations=0
      HDFS: Number of bytes read=264
      HDFS: Number of bytes written=24
      HDFS: Number of read operations=9
      HDFS: Number of large read operations=0
      HDFS: Number of write operations=2
   Job Counters
      Launched map tasks=2
      Launched reduce tasks=1
      Data-local map tasks=2
      Total time spent by all maps in occupied slots (ms)=15868
      Total time spent by all reduces in occupied slots (ms)=3826
      Total time spent by all map tasks (ms)=15868
      Total time spent by all reduce tasks (ms)=3826
      Total vcore-milliseconds taken by all map tasks=15868
      Total vcore-milliseconds taken by all reduce tasks=3826
      Total megabyte-milliseconds taken by all map tasks=16248832
      Total megabyte-milliseconds taken by all reduce tasks=3917824
   Map-Reduce Framework
      Map input records=4
      Map output records=6
      Map output bytes=60
      Map output materialized bytes=84
      Input split bytes=210
      Combine input records=0
      Combine output records=0
      Reduce input groups=3
      Reduce shuffle bytes=84
      Reduce input records=6
      Reduce output records=3
      Spilled Records=12
      Shuffled Maps =2
      Failed Shuffles=0
      Merged Map outputs=2
      GC time elapsed (ms)=175
      CPU time spent (ms)=1230
      Physical memory (bytes) snapshot=559370240
      Virtual memory (bytes) snapshot=4519432192
      Total committed heap usage (bytes)=392372224
   Shuffle Errors
      BAD_ID=0
      CONNECTION=0
      IO_ERROR=0
      WRONG_LENGTH=0
      WRONG_MAP=0
      WRONG_REDUCE=0
   File Input Format Counters
      Bytes Read=54
   File Output Format Counters
      Bytes Written=24
0
[cloudera@quickstart ~]$ hdfs dfs -ls /output
Found 2 items
-rw-r--r--   1 cloudera supergroup    0 2023-02-23 21:21 /output/_SUCCESS
-rw-r--r--   1 cloudera supergroup   24 2023-02-23 21:21 /output/part-00000
[cloudera@quickstart ~]$ hdfs dfs -cat /output/part-r-00000
cat: `/output/part-r-00000': No such file or directory
[cloudera@quickstart ~]$ hdfs dfs -cat /output/part-00000
hello     2
hi        2
students  2
[cloudera@quickstart ~]$
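Note: the same word count can be expressed far more compactly in Spark. The following is a minimal PySpark sketch, not the lab's Java job; the input path mirrors the HDFS path used above, while the output path /output_spark is only an assumed example.

from pyspark import SparkContext

sc = SparkContext("local", "WordCount")

# Read the input file from HDFS, split lines into words, and count each word
counts = (sc.textFile("/input/Processfile.txt")
            .flatMap(lambda line: line.split(" "))
            .filter(lambda word: len(word) > 0)
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

counts.saveAsTextFile("/output_spark")          # assumed output directory
print(counts.collect())                         # e.g. [('hello', 2), ('hi', 2), ('students', 2)]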
EXPERIMENT-5:

Implementing Matrix-Multiplication with Hadoop Map-reduce

MATRIX DRIVER CODE

package com.lendap.hadoop;

import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class MatrixMultiply {

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: MatrixMultiply <in_dir> <out_dir>");
            System.exit(2);
        }
        Configuration conf = new Configuration();
        // M is an m-by-n matrix; N is an n-by-p matrix.
        conf.set("m", "1000");
        conf.set("n", "100");
        conf.set("p", "1000");

        @SuppressWarnings("deprecation")
        Job job = new Job(conf, "MatrixMultiply");
        job.setJarByClass(MatrixMultiply.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}
MATRIX MAPPER CODE

package com.lendap.hadoop;

import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;

public class Map extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, Text> {

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        int m = Integer.parseInt(conf.get("m"));
        int p = Integer.parseInt(conf.get("p"));
        String line = value.toString();
        // Each input line is (M, i, j, Mij) or (N, j, k, Njk)
        String[] indicesAndValue = line.split(",");
        Text outputKey = new Text();
        Text outputValue = new Text();
        if (indicesAndValue[0].equals("M")) {
            for (int k = 0; k < p; k++) {
                outputKey.set(indicesAndValue[1] + "," + k);        // outputKey.set(i,k);
                outputValue.set(indicesAndValue[0] + "," + indicesAndValue[2]
                        + "," + indicesAndValue[3]);                // outputValue.set(M,j,Mij);
                context.write(outputKey, outputValue);
            }
        } else {
            // (N, j, k, Njk)
            for (int i = 0; i < m; i++) {
                outputKey.set(i + "," + indicesAndValue[2]);
                outputValue.set("N," + indicesAndValue[1] + ","
                        + indicesAndValue[3]);
                context.write(outputKey, outputValue);
            }
        }
    }
}
MATRIX REDUCER CODE

package com.lendap.hadoop;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
import java.util.HashMap;

public class Reduce extends org.apache.hadoop.mapreduce.Reducer<Text, Text, Text, Text> {

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String[] value;
        // key = (i,k), values = [(M/N, j, V/W), ...]
        HashMap<Integer, Float> hashA = new HashMap<Integer, Float>();
        HashMap<Integer, Float> hashB = new HashMap<Integer, Float>();
        for (Text val : values) {
            value = val.toString().split(",");
            if (value[0].equals("M")) {
                hashA.put(Integer.parseInt(value[1]), Float.parseFloat(value[2]));
            } else {
                hashB.put(Integer.parseInt(value[1]), Float.parseFloat(value[2]));
            }
        }
        int n = Integer.parseInt(context.getConfiguration().get("n"));
        float result = 0.0f;
        float m_ij;
        float n_jk;
        // Dot product of row i of M and column k of N
        for (int j = 0; j < n; j++) {
            m_ij = hashA.containsKey(j) ? hashA.get(j) : 0.0f;
            n_jk = hashB.containsKey(j) ? hashB.get(j) : 0.0f;
            result += m_ij * n_jk;
        }
        if (result != 0.0f) {
            context.write(null,
                    new Text(key.toString() + "," + Float.toString(result)));
        }
    }
}
Output:

Before writing the code, let's first create the matrices and put them in HDFS.

 Create two files M1, M2 and put the matrix values (separate columns with spaces and rows with a line break).

For this example I am taking the matrices as:

M1:        M2:
1 2 3      7  8
4 5 6      9  10
           11 12

 Put the above files to HDFS at location /user/cloudera/matrices/
hdfs dfs -mkdir /user/cloudera/matrices
hdfs dfs -put /path/to/M1 /user/cloudera/matrices/
hdfs dfs -put /path/to/M2 /user/cloudera/matrices/

Hadoop does its mapping and reducing work. After the successful completion of the above process, view the output by:

hdfs dfs -cat /user/cloudera/mat_output/*

The above command should output the resultant matrix.
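For comparison, the same one-step multiplication can be sketched with PySpark RDD joins. This is a minimal illustration of the algorithm used above, not the lab's MapReduce job; the small 2x2 matrices and the (matrix, row, col, value) tuple format are assumptions made for the example.

from pyspark import SparkContext

sc = SparkContext("local", "MatrixMultiplySketch")

# Example matrices: M = [[1,2],[3,4]], N = [[5,6],[7,8]], entries as (matrix, row, col, value)
M = sc.parallelize([("M", 0, 0, 1.0), ("M", 0, 1, 2.0), ("M", 1, 0, 3.0), ("M", 1, 1, 4.0)])
N = sc.parallelize([("N", 0, 0, 5.0), ("N", 0, 1, 6.0), ("N", 1, 0, 7.0), ("N", 1, 1, 8.0)])

# Key both matrices by the shared dimension j: M(i,j) -> (j, (i, value)), N(j,k) -> (j, (k, value))
m_by_j = M.map(lambda t: (t[2], (t[1], t[3])))
n_by_j = N.map(lambda t: (t[1], (t[2], t[3])))

# Join on j, multiply the paired entries, then sum over j for every output cell (i, k)
products = m_by_j.join(n_by_j).map(
    lambda kv: ((kv[1][0][0], kv[1][1][0]), kv[1][0][1] * kv[1][1][1]))
result = products.reduceByKey(lambda a, b: a + b)

print(sorted(result.collect()))
# [((0, 0), 19.0), ((0, 1), 22.0), ((1, 0), 43.0), ((1, 1), 50.0)]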
EXPERIMENT-7&8:

(i) Create a SQL table of employees
    Employee table with (id, designation)
    Salary table (salary, dept id)

(ii) Loading data
    Create a SQL table of employees
    Employee table with (id, designation)
    Salary table (salary, dept id)

Hive Commands

Hive supports Data Definition Language (DDL), Data Manipulation Language (DML) and User Defined Functions.

Hive DDL Commands

create database
drop database
create table
drop table
alter table
create index
create views

Hive DML Commands

Select
Where
Group By
Order By
Load Data
Join (a PySpark join sketch is given after this list):
o Inner Join
o Left Outer Join
o Right Outer Join
o Full Outer Join
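The join types listed above are not demonstrated on the lab tables, so the following is only a hedged PySpark sketch of how an inner and a left outer join could be written against Hive tables; the table and column names (employee.id, salary.id, salary.salary) are assumptions, not taken from the lab output.

from pyspark.sql import SparkSession

# Hypothetical tables: employee(id, designation) and salary(id, salary, deptid)
spark = SparkSession.builder.appName("hive-join-sketch").enableHiveSupport().getOrCreate()

# Inner join: only employees that have a matching salary row
inner = spark.sql("""
    SELECT e.id, e.designation, s.salary
    FROM employee e JOIN salary s ON e.id = s.id
""")

# Left outer join: every employee, with NULL salary when no match exists
left_outer = spark.sql("""
    SELECT e.id, e.designation, s.salary
    FROM employee e LEFT OUTER JOIN salary s ON e.id = s.id
""")

inner.show()
left_outer.show()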
Hive DDL Commands

Create Database Statement

 hive> create database cse;
OK
Time taken: 0.129 seconds

 hive> create database bigdata;
OK
Time taken: 0.051 seconds

 hive> show databases;
OK
bigdata
cse
default
Time taken: 0.013 seconds, Fetched: 3 row(s)

Drop database

hive> drop database cse;
OK
Time taken: 0.134 seconds
hive> show databases;
OK
bigdata
default
Time taken: 0.155 seconds, Fetched: 2 row(s)

Create a table

hive> create table employee(name string, id int, sal float, designation string) row format delimited fields terminated by ',' lines terminated by '\n' stored as textfile;
OK
Time taken: 2.261 seconds

hive> show tables;
OK
employee
Time taken: 0.046 seconds, Fetched: 1 row(s)

hive> desc employee;
OK
name          string
id            int
sal           float
designation   string
Time taken: 0.096 seconds, Fetched: 4 row(s)
Load data from the local file system to HDFS, then load that data to HIVE

 [cloudera@quickstart ~]$ hdfs dfs -mkdir /bigdata
 [cloudera@quickstart ~]$ hdfs dfs -ls /
Found 7 items
drwxrwxrwx   - hdfs     supergroup   0 2017-10-23 10:29 /benchmarks
drwxr-xr-x   - cloudera supergroup   0 2023-04-14 21:41 /bigdata
drwxr-xr-x   - hbase    supergroup   0 2023-04-14 20:52 /hbase
drwxr-xr-x   - solr     solr         0 2017-10-23 10:32 /solr
drwxrwxrwt   - hdfs     supergroup   0 2023-03-03 21:03 /tmp
drwxr-xr-x   - hdfs     supergroup   0 2017-10-23 10:31 /user
drwxr-xr-x   - hdfs     supergroup   0 2017-10-23 10:31 /var

 [cloudera@quickstart ~]$ hdfs dfs -put /home/cloudera/Desktop/sample.txt /bigdata
 [cloudera@quickstart ~]$ hdfs dfs -ls /bigdata
Found 1 items
-rw-r--r--   1 cloudera supergroup   59 2023-04-14 21:42 /bigdata/sample.txt

LOAD DATA TO HIVE

hive> load data inpath '/bigdata/sample.txt' into table employee;
Loading data to table default.employee
Table default.employee stats: [numFiles=1, totalSize=59]
OK
Time taken: 0.803 seconds
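Because the table was declared with fields terminated by ',', sample.txt is expected to hold comma-separated lines such as lucky,1,123.0,asst.prof (inferred from the SELECT output shown below, not copied from the actual file). The same file can also be read directly from PySpark; a minimal sketch, assuming a copy of the file is still available at /bigdata/sample.txt (note that LOAD DATA INPATH moves the original into the Hive warehouse).

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("employee-csv-sketch").getOrCreate()

# Schema mirrors the Hive table: name string, id int, sal float, designation string
df = (spark.read
      .option("inferSchema", "true")
      .csv("/bigdata/sample.txt")
      .toDF("name", "id", "sal", "designation"))

df.show()
df.printSchema()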
Altering and Dropping Tables

 hive> create table student(id int, name string);
OK
Time taken: 0.483 seconds

 hive> show tables;
OK
employee
student
Time taken: 0.085 seconds, Fetched: 2 row(s)

 hive> alter table student rename to cse;
OK
Time taken: 0.118 seconds

 hive> show tables;
OK
cse
employee
Time taken: 0.024 seconds, Fetched: 2 row(s)

 hive> select * from employee;
OK
lucky   1   123.0   asst.prof
veda    2   456.0   child
abhi    3   789.0   business
Time taken: 0.933 seconds, Fetched: 3 row(s)
1. hive> ALTER TABLE cse ADD COLUMNS (col INT);
2. hive> ALTER TABLE HIVE_TABLE ADD COLUMNS (col1 INT COMMENT 'a comment');
3. hive> ALTER TABLE HIVE_TABLE REPLACE COLUMNS (col2 INT, weight STRING, baz INT COMMENT 'baz replaces new_col1');
Relational Operators

 hive> select avg(sal) from employee;
Query ID = cloudera_20230414221212_7d3c1250-d45b-4441-88e7-9127d8f389c2
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1681530689884_0001, Tracking URL =
http://quickstart.cloudera:8088/proxy/application_1681530689884_0001/
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1681530689884_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2023-04-14 22:12:43,864 Stage-1 map = 0%, reduce = 0%
2023-04-14 22:12:51,667 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.98 sec
2023-04-14 22:13:02,273 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.11 sec
MapReduce Total cumulative CPU time: 2 seconds 110 msec
Ended Job = job_1681530689884_0001
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1  Cumulative CPU: 2.11 sec  HDFS Read: 8237  HDFS Write: 6  SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 110 msec
OK
456.0
Time taken: 31.393 seconds, Fetched: 1 row(s)

hive> select min(sal) from employee;
...
SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 740 msec
OK
789999.0
Time taken: 19.602 seconds, Fetched: 1 row(s)

hive> select max(sal) from employee;
...
SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 710 msec
OK
4623578.0
Time taken: 19.673 seconds, Fetched: 1 row(s)

 hive> SELECT * FROM employee WHERE Salary >= 800000;
OK
hpriya   345    4623578.0    manager
hddsj    2564   33522323.0   svbahdg
Time taken: 0.763 seconds, Fetched: 2 row(s)
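The same aggregations and filtration can be run from PySpark. A minimal sketch, assuming the employee table is visible to Spark through the Hive metastore:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hive-agg-sketch").enableHiveSupport().getOrCreate()

emp = spark.table("employee")

# avg / min / max of the sal column, computed in one pass
emp.agg(F.avg("sal").alias("avg_sal"),
        F.min("sal").alias("min_sal"),
        F.max("sal").alias("max_sal")).show()

# Filtration, equivalent to the WHERE query above
emp.filter(emp.sal >= 800000).show()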
EXPERIMENT-9:

(i) PySpark Definition (Apache PySpark) and difference between PySpark, Scala, pandas

What is PySpark?

Apache Spark is written in the Scala programming language. PySpark has been released in order to support the collaboration of Apache Spark and Python; it actually is a Python API for Spark. In addition, PySpark helps you interface with Resilient Distributed Datasets (RDDs) in Apache Spark and the Python programming language. This has been achieved by taking advantage of the Py4j library.

(ii) PySpark files and class methods

(i) get(filename)

(ii) getrootdirectory()

PySpark provides the facility to upload your files using sc.addFile. We can also get the path of the working directory using SparkFiles.get. Moreover, to resolve the paths to the files added through SparkContext.addFile(), the following class methods are available in SparkFiles:

o get(filename)
o getrootdirectory()

Note: SparkFiles only contains class methods that can be used as required. Users should not create SparkFiles instances.

Let's learn about the class methods in detail.
Class Methods of PySpark SparkFiles

o get(filename)

get(filename) specifies the path of a file which is added through SparkContext.addFile().

import os

class SparkFiles(object):
    """
    Resolves paths to files added through
    L{SparkContext.addFile()<pyspark.context.SparkContext.addFile>}.

    SparkFiles contains only classmethods; users should not create SparkFiles instances.
    """
    _root_directory = None
    _is_running_on_worker = False
    _sc = None

    def __init__(self):
        raise NotImplementedError("Do not construct SparkFiles objects")

    @classmethod
    def get(cls, filename):
        """
        Get the absolute path of a file added through C{SparkContext.addFile()}.
        """
        path = os.path.join(SparkFiles.getRootDirectory(), filename)
        return os.path.abspath(path)

    @classmethod
    def getRootDirectory(cls):
        ...
o getrootdirectory()

This class method specifies the path to the root directory. Basically, it contains the whole set of files added through SparkContext.addFile().

import os

class SparkFiles(object):
    """
    Resolves paths to files added through
    L{SparkContext.addFile()<pyspark.context.SparkContext.addFile>}.

    SparkFiles contains only classmethods; users should not create SparkFiles instances.
    """
    _root_directory = None
    _is_running_on_worker = False
    _sc = None

    def __init__(self):
        raise NotImplementedError("Do not construct SparkFiles objects")

    @classmethod
    def get(cls, filename):
        ...

    @classmethod
    def getRootDirectory(cls):
        """
        Get the root directory which contains files added through C{SparkContext.addFile()}.
        """
        if cls._is_running_on_worker:
            return cls._root_directory
        else:
            # This will have to change if we support multiple SparkContexts:
            return cls._sc._jvm.spark.SparkFiles.getRootDirectory()
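A short usage sketch of these two class methods; the file path /home/cloudera/sample.txt here is only an assumed example, not part of the lab output.

from pyspark import SparkContext, SparkFiles

sc = SparkContext("local", "SparkFiles demo")

# Ship a local file to every executor (assumed example path)
sc.addFile("/home/cloudera/sample.txt")

# Resolve where Spark placed the file on this node
print("Absolute path  : %s" % SparkFiles.get("sample.txt"))
print("Root directory : %s" % SparkFiles.getRootDirectory())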
EXPERIMENT-10:

Pyspark - RDD's

(i) What is an RDD?
(ii) Ways to create an RDD
     (i) parallelized collections
     (ii) external dataset
     (iii) existing RDDs
(iv) Spark RDD operations (count(), foreach(), collect(), join(), cache())

 PySpark is the Python API for Apache Spark, an open source, distributed computing framework and set of libraries for real-time, large-scale data processing. If you're already familiar with Python and libraries such as Pandas, then PySpark is a good language to learn to create more scalable analyses and pipelines.
 PySpark has been released in order to support the collaboration of Apache Spark and Python; it actually is a Python API for Spark. In addition, PySpark helps you interface with Resilient Distributed Datasets (RDDs) in Apache Spark and the Python programming language.
 PySpark is a commonly used tool to build ETL pipelines for large datasets.
 PySpark supports all basic data transformation features like sorting, mapping, joins, operations, etc.
 PySpark supports most of Spark's features such as Spark SQL, DataFrame, Streaming, MLlib (Machine Learning) and Spark Core. {Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrame and can also act as a distributed SQL query engine.}
 PySpark can create distributed datasets from any storage source supported by Hadoop, including our local file system, HDFS, Cassandra, HBase, Amazon S3, etc.

 What is an RDD?
RDD stands for Resilient Distributed Dataset; these are the elements that run and operate on multiple nodes to do parallel processing on a cluster. RDDs are immutable elements, which means once you create an RDD you cannot change it. RDDs are fault tolerant as well, hence in case of any failure, they recover automatically. You can apply multiple operations on these RDDs to achieve a certain task.

 Ways to create an RDD
There are the following ways to create an RDD in Spark: 1. using a parallelized collection, 2. from an existing Apache Spark RDD, and 3. from external datasets.
 To create an RDD using parallelize()
The parallelize() method of the spark context is used to create a Resilient Distributed Dataset (RDD) from an iterable or a collection.

Syntax

sparkContext.parallelize(iterable, numSlices)

Parameters

 iterable: This is an iterable or a collection from which an RDD has to be created.
 numSlices: This is an optional parameter that indicates the number of slices to cut the RDD into. The number of slices can be provided manually by setting this parameter. Otherwise, Spark will set this to the default parallelism that is inferred from the cluster.

This method returns an RDD.
Code example

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('educative-answers').config("spark.some.config.option", "some-value").getOrCreate()

collection = [("James", "Smith", "USA", "CA"),
              ("Michael", "Rose", "USA", "NY"),
              ("Robert", "Williams", "USA", "CA"),
              ("Maria", "Jones", "USA", "FL")]

sc = spark.sparkContext

rdd = sc.parallelize(collection)

rdd_elements = rdd.collect()

print("RDD with default slices - ", rdd_elements)
print("Number of partitions - ", rdd.getNumPartitions())

print("-" * 8)

numSlices = 8

rdd = sc.parallelize(collection, numSlices)

rdd_elements = rdd.collect()

print("RDD with default slices - ", rdd_elements)
print("Number of partitions - ", rdd.getNumPartitions())

Output:

RDD with default slices - [('James', 'Smith', 'USA', 'CA'), ('Michael', 'Rose', 'USA', 'NY'), ('Robert', 'Williams', 'USA', 'CA'), ('Maria', 'Jones', 'USA', 'FL')]
Number of partitions - 4
--------
RDD with default slices - [('James', 'Smith', 'USA', 'CA'), ('Michael', 'Rose', 'USA', 'NY'), ('Robert', 'Williams', 'USA', 'CA'), ('Maria', 'Jones', 'USA', 'FL')]
Number of partitions - 8
 From external datasets (referencing a dataset in an external storage system)

Spark can create RDDs from any storage source supported by Hadoop, including our local file system. Apache Spark supports sequence files, text files, and any other Hadoop input format.

We can create text file RDDs with the spark context's textFile method. This method uses the URL for the file (either a local path on the machine or a hdfs://, s3n://, etc. URL). It reads the whole file as a collection of lines.

Always be careful that the path on the local system and on the worker node should be the same. The file should be available at the same place in the local file system and on the worker node. We can copy the file to the worker nodes, or use a network-mounted shared file system.

To load a dataset from an external storage system, we can use the DataFrameReader interface. External storage systems include file systems and key-value stores. It supports many file formats like:

 csv (String path)
 json (String path)
 textFile (String path)
csv (String path)
Reads a CSV file and returns a Dataset<Row> as a result.

Example:

import org.apache.spark.sql.SparkSession

object DataFormat {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("ExtDataEx1").master("local").getOrCreate()
    val dataRDD = spark.read.csv("path/of/csv/file").rdd
  }
}
json (String path)
Reads a JSON file (one object per line) and returns a Dataset<Row> as a result.

Example:

val dataRDD = spark.read.json("path/of/json/file").rdd

textFile (String path)
Reads a text file and returns a Dataset of String as a result.

Example:

val dataRDD = spark.read.textFile("path/of/text/file").rdd
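The Scala snippets above have direct PySpark equivalents. A minimal sketch (the file paths are placeholders, as in the original examples):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ExtDataEx1").master("local").getOrCreate()

# Each reader returns a DataFrame; .rdd exposes the underlying RDD of Row objects
csv_rdd  = spark.read.csv("path/of/csv/file").rdd
json_rdd = spark.read.json("path/of/json/file").rdd
text_rdd = spark.read.text("path/of/text/file").rdd

# Plain-text lines can also be loaded directly as an RDD of strings
lines = spark.sparkContext.textFile("path/of/text/file")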
 From existing Apache Spark RDDs

The process of creating another dataset from the existing ones means transformation. As a result, a transformation always produces a new RDD. As RDDs are immutable, no changes take place in the original once it is created. This property maintains consistency over the cluster.

Example:

val words = spark.sparkContext.parallelize(Seq("sun", "rises", "in", "the", "east", "and", "sets", "in", "the", "west"))
val wordPair = words.map(w => (w.charAt(0), w))
wordPair.foreach(println)
 RDD operations

To apply operations on these RDDs, there are two ways −

 Transformation
 Action

Transformation − These are the operations which are applied on an RDD to create a new RDD. filter, groupBy and map are examples of transformations (see the short sketch after this paragraph).

Action − These are the operations that are applied on an RDD, which instruct Spark to perform computation and send the result back to the driver.
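A minimal PySpark illustration of the transformation/action distinction; the numbers here are made up for the example.

from pyspark import SparkContext

sc = SparkContext("local", "transform vs action")

nums = sc.parallelize([1, 2, 3, 4, 5, 6])

# Transformations: lazily build new RDDs, nothing runs yet
evens   = nums.filter(lambda x: x % 2 == 0)
squares = evens.map(lambda x: x * x)

# Action: triggers the computation and returns results to the driver
print(squares.collect())   # [4, 16, 36]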
To apply any operation in PySpark, we need to create a PySpark RDD first. The following code block has the detail of the PySpark RDD class −

class pyspark.RDD (
   jrdd,
   ctx,
   jrdd_deserializer = AutoBatchedSerializer(PickleSerializer())
)
A few operations on words

count()

The number of elements in the RDD is returned.

count.py
from pyspark import SparkContext
sc = SparkContext("local", "count app")
words = sc.parallelize (
   ["scala",
   "java",
   "hadoop",
   "spark",
   "akka",
   "spark vs hadoop",
   "pyspark",
   "pyspark and spark"]
)
counts = words.count()
print "Number of elements in RDD -> %i" % (counts)

Command − The command for count() is −

$SPARK_HOME/bin/spark-submit count.py

Output − The output for the above command is −

Number of elements in RDD -> 8
collect()

All the elements in the RDD are returned.

collect.py
from pyspark import SparkContext
sc = SparkContext("local", "Collect app")
words = sc.parallelize (
   ["scala",
   "java",
   "hadoop",
   "spark",
   "akka",
   "spark vs hadoop",
   "pyspark",
   "pyspark and spark"]
)
coll = words.collect()
print "Elements in RDD -> %s" % (coll)

Command − The command for collect() is −

$SPARK_HOME/bin/spark-submit collect.py

Output − The output for the above command is −

Elements in RDD -> [
   'scala',
   'java',
   'hadoop',
   'spark',
   'akka',
   'spark vs hadoop',
   'pyspark',
   'pyspark and spark'
]
foreach(f)

foreach(f) applies the function f to each element of the RDD; it does not return a result to the driver. In the following example, we call a print function in foreach, which prints all the elements in the RDD.

foreach.py
from pyspark import SparkContext
sc = SparkContext("local", "ForEach app")
words = sc.parallelize (
   ["scala",
   "java",
   "hadoop",
   "spark",
   "akka",
   "spark vs hadoop",
   "pyspark",
   "pyspark and spark"]
)
def f(x): print(x)
fore = words.foreach(f)

Command − The command for foreach(f) is −

$SPARK_HOME/bin/spark-submit foreach.py

Output − The output for the above command is −

scala
java
hadoop
spark
akka
spark vs hadoop
pyspark
pyspark and spark
join(other, numPartitions = None)

It returns an RDD with pairs of elements with the matching keys and all the values for each such key. In the following example, there are two pairs of elements in two different RDDs. After joining these two RDDs, we get an RDD with elements having matching keys and their values.

join.py
from pyspark import SparkContext
sc = SparkContext("local", "Join app")
x = sc.parallelize([("spark", 1), ("hadoop", 4)])
y = sc.parallelize([("spark", 2), ("hadoop", 5)])
joined = x.join(y)
final = joined.collect()
print "Join RDD -> %s" % (final)

Command − The command for join(other, numPartitions = None) is −

$SPARK_HOME/bin/spark-submit join.py

Output − The output for the above command is −

Join RDD -> [
   ('spark', (1, 2)),
   ('hadoop', (4, 5))
]
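For keys that exist in only one of the two RDDs, the outer-join variants can be used instead of join. A small sketch (not part of the lab output above):

from pyspark import SparkContext

sc = SparkContext("local", "Outer join sketch")

x = sc.parallelize([("spark", 1), ("hadoop", 4), ("hive", 7)])
y = sc.parallelize([("spark", 2), ("hadoop", 5)])

# Keeps every key from x; missing values from y appear as None
print(x.leftOuterJoin(y).collect())
# e.g. [('spark', (1, 2)), ('hadoop', (4, 5)), ('hive', (7, None))] (ordering may vary)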
cache()

Persist this RDD with the default storage level (MEMORY_ONLY). You can also check if the RDD is cached or not.

cache.py
from pyspark import SparkContext
sc = SparkContext("local", "Cache app")
words = sc.parallelize (
   ["scala",
   "java",
   "hadoop",
   "spark",
   "akka",
   "spark vs hadoop",
   "pyspark",
   "pyspark and spark"]
)
words.cache()
caching = words.persist().is_cached
print "Words got cached -> %s" % (caching)

Command − The command for cache() is −

$SPARK_HOME/bin/spark-submit cache.py

Output − The output for the above program is −

Words got cached -> True
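Caching can also be inspected and undone. A short sketch that continues from the words RDD in cache.py above (the printed storage level is illustrative, not lab output):

# Continuing from cache.py above
print(words.getStorageLevel())   # e.g. Memory Serialized 1x Replicated
words.unpersist()                # releases the cached partitions
print(words.is_cached)           # False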