C21053 Jay Vijay Karwatkar - Big Data Analytics & Visualization

INDEX

(Columns: Sr. No. | Name of the Experiment | Date | Signature of Subject Teacher | CO)

1. HDFS: List of Commands (mkdir, touchz, copy from local/put, copy to local/get,
move from local, cp, rmr, du, dus, stat)
Date: 23/09/2022 | CO1

2. Map Reduce:
1. Write a program in Map Reduce for WordCount operation.
2. Write a program in Map Reduce for Union operation.
3. Write a program in Map Reduce for Intersection operation.
4. Write a program in Map Reduce for Grouping and Aggregation.
5. Write a program in Map Reduce for Matrix Multiplication.
Date: 12/10/2022 | CO2

3. MongoDB:
1. Installation
2. Sample Database Creation
3. Query the Sample Database using MongoDB querying commands
a. Create Collection
b. Insert Document
c. Query Document
d. Delete Document
e. Indexing
Date: 19/10/2022 | CO3

4. Pig:
1. Pig Latin Basic
2. Pig Data Types
3. Download the data
4. Create your Script
5. Save and Execute the Script
6. Pig Operations: Diagnostic Operators, Grouping and Joining, Combining &
Splitting, Filtering, Sorting
Date: 09/11/2022 | CO4

5. HiveQL: Select Where, Select OrderBy, Select GroupBy, Select Joins
Hive:
1. Hive Data Types
2. Create Database & Table in Hive
3. Hive Partitioning
4. Hive Built-In Operators
5. Hive Built-In Functions
6. Hive Views and Indexes
Date: 11/11/2022 | CO4

6. Spark:
1. Downloading Data Set and Processing it in Spark
2. Word Count in Apache Spark
Date: 14/11/2022 | CO5

7. Visualization using Tableau: Tool Overview, Importing Data, Analyzing with Charts
Date: 14/11/2022 | CO6
Experiment Number 1

HDFS: List of Commands (mkdir, touchz, copy from local/put, copy to local/get, move from local, cp, rmr,
du, dus, stat)

Write down the use of command and screen shot of commands

Hadoop Practical
To use the HDFS commands, we first need to start the Hadoop services by using this
command:

start-all.sh

To check whether the Hadoop services are up and running, use this command: jps

jps stands for Java Virtual Machine Process Status tool.
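
On a typical single-node installation, the jps output looks something like the following
(the process IDs will differ on every machine):

12361 NameNode
12654 DataNode
12982 SecondaryNameNode
13218 ResourceManager
13527 NodeManager
13846 Jps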

Commands:

1. mkdir: To create a directory. In HDFS there is no home directory by default.


Syntax : hdfs (space) dfs (space) -mkdir (space) /<folder name>
Create a directory by following the syntax; here it is named C21004_Hadoop1.
Once the directory is created, we can use the ls command to check whether it exists.
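
For example, following the above syntax with the directory name used in this practical
(an illustrative run):

hdfs dfs -mkdir /C21004_Hadoop1
hdfs dfs -ls /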

2. ls : This command is used to list all the files. It is useful when we want to see the
hierarchy of a folder.
Syntax : hdfs (space) dfs (space) -ls (space) /<path>
It will print all the directories present in HDFS. The bin directory contains executables,
so bin/hdfs means we want the executables of hdfs, particularly the dfs (Distributed
File System) commands.

The ls command will list the files in C21004_Hadoop and C21004_Hadoop1


respectively.

3. touchz: It creates an empty file.


Syntax: hdfs (space) dfs (space) -touchz (space) /<file path>
Using the ls command, we can list the files in the C21004_Hadoop1 directory.

4. cat: To print file contents.


Syntax: hdfs (space) dfs (space) -cat (space) /<file path>
To print the contents of the Example.txt file in the C21004_Hadoop directory.

5. copyFromLocal or put: To copy files/folders from the local file system to HDFS


storage. This is the most important command. Local file system means the files
present on the OS.
Syntax: hdfs (space) dfs (space) -copyFromLocal or -put (space) <local file path>
(space) <destination (present on hdfs)>
Copying the file Example1.txt from the local folder to the destination folder: the
local file needs to be copied into the C21004_Hadoop directory.

List the files in C21004_Hadoop using ls command.
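
For example (the local path shown here is illustrative):

hdfs dfs -copyFromLocal /home/cloudera/Example1.txt /C21004_Hadoop
hdfs dfs -ls /C21004_Hadoop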

6. copyToLocal (or) get: To copy files/folders from HDFS storage to the local file system.
Syntax: hdfs (space) dfs (space) -copyToLocal or -get (space) <source file (on
hdfs)> (space) <local file destination>
Copying the mrudula.txt file from the Hadoop file system to the local system.

From HDFS, the file


gets copied.

7. moveFromLocal: This command will move a file from the local file system to HDFS.


Syntax: hdfs (space) dfs (space) -moveFromLocal (space) <local source file>
(space) <destination HDFS>
Moving file.txt from the 21004 folder to the C21004_Hadoop directory. Using the ls
command, file.txt can be seen.

8. cp: This command is used to copy files within HDFS.


Syntax: hdfs (space) dfs (space) -cp (space) <source HDFS> (space) <destination
HDFS>
Copying the file file.txt from the C21004_Hadoop folder to the C21004_Hadoop1 folder.
List the files using the ls command in the respective folders.

From C21004_Hadoop

9. mv: This command is used to move files within HDFS.


Syntax : hdfs (space) dfs (space) -mv (space) <source HDFS> (space) <Destination
HDFS>
From the C21004_Hadoop folder, the Example1.txt file gets moved to C21004_Hadoop1.
Using the ls command on the respective folders, we can observe that Example1.txt has
moved to the C21004_Hadoop1 folder.

10. rm -r: This command deletes a file or directory from HDFS recursively. It is a very
useful command when we need to delete a non-empty directory.
Syntax : hdfs (space) dfs (space) -rm (space) -r (space) <file name / file directory
name>
Creating another directory C21004_Hadoop2, listing the directory names using the ls
command, and then deleting the directory which was created
(C21004_Hadoop2).
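
For example (an illustrative run):

hdfs dfs -mkdir /C21004_Hadoop2
hdfs dfs -ls /
hdfs dfs -rm -r /C21004_Hadoop2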

11. du: It will give the size of each file in a directory.


Syntax: hdfs (space) dfs (space) -du (space) <file directory name>
Giving the size of each file in C21004_Hadoop and C21004_Hadoop1.

12. -du -s: This command will give the total size of directory/file.
Syntax: hdfs(space) dfs (space) -du (space) -s(space) <file directory name>

13. stat: It will give the last modified time of a directory or path. In short, it gives the
stats of the directory or file.
Syntax : hdfs (space) dfs (space) -stat (space) <HDFS file name or directory name>
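
For example (an illustrative run; "%y" is the built-in format option that prints only the
modification time):

hdfs dfs -stat /C21004_Hadoop
hdfs dfs -stat "%y" /C21004_Hadoop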
Experiment Number 2
Map Reduce:
1. Write a program in Map Reduce for WordCount operation.
2. Write a program in Map Reduce for Union operation.
3. Write a program in Map Reduce for Intersection operation.
4. Write a program in Map Reduce for Grouping and Aggregation.
5. Write a program in Map Reduce for Matrix Multiplication

1. Write the following Map Reduce program to understand Map Reduce Paradigm.
(Create your own .txt file)

a. WordCount

PROGRAM:
WordDriver.java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;

public class WordDriver {


public static void main(String[] args) throws Exception {
if (args.length != 2) {
System.out.printf("Usage: WordCount <input dir> <output dir>\n");
System.exit(-1);
}
Job job = new Job();
job.setJarByClass(WordDriver.class);
job.setJobName("Word Count");
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));

job.setMapperClass(WordMapper.class);
job.setReducerClass(SumReducer.class);

job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);

job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

boolean success = job.waitForCompletion(true);


System.exit(success ? 0 : 1);

}
}

WordMapper.java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {


@Override
public void map(LongWritable key, Text value, Context context) throws IOException,
InterruptedException {
String line = value.toString();
for (String word : line.split("\\W+")) {
if (word.length() > 0) {
context.write(new Text(word), new IntWritable(1));
}
}
}
}

SumReducer.java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {


@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException,
InterruptedException {
int wordCount = 0;
for (IntWritable value : values) {
wordCount += value.get();
}
context.write(key, new IntWritable(wordCount));
}
}
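
One way to compile and run this job from the command line (class, jar, and path names
here are illustrative; it assumes the input .txt file has already been copied to HDFS):

javac -classpath $(hadoop classpath) -d wordcount_classes WordDriver.java WordMapper.java SumReducer.java
jar -cvf wordcount.jar -C wordcount_classes/ .
hadoop jar wordcount.jar WordDriver /C21004_Hadoop/Example.txt /wordcount_output
hdfs dfs -cat /wordcount_output/part-r-00000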
OUTPUT:
b. Average WordCount

PROGRAM:
WordDriver.java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;

public class WordDriver {

public static void main(String[] args) throws Exception {


if(args.length != 2) {
System.out.printf("Usage: WordCount<input><output dir>\n");
System.exit(-1);
}
Job job = new Job();

job.setJarByClass(WordDriver.class);
job.setJobName("Word Count");
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));

job.setMapperClass(WordMapper.class);
job.setReducerClass(SumReducer.class);

job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(FloatWritable.class);

job.setOutputKeyClass(Text.class);
job.setOutputValueClass(FloatWritable.class);

boolean success = job.waitForCompletion(true);


System.exit(success ? 0 : 1);
}
}

WordMapper.java
import java.io.IOException;

import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordMapper extends Mapper<LongWritable, Text, Text, FloatWritable> {


@Override
public void map(LongWritable key, Text value, Context context) throws IOException,
InterruptedException {
String line = value.toString();
for (String word : line.split("\\W+")) {
if(word.length() > 0) {
context.write(new Text(word), new FloatWritable(1));
}
}
}
}

SumReducer.java
import java.io.IOException;

import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, FloatWritable,Text, FloatWritable> {


//private FloatWritable result = new FloatWritable();

Float average = 0f;


Float count = 0f;
Float sum = 0f;

public void reduce(Text key, Iterable<FloatWritable> values, Context context) throws


IOException, InterruptedException {

//Text sumText = new Text("average");

for(FloatWritable val : values) {


sum += val.get();
}

count += 1;
average = sum/count;

//result.set(average);

context.write(key, new FloatWritable(average));


}
}
OUTPUT:
2. Amazon collects item sold data every hour at many locations across the globe and
gathers a large volume of log data, which is a good candidate for analysis with Map
Reduce, since it is semi-structured and record-oriented. Write a Map Reduce program
that sorts unit sold data. (Refer to the ordered_unitsold_data.txt file.)

PROGRAM:
ValueSortExp.java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class ValueSortExp {


public static void main(String[] args) throws Exception {

Path inputPath = new Path(args[0]);


Path outputDir = new Path(args[1]);

// Create configuration
Configuration conf = new Configuration(true);

// Create job
Job job = new Job(conf, "Test HIVE Command");
job.setJarByClass(ValueSortExp.class);

// Setup MapReduce
job.setMapperClass(ValueSortExp.MapTask.class);
job.setReducerClass(ValueSortExp.ReduceTask.class);
job.setNumReduceTasks(1);

// Specify key / value


job.setMapOutputKeyClass(IntWritable.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(IntWritable.class);
// job.setSortComparatorClass(IntComparator.class);
// Input
FileInputFormat.addInputPath(job, inputPath);
job.setInputFormatClass(TextInputFormat.class);

// Output
FileOutputFormat.setOutputPath(job, outputDir);
job.setOutputFormatClass(TextOutputFormat.class);

int code = job.waitForCompletion(true) ? 0 : 1;


System.exit(code);
}

public static class MapTask extends Mapper<LongWritable, Text, IntWritable, IntWritable> {


public void map(LongWritable key, Text value, Context context) throws java.io.IOException,
InterruptedException {
String line = value.toString();
String[] tokens = line.split(","); // This is the delimiter between
int keypart = Integer.parseInt(tokens[0]);
int valuePart = Integer.parseInt(tokens[1]);
context.write(new IntWritable(valuePart), new IntWritable(keypart));

}
}

public static class ReduceTask extends Reducer<IntWritable, IntWritable, IntWritable,


IntWritable> {
public void reduce(IntWritable key, Iterable<IntWritable> list, Context context) throws
java.io.IOException, InterruptedException {

for (IntWritable value : list) {


context.write(value,key);

}
}
}
}
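
Each line of the input file is expected to be a comma-separated key,value pair, for example
(illustrative values):

101,25
102,7
103,42

A run of the job would then look something like this (jar and path names are illustrative):

hadoop jar valuesort.jar ValueSortExp /unitsold/ordered_unitsold_data.txt /unitsold_sorted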

OUTPUT:
3. Implement matrix multiplication with Hadoop Map Reduce

PROGRAM:
MatrixMultiply.java
import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class MatrixMultiply {

public static void main(String[] args) throws Exception {


if (args.length != 2) {
System.err.println("Usage: MatrixMultiply <in_dir> <out_dir>");
System.exit(2);
}
Configuration conf = new Configuration();
// M is an m-by-n matrix; N is an n-by-p matrix.
conf.set("m", "1000");
conf.set("n", "100");
conf.set("p", "1000");

@SuppressWarnings("deprecation")
Job job = new Job(conf, "MatrixMultiply");
job.setJarByClass(MatrixMultiply.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);

job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);

job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);

FileInputFormat.addInputPath(job, new Path(args[0]));


FileOutputFormat.setOutputPath(job, new Path(args[1]));

job.waitForCompletion(true);

}
}

Map.java
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class Map extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, Text>


{
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
Configuration conf = context.getConfiguration();
int m = Integer.parseInt(conf.get("m"));
int p = Integer.parseInt(conf.get("p"));
String line = value.toString();
// (M, i, j, Mij);
String[] indicesAndValue = line.split(",");
Text outputKey = new Text();
Text outputValue = new Text();
if (indicesAndValue[0].equals("M")) {
for (int k = 0; k < p; k++) {
outputKey.set(indicesAndValue[1] + "," + k);
// outputKey.set(i,k);
outputValue.set(indicesAndValue[0] + "," + indicesAndValue[2]
+ "," + indicesAndValue[3]);
// outputValue.set(M,j,Mij);
context.write(outputKey, outputValue);
}
}
else {
// (N, j, k, Njk);
for (int i = 0; i < m; i++) {
outputKey.set(i + "," + indicesAndValue[2]);
outputValue.set("N," + indicesAndValue[1] + "," + indicesAndValue[3]);
context.write(outputKey, outputValue);
}
}
}
}

Reduce.java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
import java.util.HashMap;

public class Reduce extends org.apache.hadoop.mapreduce.Reducer<Text, Text, Text, Text> {


@Override
public void reduce(Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
String[] value;
//key=(i,k),
//Values = [(M/N,j,V/W),..]
HashMap<Integer, Float> hashA = new HashMap<Integer, Float>();
HashMap<Integer, Float> hashB = new HashMap<Integer, Float>();
for (Text val : values) {
value = val.toString().split(",");
if (value[0].equals("M")) {
hashA.put(Integer.parseInt(value[1]), Float.parseFloat(value[2]));
} else {
hashB.put(Integer.parseInt(value[1]), Float.parseFloat(value[2]));
}
}
int n = Integer.parseInt(context.getConfiguration().get("n"));
float result = 0.0f;
float m_ij;
float n_jk;
for (int j = 0; j < n; j++) {
m_ij = hashA.containsKey(j) ? hashA.get(j) : 0.0f;
n_jk = hashB.containsKey(j) ? hashB.get(j) : 0.0f;
result += m_ij * n_jk;
}
if (result != 0.0f) {
context.write(null,
new Text(key.toString() + "," + Float.toString(result)));
}
}
}
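
The mappers above expect every input line in the form matrixname,row,column,value, so a
tiny sample input would look like the following (illustrative values; the m, n and p values
set in the driver must match the real matrix dimensions):

M,0,0,2
M,0,1,3
N,0,0,4
N,1,0,5

The job can then be launched with something like:

hadoop jar matrixmultiply.jar MatrixMultiply /matrix_input /matrix_output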
OUTPUT:
Experiment Number 3
MongoDB:
1. Installation
2. Sample Database Creation
3. Query the Sample Database using MongoDB querying commands
a. Create Collection
b. Insert Document
c. Query Document
d. Delete Document
e. Indexing

ADD ALL SCREEN SHOTS OF MONGODB INSTALLATION


ADD ALL SCREEN SHOT OF MONGODB QUERY PERFORMED IN THE PRACTICAL

Installing MongoDB on Windows


1. Download MongoDB Community Server. We will install the 64-bit version for Windows.

2. Click on Setup. Once download is complete open the MSI file. Click Next in the start-up screen.
3. Accept the End-User License Agreement and click Next.
4. Click on the “complete” button to install all of the components. The custom option can be used
to install selective components or if you want to change the location of the installation.

5. Service Configuration: Select "Run service as Network Service user" --> Click Next.

6. Click on the Install button to start the installation.


7. Installation begins. Click Next once completed.

8. Once the installation is complete, click on the Finish button.


9. Go to "C:\Program Files\MongoDB\Server\4.0\bin" and double-click on mongo.exe.
Alternatively, you can also click on the MongoDB desktop item.
Practical queries MongoDB

Q1. Create Database with name College using MongoDB command prompt.

Solution: “The use Command” is used to create database. The command will create
a new database if it doesn't exist, otherwise it will return the existing database.

Syntax: use Database Name

> use college1
switched to db college1
> db.dropDatabase()
{ "ok" : 1 }
The dropDatabase() command will delete the database from MongoDB.

> use college


switched to db college

Q2. Create a MongoDB collections employee and Department under database college
Employee:
empcode INT, empfname STRING, emplname STRING, job STRING, manager STRING,
hiredate STRING, salary INT, commission INT, deptcode INT

Dept table:
deptcode INT, deptname STRING, location STRING

Solution:
The createCollection() Method:
MongoDB db.createCollection(name, options) is used to create collection.

Syntax:
Basic syntax of createCollection() command is as follows
− db.createCollection(name, options)

In the command, name is the name of the collection to be created. options is a document and is
used to specify the configuration of the collection.

Parameter | Type | Description
name | String | Name of the collection to be created
options | Document | (Optional) Specifies options about memory size and indexing

The options parameter is optional, so you need to specify only the name of the collection.
Following is the list of options you can use:

Field | Type | Description
capped | Boolean | (Optional) If true, enables a capped collection. A capped collection is a
fixed-size collection that automatically overwrites its oldest entries when it reaches its
maximum size. If you specify true, you need to specify the size parameter also.
autoIndexId | Boolean | (Optional) If true, automatically creates an index on the _id field.
Default value is false.
size | Number | (Optional) Specifies a maximum size in bytes for a capped collection. If capped
is true, then you need to specify this field also.
max | Number | (Optional) Specifies the maximum number of documents allowed in the capped
collection.

While inserting a document, MongoDB first checks the size field of the capped collection, then it
checks the max field.

Note: In MongoDB, you don't need to create collection. MongoDB creates collection
automatically, when you insert some document.
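
For example, a capped collection can be created by passing the options document described
above (the collection name and sizes here are purely illustrative):

> db.createCollection("logs", { capped : true, size : 1048576, max : 1000 })
{ "ok" : 1 }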

Now Creating collection “employee” and “department” in college Database.


> use college
switched to db college
> db.createCollection("employee")
{ "ok" : 1 }
> db.createCollection("department")
{ "ok" : 1 }
Collections are created in the college database. So, let's insert documents into the above collections.
Collections in MongoDB act like tables in an RDBMS.
Q3. Insert multiple documents in the collections’ employee and department.
Solution:
Inserting the documents in employee collection:
The insert() Method: To insert data into MongoDB collection.
Syntax: db.collection_name.insert(document)
Query: db.employee.insert({"empcode":9369,"empfname":"Tony","emplname":"Stark","job"
:"SoftwareEngineer","manager":7902,"hiredate":"1980-12-
17","salary":2800,"commission":0,"deptcode":20})
Output: WriteResult({ "nInserted" : 1 })
Here employee is a collection in the college database. If the collection doesn't exist in the database,
then MongoDB will create this collection and then insert the document into it. Note:
1. In the inserted document, if we don't specify the _id parameter, then MongoDB assigns
a unique ObjectId to this document.
2. _id is a 12-byte hexadecimal number unique for every document in a collection. The 12 bytes
are divided as follows: _id: ObjectId(4 bytes timestamp, 3 bytes machine id, 2 bytes
process id, 3 bytes incrementer).
The insertOne() method: When you need to insert only one document into a collection, you can use this method.
Syntax: db.collection_name.insertOne(document)
Query:
db.employee.insertOne({"empcode":9499,"empfname":"Tim","emplname":"Adolf","j
ob":"Salesman","manager":7698,"hiredate":"1981-
0220","salary":1600,"commission":300,"deptcode":30}) Output:
{
"acknowledged" : true,
"insertedId" : ObjectId("6354cb482e4df4ccf26c3ea4")
}
The insertMany() Method: You can insert multiple documents using the insertMany() method. You
need to pass this method an array of documents.
Query: db.employee.insertMany([
{"empcode":9566,"empfname":"Kim","emplname":"Jarvis","job":"Manager","manag
er":7839,"hiredate":"1981-04-02","salary":3570,"commission":0,"deptcode":20},
{"empcode":9654,"empfname":"Sam","emplname":"Miles","job":"Salesman","manag
er":7698,"hiredate":"1981-09-28","salary":1250,"commission":1400,"deptcode":10},
{"empcode":9782,"empfname":"Kevin","emplname":"Hill","job":"Manager","manag
er":7839,"hiredate":"1981-06-09","salary":2940,"commission":0,"deptcode":10},
{"empcode":9788,"empfname":"Connie","emplname":"Smith","job":"Analyst","mana
ger":7566,"hiredate":"1982-12-09","salary":3000,"commission":0,"deptcode":20}]) Output:
{
"acknowledged" : true,
"insertedIds" : [
ObjectId("6354d1ce2e4df4ccf26c3ea5"),
ObjectId("6354d1ce2e4df4ccf26c3ea6"),
ObjectId("6354d1ce2e4df4ccf26c3ea7"),
ObjectId("6354d1ce2e4df4ccf26c3ea8")
]
}
Query: db.employee.insertMany([
{"empcode":9839,"empfname":"Alfred","emplname":"Kinsley","job":"President","m
anager":7566,"hiredate":"1981-11-17","salary":5000,"commission":0,"deptcode":10},
{"empcode":9844,"empfname":"Paul","emplname":"Timothy","job":"Salesman","ma
nager":7698,"hiredate":"1981-09-08","salary":1500,"commission":0,"deptcode":30},
{"empcode":9876,"empfname":"John","emplname":"Asghar","job":"Software
Engineer","manager":7788,"hiredate":"1983-0112","salary":3100,"commission":0,"deptcode":20},
{"empcode":9900,"empfname":"Rose","emplname":"Summers","job":"Technical
Lead","manager":7698,"hiredate":"1981-1203","salary":3300,"commission":0,"deptcode":20}
])
Output:
{
"acknowledged" : true,
"insertedIds" : [
ObjectId("6354e0bf2e4df4ccf26c3ea9"), ObjectId("6354e0bf2e4df4ccf26c3eaa"),
ObjectId("6354e0bf2e4df4ccf26c3eab"),
ObjectId("6354e0bf2e4df4ccf26c3eac")
]
} Query:
db.employee.insertMany([
{"empcode":9902,"empfname":"Andrew","emplname":"Faulkner","job":"Analyst","
manager":7566,"hiredate":"1981-12-
03","salary":3000,"commission":0,"deptcode":10},
{"empcode":9934,"empfname":"Karen","emplname":"Matthews","job":"Software
Engineer","manager":7782,"hiredate":"1982-0123","salary":3300,"commission":1400,"deptcode":20},
{"empcode":9591,"empfname":"Wendy","emplname":"Shawn","job":"Salesman","m
anager":7698,"hiredate":"1981-02-22","salary":500,"commission":0,"deptcode":30},
{"empcode":9698,"empfname":"Bella","emplname":"Swan","job":"Manager","manag
er":7839,"hiredate":"1982-05-01","salary":3420,"commission":0,"deptcode":30} ]) Output:

{
"acknowledged" : true,
"insertedIds" : [
ObjectId("6354e0c02e4df4ccf26c3ead"),
ObjectId("6354e0c02e4df4ccf26c3eae"),
ObjectId("6354e0c02e4df4ccf26c3eaf"),
ObjectId("6354e0c02e4df4ccf26c3eb0")
]
}
Employee Data:
db.employee.insert({"empcode":9369,"empfname":"Tony","emplname":"Stark","job"
:"Software Engineer","manager":7902,"hiredate":"1980-12-
17","salary":2800,"commission":0,"deptcode":20})
db.employee.insertOne({"empcode":9499,"empfname":"Tim","emplname":"Adolf","j
ob":"Salesman","manager":7698,"hiredate":"1981-
0220","salary":1600,"commission":300,"deptcode":30})

db.employee.insertMany([
{"empcode":9566,"empfname":"Kim","emplname":"Jarvis","job":"Manager","manag
er":7839,"hiredate":"1981-04-02","salary":3570,"commission":0,"deptcode":20},
{"empcode":9654,"empfname":"Sam","emplname":"Miles","job":"Salesman","manag
er":7698,"hiredate":"1981-09-28","salary":1250,"commission":1400,"deptcode":10},
{"empcode":9782,"empfname":"Kevin","emplname":"Hill","job":"Manager","manag
er":7839,"hiredate":"1981-06-09","salary":2940,"commission":0,"deptcode":10},
{"empcode":9788,"empfname":"Connie","emplname":"Smith","job":"Analyst","mana
ger":7566,"hiredate":"1982-12-09","salary":3000,"commission":0,"deptcode":20}
])

db.employee.insertMany([
{"empcode":9839,"empfname":"Alfred","emplname":"Kinsley","job":"President","m
anager":7566,"hiredate":"1981-11-17","salary":5000,"commission":0,"deptcode":10},
{"empcode":9844,"empfname":"Paul","emplname":"Timothy","job":"Salesman","ma
nager":7698,"hiredate":"1981-09-08","salary":1500,"commission":0,"deptcode":30},
{"empcode":9876,"empfname":"John","emplname":"Asghar","job":"Software
Engineer","manager":7788,"hiredate":"1983-0112","salary":3100,"commission":0,"deptcode":20},
{"empcode":9900,"empfname":"Rose","emplname":"Summers","job":"Technical
Lead","manager":7698,"hiredate":"1981-1203","salary":3300,"commission":0,"deptcode":20}
])

db.employee.insertMany([

{"empcode":9902,"empfname":"Andrew","emplname":"Faulkner","job":"Analyst","
manager":7566,"hiredate":"1981-12-
03","salary":3000,"commission":0,"deptcode":10},
{"empcode":9934,"empfname":"Karen","emplname":"Matthews","job":"Software
Engineer","manager":7782,"hiredate":"1982-0123","salary":3300,"commission":1400,"deptcode":20},
{"empcode":9591,"empfname":"Wendy","emplname":"Shawn","job":"Salesman","m
anager":7698,"hiredate":"1981-02-22","salary":500,"commission":0,"deptcode":30},
{"empcode":9698,"empfname":"Bella","emplname":"Swan","job":"Manager","manag
er":7839,"hiredate":"1982-05-01","salary":3420,"commission":0,"deptcode":30}
])

Department Data:
Query: db.department.insertMany([
... {"deptcode":10,"deptname":"Finance","location":"Edinburgh"},
... {"deptcode":20,"deptname":"Software","location":"Paddington"},
... {"deptcode":30,"deptname":"Sales","location":"Maidstone"},
... {"deptcode":40,"deptname":"Marketing","location":"Darlington"},
... {"deptcode":50,"deptname":"Admin","location":"Birmingham"}
... ])
Output:
{
"acknowledged" : true,
"insertedIds" : [
ObjectId("6354e3672e4df4ccf26c3eb1"),
ObjectId("6354e3672e4df4ccf26c3eb2"),
ObjectId("6354e3672e4df4ccf26c3eb3"),
ObjectId("6354e3672e4df4ccf26c3eb4"),
ObjectId("6354e3672e4df4ccf26c3eb5")

]
}

Q4. List all the documents inside a collections’ employee and department both.
Solution:
The find() method: The find() method will display all the documents in a nonstructured way.
Syntax: db.collection_name.find()
The pretty() method: To display the results in a formatted way, you can use pretty() method.
Syntax: db.collection_name.find().pretty()
Displaying employee details:
Query:
db.employee.find().pretty() Output:
{
"_id" : ObjectId("6354c7f22e4df4ccf26c3ea3"),
"empcode" : 9369,
"empfname" : "Tony",
"emplname" : "Stark",
"job" : "Software Engineer",
"manager" : 7902,
"hiredate" : "1980-12-17",
"salary" : 2800,
"commission" : 0,

"deptcode" : 20
}
{
"_id" : ObjectId("6354cb482e4df4ccf26c3ea4"),
"empcode" : 9499,
"empfname" : "Tim",
"emplname" : "Adolf",
"job" : "Salesman",
"manager" : 7698,
"hiredate" : "1981-02-20",
"salary" : 1600,
"commission" : 300,
"deptcode" : 30
}
{
"_id" : ObjectId("6354d1ce2e4df4ccf26c3ea5"),
"empcode" : 9566,
"empfname" : "Kim",
"emplname" : "Jarvis",
"job" : "Manager",
"manager" : 7839,
"hiredate" : "1981-04-02",
"salary" : 3570,
"commission" : 0,
"deptcode" : 20
"_id" : ObjectId("6354d1ce2e4df4ccf26c3ea6"),
"empcode" : 9654,

"empfname" : "Sam",
"emplname" : "Miles",
"job" : "Salesman",
"manager" : 7698,
"hiredate" : "1981-09-28",
"salary" : 1250,
"commission" : 1400,
"deptcode" : 10
}
{
"_id" : ObjectId("6354d1ce2e4df4ccf26c3ea7"),
"empcode" : 9782,
"empfname" : "Kevin",
"emplname" : "Hill",
"job" : "Manager",
"manager" : 7839,
"hiredate" : "1981-06-09",
"salary" : 2940,

"commission" : 0,
"deptcode" : 10
}
{
"_id" : ObjectId("6354d1ce2e4df4ccf26c3ea8"),
"empcode" : 9788,
"empfname" : "Connie",
"emplname" : "Smith",

"job" : "Analyst",
"manager" : 7566,
"hiredate" : "1982-12-09",
"salary" : 3000,
"commission" : 0,
"deptcode" : 20
}
{
"_id" : ObjectId("6354e0bf2e4df4ccf26c3ea9"),
"empcode" : 9839,
"empfname" : "Alfred",
"emplname" : "Kinsley",
"job" : "President",
"manager" : 7566,
"hiredate" : "1981-11-17",
"salary" : 5000,
"commission" : 0,
"deptcode" : 10

}
{
"_id" : ObjectId("6354e0bf2e4df4ccf26c3eaa"),
"empcode" : 9844,
"empfname" : "Paul",
"emplname" : "Timothy",
"job" : "Salesman",
"manager" : 7698,
"hiredate" : "1981-09-08",
"salary" : 1500,
"commission" : 0,
"deptcode" : 30
"_id" : ObjectId("6354e0bf2e4df4ccf26c3eab"),
"empcode" : 9876,
"empfname" : "John",
"emplname" : "Asghar",
"job" : "Software Engineer",
"manager" : 7788,
"hiredate" : "1983-01-12",
"salary" : 3100,
"commission" : 0,
"deptcode" : 20
}
{
"_id" : ObjectId("6354e0bf2e4df4ccf26c3eac"),
"empcode" : 9900,

"empfname" : "Rose",
"emplname" : "Summers",
"job" : "Technical Lead",
"manager" : 7698,

"hiredate" : "1981-12-03",
"salary" : 3300,
"commission" : 0,

"deptcode" : 20
}
{
"_id" : ObjectId("6354e0c02e4df4ccf26c3ead"),
"empcode" : 9902,
"empfname" : "Andrew",
"emplname" : "Faulkner",
"job" : "Analyst",
"manager" : 7566,
"hiredate" : "1981-12-03",
"salary" : 3000,
"commission" : 0,
"deptcode" : 10
}
{
"_id" : ObjectId("6354e0c02e4df4ccf26c3eae"),
"empcode" : 9934,
"empfname" : "Karen",
"emplname" : "Matthews",
"job" : "Software Engineer",
"manager" : 7782,
"hiredate" : "1982-01-23",
"salary" : 3300,
"commission" : 1400,
"deptcode" : 20
}

{
"_id" : ObjectId("6354e0c02e4df4ccf26c3eaf"),
"empcode" : 9591,
"empfname" : "Wendy",
"emplname" : "Shawn",
"job" : "Salesman",
"manager" : 7698,
"hiredate" : "1981-02-22",
"salary" : 500,
"commission" : 0,
"deptcode" : 30
"_id" : ObjectId("6354e0c02e4df4ccf26c3eb0"),
"empcode" : 9698,
"empfname" : "Bella",
"emplname" : "Swan",
"job" : "Manager",
"manager" : 7839,
"hiredate" : "1982-05-01",
"salary" : 3420,
"commission" : 0,
"deptcode" : 30
}

Displaying Department Details:
Query:
db.department.find().pretty() Output:
{
"_id" : ObjectId("6354e3672e4df4ccf26c3eb1"),
"deptcode" : 10,
"deptname" : "Finance",
"location" : "Edinburgh"
}
{
"_id" : ObjectId("6354e3672e4df4ccf26c3eb2"),
"deptcode" : 20,
"deptname" : "Software",
"location" : "Paddington"
}
{
"_id" : ObjectId("6354e3672e4df4ccf26c3eb3"),
"deptcode" : 30,

"deptname" : "Sales",
"location" : "Maidstone"
}
{
"_id" : ObjectId("6354e3672e4df4ccf26c3eb4"),
"deptcode" : 40,
"deptname" : "Marketing",
"location" : "Darlington"
}
{
"_id" : ObjectId("6354e3672e4df4ccf26c3eb5"),
"deptcode" : 50,
"deptname" : "Admin",
"location" : "Birmingham"
}
Q5. List all the documents of employee collection WHERE job in ("Salesman ",
"Manager"):
Solution:
The $in operator is used to select documents in which the field's value equals any of the given
values in the array.
Syntax: { field: { $in: [<value 1>, <value 2>, ... <value N> ] } } Query:
db.employee.find({job:{$in: ["Salesman", "Manager"]}}).pretty() Output:
{
"_id" : ObjectId("6354cb482e4df4ccf26c3ea4"),
"empcode" : 9499,
"empfname" : "Tim",
"emplname" : "Adolf",
"job" : "Salesman",
"manager" : 7698,
"hiredate" : "1981-02-20",
"salary" : 1600,
"commission" : 300,
"deptcode" : 30
}
{
"_id" : ObjectId("6354d1ce2e4df4ccf26c3ea5"),
"empcode" : 9566,
"empfname" : "Kim",
"emplname" : "Jarvis",
"job" : "Manager",
"manager" : 7839,
"hiredate" : "1981-04-02",
"salary" : 3570,
"commission" : 0,
"deptcode" : 20
}
{
"_id" : ObjectId("6354d1ce2e4df4ccf26c3ea6"),
"empcode" : 9654,
"empfname" : "Sam",
"emplname" : "Miles",
"job" : "Salesman",
"manager" : 7698,
"hiredate" : "1981-09-28",
"salary" : 1250,
"commission" : 1400,
"deptcode" : 10
}
{
"_id" : ObjectId("6354d1ce2e4df4ccf26c3ea7"),
"empcode" : 9782,
"empfname" : "Kevin",
"emplname" : "Hill",
"job" : "Manager",
"manager" : 7839,
"hiredate" : "1981-06-09",
"salary" : 2940,
"commission" : 0,
"deptcode" : 10
}

{
"_id" : ObjectId("6354e0bf2e4df4ccf26c3eaa"),
"empcode" : 9844,
"empfname" : "Paul",
"emplname" : "Timothy",
"job" : "Salesman",
"manager" : 7698,
"hiredate" : "1981-09-08",
"salary" : 1500,
"commission" : 0,
"deptcode" : 30
}
{
"_id" : ObjectId("6354e0c02e4df4ccf26c3eaf"),
"empcode" : 9591,
"empfname" : "Wendy",
"emplname" : "Shawn",
"job" : "Salesman",
"manager" : 7698,
"hiredate" : "1981-02-22",
"salary" : 500,
"commission" : 0,
"deptcode" : 30
}
{
"_id" : ObjectId("6354e0c02e4df4ccf26c3eb0"),
"empcode" : 9698,
"empfname" : "Bella",
"emplname" : "Swan",
"job" : "Manager",
"manager" : 7839,
"hiredate" : "1982-05-01",
"salary" : 3420,
"commission" : 0,
"deptcode" : 30
}

Q6. List all the documents of employee collection WHERE Job = " Analyst" And
Salary < 1500:
Solution:
To query documents based on the AND condition, we need to use the $and keyword. Following
is the basic syntax of AND:
db.collection_name.find({ $and: [ {<key1>:<value1>}, {<key2>:<value2>} ] })
Query:
db.employee.find({$and:[{job:"Analyst"},{salary:{$lt:1500}}]}).pretty()
No records found.

Q7.List all the documents of employee collection WHERE job = " ANALYST" or
SALARY < 1500:
Solution:
To query documents based on the OR condition, you need to use the $or keyword.
Following is the basic syntax of OR:
db.collection_name.find(
{
$or: [{key1: value1}, {key2: value2}]
}).pretty()
Query:

db.employee.find({$or:[{job:"Analyst"},{salary:{$lt:1500}}]}).pretty()
Output:
{
"_id" : ObjectId("6354d1ce2e4df4ccf26c3ea6"),
"empcode" : 9654,
"empfname" : "Sam",
"emplname" : "Miles",

"job" : "Salesman",
"manager" : 7698,
"hiredate" : "1981-09-28",
"salary" : 1250,
"commission" : 1400,
"deptcode" : 10
}
{
"_id" : ObjectId("6354d1ce2e4df4ccf26c3ea8"),
"empcode" : 9788,
"empfname" : "Connie",
"emplname" : "Smith",
"job" : "Analyst",
"manager" : 7566,
"hiredate" : "1982-12-09",
"salary" : 3000,
"commission" : 0,
"deptcode" : 20
}
{
"_id" : ObjectId("6354e0c02e4df4ccf26c3ead"),
"empcode" : 9902,
"empfname" : "Andrew",
"emplname" : "Faulkner",
"job" : "Analyst",
"manager" : 7566,
"hiredate" : "1981-12-03",
"salary" : 3000,
"commission" : 0,
"deptcode" : 10
}
{

"_id" : ObjectId("6354e0c02e4df4ccf26c3eaf"),
"empcode" : 9591,
"empfname" : "Wendy",
"emplname" : "Shawn",
"job" : "Salesman",
"manager" : 7698,
"hiredate" : "1981-02-22",
"salary" : 500,
"commission" : 0,
"deptcode" : 30
}

Q8. List all the documents of employee collection WHERE job = " ANALYST" AND
(SALARY < 1500 OR empfname LIKE "T%")
Solution:
Query:
db.employee.find({job:"Analyst",$or:[{salary:{$lt:1500}},{empfname:/^T/}]}).pretty
()
No records found.
Query:
db.employee.find({job:"Analyst",$or:[{salary:{$lt:1500}},{empfname:/^A/}]}).pretty
()
Output:
{
"_id" : ObjectId("6354e0c02e4df4ccf26c3ead"),
"empcode" : 9902,
"empfname" : "Andrew",
"emplname" : "Faulkner",
"job" : "Analyst",
"manager" : 7566,
"hiredate" : "1981-12-03",
"salary" : 3000,
"commission" : 0,
"deptcode" : 10
}

Q9. List all the documents of employee collection in ascending order of job type.
Solution:
The sort() Method: To sort documents in MongoDB, we need to use the sort() method. The
method accepts a document containing a list of fields along with their sorting order. To
specify the sorting order, 1 and -1 are used: 1 is used for ascending order, while -1 is used
for descending order.

Syntax:
The basic syntax of sort() method is as follows: db.collection_name.find().sort({KEY:1})
db.employee.find().pretty().sort({"empfname":1})
{
"_id" : ObjectId("6354e0bf2e4df4ccf26c3ea9"),
"empcode" : 9839,
"empfname" : "Alfred",
"emplname" : "Kinsley",
"job" : "President",
"manager" : 7566,
"hiredate" : "1981-11-17",
"salary" : 5000,
"commission" : 0,
"deptcode" : 10
}
{
"_id" : ObjectId("6354e0c02e4df4ccf26c3ead"),
"empcode" : 9902,
"empfname" : "Andrew",
"emplname" : "Faulkner",
"job" : "Analyst",
"manager" : 7566,
"hiredate" : "1981-12-03",
"salary" : 3000,
"commission" : 0,
"deptcode" : 10
}
{
"_id" : ObjectId("6354e0c02e4df4ccf26c3eb0"),
"empcode" : 9698,
"empfname" : "Bella",
"emplname" : "Swan",
"job" : "Manager",
"manager" : 7839,
"hiredate" : "1982-05-01",
"salary" : 3420,
"commission" : 0,
"deptcode" : 30

{
"_id" : ObjectId("6354d1ce2e4df4ccf26c3ea8"),
"empcode" : 9788,
"empfname" : "Connie",
"emplname" : "Smith",
"job" : "Analyst",
"manager" : 7566,
"hiredate" : "1982-12-09",
"salary" : 3000,
"commission" : 0,
"deptcode" : 20
}

{
"_id" : ObjectId("6354e0bf2e4df4ccf26c3eab"),
"empcode" : 9876,
"empfname" : "John",
"emplname" : "Asghar",
"job" : "Software Engineer",
"manager" : 7788,
"hiredate" : "1983-01-12",
"salary" : 3100,
"commission" : 0,
"deptcode" : 20
}
}

{
"_id" : ObjectId("6354e0c02e4df4ccf26c3eae"),
"empcode" : 9934,
"empfname" : "Karen",
"emplname" : "Matthews",
"job" : "Software Engineer",
"manager" : 7782,
"hiredate" : "1982-01-23",
"salary" : 3300,
"commission" : 1400,
"deptcode" : 20
}
{
"_id" : ObjectId("6354d1ce2e4df4ccf26c3ea7"),
"empcode" : 9782,

"empfname" : "Kevin",
"emplname" : "Hill",
"job" : "Manager",
"manager" : 7839,
"hiredate" : "1981-06-09",
"salary" : 2940,
"commission" : 0,
"deptcode" : 10
}
{
"_id" : ObjectId("6354d1ce2e4df4ccf26c3ea5"),
"empcode" : 9566,
"empfname" : "Kim",
"emplname" : "Jarvis",
"job" : "Manager",
"manager" : 7839,
"hiredate" : "1981-04-02",
"salary" : 3570,
"commission" : 0,
"deptcode" : 20

{
"_id" : ObjectId("6354e0bf2e4df4ccf26c3eaa"),
"empcode" : 9844,
"empfname" : "Paul",
"emplname" : "Timothy",
"job" : "Salesman",
"manager" : 7698,

"hiredate" : "1981-09-08",
"salary" : 1500,
"commission" : 0,
"deptcode" : 30
}
{
"_id" : ObjectId("6354e0bf2e4df4ccf26c3eac"),
"empcode" : 9900,
"empfname" : "Rose",
"emplname" : "Summers",
"job" : "Technical Lead",
"manager" : 7698,
}

"hiredate" : "1981-12-03",
"salary" : 3300,
"commission" : 0,
"deptcode" : 20
}

{
"_id" : ObjectId("6354d1ce2e4df4ccf26c3ea6"),
"empcode" : 9654,
"empfname" : "Sam",
"emplname" : "Miles",
"job" : "Salesman",
"manager" : 7698,
"hiredate" : "1981-09-28",
"salary" : 1250,
"commission" : 1400,
"deptcode" : 10
}
{
"_id" : ObjectId("6354cb482e4df4ccf26c3ea4"),
"empcode" : 9499,
"empfname" : "Tim",
"emplname" : "Adolf",
"job" : "Salesman",
"manager" : 7698,
"hiredate" : "1981-02-20",
"salary" : 1600,
"commission" : 300,
"deptcode" : 30
}
{
"_id" : ObjectId("6354c7f22e4df4ccf26c3ea3"),
"empcode" : 9369,
"empfname" : "Tony",
"emplname" : "Stark",
"job" : "Software Engineer",
"manager" : 7902,
"hiredate" : "1980-12-17",
"salary" : 2800,
"commission" : 0,
"deptcode" : 20

{
"_id" : ObjectId("6354e0c02e4df4ccf26c3eaf"),
"empcode" : 9591,

"empfname" : "Wendy",
"emplname" : "Shawn",
"job" : "Salesman",
"manager" : 7698,
"hiredate" : "1981-02-22",
"salary" : 500,
"commission" : 0,
"deptcode" : 30
}

Note: If you don't specify the sorting preference, the sort() method will display the
documents in ascending order.

db.employee.find().pretty().sort({"empfname":1})
{
"_id" : ObjectId("6354e0bf2e4df4ccf26c3ea9"),
"empcode" : 9839,
"empfname" : "Alfred",
"emplname" : "Kinsley",
"job" : "President",
"manager" : 7566,
"hiredate" : "1981-11-17",
"salary" : 5000,
"commission" : 0,
"deptcode" : 10
}
{

"_id" : ObjectId("6354e0c02e4df4ccf26c3ead"),
"empcode" : 9902,
"empfname" : "Andrew",
"emplname" : "Faulkner",
"job" : "Analyst",
"manager" : 7566,
"hiredate" : "1981-12-03",
"salary" : 3000,
"commission" : 0,
"deptcode" : 10
}
{
"_id" : ObjectId("6354e0c02e4df4ccf26c3eb0"),
"empcode" : 9698,
"empfname" : "Bella",
"emplname" : "Swan",
"job" : "Manager",
"manager" : 7839,
"hiredate" : "1982-05-01",
"salary" : 3420,
"commission" : 0,
"deptcode" : 30
}
{
"_id" : ObjectId("6354d1ce2e4df4ccf26c3ea8"),
"empcode" : 9788,
"empfname" : "Connie",
"emplname" : "Smith",
"job" : "Analyst",
"manager" : 7566,
"hiredate" : "1982-12-09",
"salary" : 3000,
"commission" : 0,
"deptcode" : 20
}
{
"_id" : ObjectId("6354e0bf2e4df4ccf26c3eab"),
"empcode" : 9876,
"empfname" : "John",
"emplname" : "Asghar",
"job" : "Software Engineer",
"manager" : 7788,
"hiredate" : "1983-01-12",
"salary" : 3100,
"commission" : 0,
"deptcode" : 20
}
{
"_id" : ObjectId("6354e0c02e4df4ccf26c3eae"),
"empcode" : 9934,
"empfname" : "Karen",
"emplname" : "Matthews",
{

"job" : "Software Engineer",


"manager" : 7782,
"hiredate" : "1982-01-23",
"salary" : 3300,
"commission" : 1400,
"deptcode" : 20
}
"_id" : ObjectId("6354d1ce2e4df4ccf26c3ea7"),
"empcode" : 9782,
"empfname" : "Kevin",
"emplname" : "Hill",
"job" : "Manager",
"manager" : 7839,
"hiredate" : "1981-06-09",
"salary" : 2940,
"commission" : 0,
"deptcode" : 10
}
{
"_id" : ObjectId("6354d1ce2e4df4ccf26c3ea5"),
"empcode" : 9566,
"empfname" : "Kim",
"emplname" : "Jarvis",
"job" : "Manager",
"manager" : 7839,
"hiredate" : "1981-04-02",
"salary" : 3570,
"commission" : 0,
"deptcode" : 20
}
{
"_id" : ObjectId("6354e0bf2e4df4ccf26c3eaa"),
"empcode" : 9844,
"empfname" : "Paul",
"emplname" : "Timothy",
"job" : "Salesman",
"manager" : 7698,
"hiredate" : "1981-09-08",
"salary" : 1500,
"commission" : 0,
"deptcode" : 30
}
{
"_id" : ObjectId("6354e0bf2e4df4ccf26c3eac"),
"empcode" : 9900,
"empfname" : "Rose",
"emplname" : "Summers",
"job" : "Technical Lead",
"manager" : 7698,
"hiredate" : "1981-12-03",
"salary" : 3300,
"commission" : 0,
{

"deptcode" : 20
}
{
"_id" : ObjectId("6354d1ce2e4df4ccf26c3ea6"),
"empcode" : 9654,
"empfname" : "Sam",
"emplname" : "Miles",
"job" : "Salesman",
"manager" : 7698,
"hiredate" : "1981-09-28",
"salary" : 1250,
"commission" : 1400,
"deptcode" : 10
}

"_id" : ObjectId("6354cb482e4df4ccf26c3ea4"),
"empcode" : 9499,
"empfname" : "Tim",
"emplname" : "Adolf",
"job" : "Salesman",
"manager" : 7698,
"hiredate" : "1981-02-20",
"salary" : 1600,
"commission" : 300,
"deptcode" : 30
}
{
"_id" : ObjectId("6354c7f22e4df4ccf26c3ea3"),
"empcode" : 9369,
"empfname" : "Tony",
"emplname" : "Stark",
"job" : "Software Engineer",
"manager" : 7902,

"hiredate" : "1980-12-17",
"salary" : 2800,
"commission" : 0,
"deptcode" : 20
}
{
"_id" : ObjectId("6354e0c02e4df4ccf26c3eaf"),
"empcode" : 9591,
"empfname" : "Wendy",
"emplname" : "Shawn",
"job" : "Salesman",
"manager" : 7698,
"hiredate" : "1981-02-22",
"salary" : 500,
"commission" : 0,
"deptcode" : 30
}
> db.employee.find().pretty().sort({"empfname":-1})
{
"_id" : ObjectId("6354e0c02e4df4ccf26c3eaf"),
{

"empcode" : 9591,
"empfname" : "Wendy",
"emplname" : "Shawn",
"job" : "Salesman",
"manager" : 7698,
"hiredate" : "1981-02-22",
"salary" : 500,
"commission" : 0,
"deptcode" : 30
}
{
"_id" : ObjectId("6354c7f22e4df4ccf26c3ea3"),
"empcode" : 9369,
"empfname" : "Tony",
"emplname" : "Stark",
"job" : "Software Engineer",
"manager" : 7902,
"hiredate" : "1980-12-17",
"salary" : 2800,
"commission" : 0,
"deptcode" : 20
"_id" : ObjectId("6354cb482e4df4ccf26c3ea4"),
"empcode" : 9499,
"empfname" : "Tim",
"emplname" : "Adolf",
"job" : "Salesman",
"manager" : 7698,
"hiredate" : "1981-02-20",
"salary" : 1600,
"commission" : 300,
"deptcode" : 30
}
{
"_id" : ObjectId("6354d1ce2e4df4ccf26c3ea6"),
"empcode" : 9654,

"empfname" : "Sam",
"emplname" : "Miles",
"job" : "Salesman",
"manager" : 7698,
"hiredate" : "1981-09-28",
"salary" : 1250,
"commission" : 1400,
"deptcode" : 10
}
{
"_id" : ObjectId("6354e0bf2e4df4ccf26c3eac"),
"empcode" : 9900,
}
{

"empfname" : "Rose",
"emplname" : "Summers",
"job" : "Technical Lead",
"manager" : 7698,
"hiredate" : "1981-12-03",
"salary" : 3300,
"commission" : 0,
"deptcode" : 20
}
{
"_id" : ObjectId("6354e0bf2e4df4ccf26c3eaa"),
"empcode" : 9844,
"empfname" : "Paul",
"emplname" : "Timothy",
"job" : "Salesman",
"manager" : 7698,
"hiredate" : "1981-09-08",
"salary" : 1500,
"commission" : 0,
"deptcode" : 30
}
{

"_id" : ObjectId("6354d1ce2e4df4ccf26c3ea5"),
"empcode" : 9566,
"empfname" : "Kim",
"emplname" : "Jarvis",
"job" : "Manager",
"manager" : 7839,
"hiredate" : "1981-04-02",
"salary" : 3570,
"commission" : 0,
"deptcode" : 20
"_id" : ObjectId("6354d1ce2e4df4ccf26c3ea7"),
"empcode" : 9782,
"empfname" : "Kevin",
"emplname" : "Hill",
"job" : "Manager",
"manager" : 7839,
"hiredate" : "1981-06-09",
"salary" : 2940,
"commission" : 0,
"deptcode" : 10
}
{
"_id" : ObjectId("6354e0c02e4df4ccf26c3eae"),
"empcode" : 9934,
"empfname" : "Karen",
"emplname" : "Matthews",
"job" : "Software Engineer",
"manager" : 7782,
"hiredate" : "1982-01-23",
"salary" : 3300,
}
{

"commission" : 1400,
"deptcode" : 20
}
{
"_id" : ObjectId("6354e0bf2e4df4ccf26c3eab"),
"empcode" : 9876,
"empfname" : "John",
"emplname" : "Asghar",
"job" : "Software Engineer",
"manager" : 7788,
"hiredate" : "1983-01-12",
"salary" : 3100,
"commission" : 0,
"deptcode" : 20
}
{
"_id" : ObjectId("6354d1ce2e4df4ccf26c3ea8"),
"empcode" : 9788,
"empfname" : "Connie",
"emplname" : "Smith",
"job" : "Analyst",
"manager" : 7566,
"hiredate" : "1982-12-09",
"salary" : 3000,
"commission" : 0,
"deptcode" : 20
}
{
"_id" : ObjectId("6354e0c02e4df4ccf26c3eb0"),
"empcode" : 9698,
"empfname" : "Bella", "emplname" :
"Swan",
"job" : "Manager",
"manager" : 7839,

"hiredate" : "1982-05-01",
"salary" : 3420,
"commission" : 0,
"deptcode" : 30
"_id" : ObjectId("6354e0c02e4df4ccf26c3ead"),
"empcode" : 9902,
"empfname" : "Andrew",
"emplname" : "Faulkner",
"job" : "Analyst",
"manager" : 7566,

"hiredate" : "1981-12-03",
"salary" : 3000,
"commission" : 0,
"deptcode" : 10
}
{
"_id" : ObjectId("6354e0bf2e4df4ccf26c3ea9"),
"empcode" : 9839,
}
{

"empfname" : "Alfred",
"emplname" : "Kinsley",
"job" : "President",
"manager" : 7566,
"hiredate" : "1981-11-17",
"salary" : 5000,
"commission" : 0,
"deptcode" : 10
}

Q10.Create index on empcode field of employee collection.


Solution:
Indexes are special data structures that store a small portion of the data set in an easy-to-traverse
form. The index stores the value of a specific field or set of fields, ordered by the value of the
field as specified in the index.
The createIndex() Method: To create an index, we need to use the createIndex() method of MongoDB.
Syntax: db.collection_name.createIndex({KEY:1})
Note: Here key is the name of the field on which you want to create index and 1 is for
ascending order. To create index in descending order you need to use -1.
Query:
db.employee.createIndex({empcode:1}) Output:
{
"createdCollectionAutomatically" : false,
"numIndexesBefore" : 1,
"numIndexesAfter" : 2,
"ok" : 1
}

Q11. Create multiple index on empcode and hiredate fields of employee collection.
Solution:
Query:
db.employee.createIndex({empcode:1,hiredate:-1})
Output:
{
"createdCollectionAutomatically" : false,
"numIndexesBefore" : 2,
"numIndexesAfter" : 3,
"ok" : 1
}
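
The indexes present on a collection can be verified with the getIndexes() method, which lists
the default _id index together with the indexes created above:

> db.employee.getIndexes()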
Q12.Delete first record of employee collection where job is SALESMAN.

Solution:
The remove() Method: MongoDB's remove() method is used to remove a document from the
collection. remove() method accepts two parameters. One is deletion criteria and second is
justOne flag.

deletion criteria − (Optional) deletion criteria according to documents will be removed. justOne

− (Optional) if set to true or 1, then remove only one document.

Syntax: db.collection_name.remove(deletion_criteria)
Removing only one document: If there are multiple matching records and you want to delete only the
first record, then set the justOne parameter in the remove() method.
Syntax: db.collection_name.remove(deletion_criteria, 1)
Query:

db.employee.remove({job:"Salesman"},1) WriteResult({ "nRemoved" : 1 })


One record with the job title "Salesman" is deleted from the collection "employee". After firing the
find() command, you will see that the record removed is Tim's record from the employee collection.
> db.employee.find().pretty()
{
"_id" : ObjectId("6354c7f22e4df4ccf26c3ea3"),
"empcode" : 9369,
"empfname" : "Tony",
"emplname" : "Stark",
"job" : "Software Engineer",
"manager" : 7902,
"hiredate" : "1980-12-17",
"salary" : 2800,
"commission" : 0,
"deptcode" : 20
}
{
"_id" : ObjectId("6354d1ce2e4df4ccf26c3ea5"),
"empcode" : 9566,

"empfname" : "Kim",
"emplname" : "Jarvis",
"job" : "Manager",
"manager" : 7839,
"hiredate" : "1981-04-02",
"salary" : 3570,
"commission" : 0,

"deptcode" : 20
}
{
"_id" : ObjectId("6354d1ce2e4df4ccf26c3ea6"),
"empcode" : 9654,
"empfname" : "Sam",
"emplname" : "Miles",
"job" : "Salesman",
"manager" : 7698,
"hiredate" : "1981-09-28",
"salary" : 1250,
"commission" : 1400,
"deptcode" : 10
}
{
"_id" : ObjectId("6354d1ce2e4df4ccf26c3ea7"),
"empcode" : 9782,
"empfname" : "Kevin",
"emplname" : "Hill",
"job" : "Manager",
"manager" : 7839,
"hiredate" : "1981-06-09",
"salary" : 2940,
"commission" : 0,
"deptcode" : 10
}
{
"_id" : ObjectId("6354d1ce2e4df4ccf26c3ea8"),
"empcode" : 9788,

"empfname" : "Connie",
"emplname" : "Smith",
"job" : "Analyst",
"manager" : 7566,
"hiredate" : "1982-12-09",
"salary" : 3000,
"commission" : 0,
"deptcode" : 20
}
{
"_id" : ObjectId("6354e0bf2e4df4ccf26c3ea9"),
"empcode" : 9839,
"empfname" : "Alfred",
"emplname" : "Kinsley",
"job" : "President",
"manager" : 7566,
"hiredate" : "1981-11-17",
"salary" : 5000,

"commission" : 0,
"deptcode" : 10
}

{
"_id" : ObjectId("6354e0bf2e4df4ccf26c3eaa"),
"empcode" : 9844,
"empfname" : "Paul",
"emplname" : "Timothy",
"job" : "Salesman",
"manager" : 7698,
"hiredate" : "1981-09-08",
"salary" : 1500,
"commission" : 0,
"deptcode" : 30
}
{

"_id" : ObjectId("6354e0bf2e4df4ccf26c3eab"),
"empcode" : 9876,
"empfname" : "John",
"emplname" : "Asghar",
"job" : "Software Engineer",
"manager" : 7788,
"hiredate" : "1983-01-12",
"salary" : 3100,
"commission" : 0,
"deptcode" : 20
}
{
"_id" : ObjectId("6354e0bf2e4df4ccf26c3eac"),
"empcode" : 9900,
"empfname" : "Rose",
"emplname" : "Summers",
"job" : "Technical Lead",
"manager" : 7698,
"hiredate" : "1981-12-03",
"salary" : 3300,
"commission" : 0,
"deptcode" : 20
}
{
"_id" : ObjectId("6354e0c02e4df4ccf26c3ead"),
"empcode" : 9902,
"empfname" : "Andrew",
"emplname" : "Faulkner",
"job" : "Analyst",
"manager" : 7566,
"hiredate" : "1981-12-03",
"salary" : 3000,
"commission" : 0,
"deptcode" : 10
}
{
"_id" : ObjectId("6354e0c02e4df4ccf26c3eae"),

"empcode" : 9934,
"empfname" : "Karen",
"emplname" : "Matthews",
"job" : "Software Engineer",
"manager" : 7782,
"hiredate" : "1982-01-23",
"salary" : 3300,
"commission" : 1400,
"deptcode" : 20
}
{
"_id" : ObjectId("6354e0c02e4df4ccf26c3eaf"),
"empcode" : 9591,
"empfname" : "Wendy",
"emplname" : "Shawn",
"job" : "Salesman",
"manager" : 7698,
"hiredate" : "1981-02-22",
"salary" : 500,
"commission" : 0,
"deptcode" : 30
}
{
"_id" : ObjectId("6354e0c02e4df4ccf26c3eb0"),
"empcode" : 9698,
"empfname" : "Bella",
"emplname" : "Swan",
"job" : "Manager",
"manager" : 7839,
"hiredate" : "1982-05-01",
"salary" : 3420,
"commission" : 0,
"deptcode" : 30
}

Q13.Delete all the record of employee collection where job is SALESMAN.


Solution:
Query:
db.employee.remove({job:"Salesman"})
WriteResult({ "nRemoved" : 3 })

Records with the job title "Salesman" are deleted from the employee collection.


> db.employee.find().pretty()
{
"_id" : ObjectId("6354c7f22e4df4ccf26c3ea3"),
"empcode" : 9369,
"empfname" : "Tony",
"emplname" : "Stark",
"job" : "Software Engineer",
"manager" : 7902,
"hiredate" : "1980-12-17",
"salary" : 2800,
"commission" : 0,

"deptcode" : 20
}
{
"_id" : ObjectId("6354d1ce2e4df4ccf26c3ea5"),
"empcode" : 9566,
"empfname" : "Kim",
"emplname" : "Jarvis",
"job" : "Manager",
"manager" : 7839,
"hiredate" : "1981-04-02",
"salary" : 3570,
"commission" : 0,
"deptcode" : 20
}
{
"_id" : ObjectId("6354d1ce2e4df4ccf26c3ea7"),
"empcode" : 9782,
"empfname" : "Kevin",
"emplname" : "Hill",
"job" : "Manager",
"manager" : 7839,
"hiredate" : "1981-06-09",
"salary" : 2940,
"commission" : 0,

"deptcode" : 10
}
{

"_id" : ObjectId("6354d1ce2e4df4ccf26c3ea8"),
"empcode" : 9788,
"empfname" : "Connie",
"emplname" : "Smith",
"job" : "Analyst",
"manager" : 7566,
"hiredate" : "1982-12-09",
"salary" : 3000,
"commission" : 0,
"deptcode" : 20
}
{
"_id" : ObjectId("6354e0bf2e4df4ccf26c3ea9"),
"empcode" : 9839,

"empfname" : "Alfred",
"emplname" : "Kinsley",
"job" : "President",
"manager" : 7566,
"hiredate" : "1981-11-17",
"salary" : 5000,
"commission" : 0,
"deptcode" : 10
}
{
"_id" : ObjectId("6354e0bf2e4df4ccf26c3eab"),
"empcode" : 9876,
"empfname" : "John",
"emplname" : "Asghar",
"job" : "Software Engineer",
"manager" : 7788,
"hiredate" : "1983-01-12",
"salary" : 3100,
"commission" : 0,
"deptcode" : 20
}
{
"_id" : ObjectId("6354e0bf2e4df4ccf26c3eac"),
"empcode" : 9900,
"empfname" : "Rose",
"emplname" : "Summers",
"job" : "Technical Lead",
"manager" : 7698,
"hiredate" : "1981-12-03",
"salary" : 3300,
"commission" : 0,
"deptcode" : 20
}

{
"_id" : ObjectId("6354e0c02e4df4ccf26c3ead"),
"empcode" : 9902,

"empfname" : "Andrew",
"emplname" : "Faulkner",
"job" : "Analyst",
"manager" : 7566,
"hiredate" : "1981-12-03",
"salary" : 3000,
"commission" : 0,
"deptcode" : 10
}
{
"_id" : ObjectId("6354e0c02e4df4ccf26c3eae"),
"empcode" : 9934,
"empfname" : "Karen",
"emplname" : "Matthews",
"job" : "Software Engineer",
"manager" : 7782,
"hiredate" : "1982-01-23",
"salary" : 3300,
"commission" : 1400,
"deptcode" : 20
}
{
"_id" : ObjectId("6354e0c02e4df4ccf26c3eb0"),
"empcode" : 9698,
"empfname" : "Bella",
"emplname" : "Swan",
"job" : "Manager",
"manager" : 7839,
"hiredate" : "1982-05-01",
"salary" : 3420,
"commission" : 0,
"deptcode" : 30
}

Q14.Delete all the record of employee collection.


Solution:

Query:
> db.employee.remove({})
WriteResult({ "nRemoved" : 10 })

> db.employee.find().pretty()
No records found.

Experiment Number 4
Hive:
1. Hive Data Types
2. Create Database & Table in Hive
3. Hive Partitioning
4. Hive Built-In Operators
5. Hive Built-In Functions
6. Hive Views and Indexes
7. HiveQL : Select Where, Select OrderBy, Select GroupBy, Select Joins
1. Create Database with name College
Query: create database timscdr;

2. Create Hive managed tables, employee and dept, under the database college.

Employee table:
empcode INT, empfname STRING, emplname STRING, job STRING, manager STRING,
hiredate STRING, salary INT, commission INT

Query:
> create table timscdr.employee (empcode int, empfname string, emplname string, job string,
manager string, hiredate string, salary int, commission int)
> row format delimited
> fields terminated by ',' ;

Dept table:
deptcode INT, deptname STRING, location STRING
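
The dept table can be created following the same pattern as the employee table (a sketch using
the same database):

> create table timscdr.dept (deptcode int, deptname string, location string)
> row format delimited
> fields terminated by ',' ;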

3. Load data in the tables employee and dept using .csv files
Steps:
First create an empty document in Cloudera, paste the data into it, and save it with a .csv
extension. Emp_data:
Command: load data local inpath '/home/cloudera/emp_data.csv' into table timscdr.employee;

Dept_data

4. List all the records in a table


Emp_data

Dept_data
5. Get the table structure
Timscdr.employee

Timscdr.dept

6. Perform following operation on employee tables


a. Prepare a list of all the jobs and their respective number of employees
working for that job.

b. Display all the MANAGERS in order where the one who served the most
appears first.
Query:
> SELECT empfname, emplname, job, hiredate
> FROM timscdr.employee
> WHERE job = 'MANAGER' order by hiredate asc;
c. Display the FULL NAME and SALARY drawn by an analyst working in
dept no 20
Query:
select empfname, emplname, salary, deptcode
> from timscdr.employee
> where job = 'ANALYST' and deptcode = 20;

7. Perform following joins between employee and dept table using key
deptcode.
a. Left Join
Query:

select e.empfname, e.deptcode, d.deptname, d.deptcode from timscdr.employee e left outer
join timscdr.dept d on e.deptcode = d.deptcode;

b. Right Join
Query:
select e.empfname, e.deptcode, d.deptname, d.deptcode from timscdr.employee e right outer
join timscdr.dept d on e.deptcode = d.deptcode;
c. Full Join
Query:
Select e.empfname, e.deptcode, d.deptname, d.deptcode from timscdr.employee e full outer
join timscdr.dept d on e.deptcode = d.deptcode;

d. Inner Join
Query:
select e.empfname, e.deptcode, d.deptname, d.deptcode from timscdr.employee e join
timscdr.dept d on e.deptcode = d.deptcode;
8. Create a view on the employee table for employees whose salary is greater than 2000, and then drop that view.
Create View:
> create view timscdr.employee_vw as
> select e.empfname, e.job, e.salary
> from timscdr.employee e
> where e.salary > 2000
> and e.job not like 'M%';

Drop View
> drop view timscdr.employee_vw;

9. Create an index on the employee table, then drop that index.


Query:
Index creation
> create index index_emp
> on table timscdr.employee(empcode)
> as 'bitmap' with deferred rebuild;
Drop Index
> drop index index_emp
> on timscdr.employee;

10. Hive Partitioning


a. Student table
roll INT, stfname STRING, stlname STRING, course STRING
Query:
> create table timscdr.student
> ( roll int, stfname string, stlname string, course string )
> row format delimited

> fields terminated by ',';

Load data in Table student


> load data local inpath '/home/cloudera/stude.csv'
> into table timscdr.student;

> select * from timscdr.student;



Create a partitioned table stud_course based on course, and demonstrate static and dynamic partitioning.
Static Partitioning
Query:
> create table timscdr.stud_course (roll int, stfname string, stlname string) partitioned by (course string) row format delimited fields terminated by ',';

> load data local inpath '/home/cloudera/java.csv' into table stud_course partition (course = "JAVA");
Dynamic Partitioning
Query:
> use demo;
> load data local inpath '/home/cloudera/stude.csv' into table stud_demo;
> create table student_part (roll int, stfname string, stlname string) partitioned by (course string) row format delimited fields terminated by ',';
Enable dynamic partitioning before inserting (these settings are required for a fully dynamic partition insert):
> set hive.exec.dynamic.partition=true;
> set hive.exec.dynamic.partition.mode=nonstrict;

Insert the data of the dummy table into the partitioned table:
> insert into table student_part partition (course) select roll, stfname, stlname, course from stud_demo;


Experiment Number 5
Pig:
1. Pig Latin Basic
2. Pig Data Types,
3. Download the data
4. Create your Script
5. Save and Execute the Script
6. Pig Operations: Diagnostic Operators, Grouping and Joining, Combining &
Splitting, Filtering, Sorting



PIG INSTALLATION

Prerequisites:
It is essential that you have Hadoop and Java installed on your system before you go for Apache
Pig. Therefore, prior to installing Apache Pig, install Hadoop and Java by following the steps
given in the Hadoop installation document.

Step 1: Download Apache Pig


Open the homepage of Apache Pig website. Under the section News, click on the link release
page as shown in the following snapshot.

Click on release page.

Step 2: On clicking the specified link, you will be redirected to the Apache Pig Releases page.
Click on Download release now. Clicking this link will direct you to a mirror page.

Step 3: Choose and click any one of these mirrors as shown below.

Step 4: These mirrors will take you to the Pig Releases page. This page contains various versions
of Apache Pig. Click the latest version among them.
Step 5: Within these folders, you will have the source and binary files of Apache Pig in various
distributions. Download the tar files of the source and binary files of Apache Pig 0.17: pig-0.17.0-src.tar.gz and pig-0.17.0.tar.gz.

Step 6: Download the pig-0.17.0.tar.gz version of Pig, which is the latest and about 220 MB in
size. Now move the downloaded Pig tar file to your desired location. In this case, move it to
the /Documents folder.
Using cd command: cd Documents/
Step 7: Extract this tar file with the help of below command (make sure to check your tar
filename):
tar -xvf pig-0.17.0.tar.gz

Step 8: Once it is extracted, it is time for us to switch to our Hadoop user. In this case the user is hadoop
and the password is admin.
Command: su - hadoop
Step 9: We need to move this extracted folder to /usr/local/. For that, use the below
command (make sure the name of your extracted folder is pig-0.17.0, otherwise change it
accordingly).
Command: sudo mv pig-0.17.0 /usr/local/

Step 10: Now that we have moved it, we need to set the environment variables for Pig's location.
For that, open the .bashrc file with the below command.
Command: sudo gedit ~/.bashrc

Once the file opens, add the below lines to the .bashrc file and save it.

#Pig location
export PIG_INSTALL=/usr/local/pig-0.17.0
export PATH=$PATH:/usr/local/pig-0.17.0/bin

Create JAVA_HOME variable

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64

Step 11: Then reload the .bashrc file so that the new variables take effect, using the below command:
source ~/.bashrc

Step 12: Pig is now successfully installed on our Hadoop single-node setup; start Pig with the below
command.
Command: pig -x local

Experiment Number 6
Spark:
1. Downloading Data Set and Processing it in Spark
2. Word Count in Apache Spark.

1. SPARK – INTRODUCTION

Industries are using Hadoop extensively to analyze their data sets. The reason is that Hadoop
framework is based on a simple programming model (MapReduce) and it enables a computing
solution that is scalable, flexible, fault-tolerant and cost effective. Here, the main concern is to
maintain speed in processing large datasets in terms of waiting time between queries and
waiting time to run the program.
Spark was introduced by Apache Software Foundation for speeding up the Hadoop
computational computing software process.
As against a common belief, Spark is not a modified version of Hadoop and is not, really,
dependent on Hadoop because it has its own cluster management. Hadoop is just one of the
ways to implement Spark.
Spark uses Hadoop in two ways – one is storage and second is processing. Since Spark has its
own cluster management computation, it uses Hadoop for storage purpose only.

Apache Spark

Apache Spark is a lightning-fast cluster computing technology, designed for fast computation.
It is based on Hadoop MapReduce and it extends the MapReduce model to efficiently use it for
more types of computations, which includes interactive queries and stream processing. The
main feature of Spark is its in-memory cluster computing that increases the processing speed
of an application.
Spark is designed to cover a wide range of workloads such as batch applications, iterative
algorithms, interactive queries and streaming. Apart from supporting all these workload in a
respective system, it reduces the management burden of maintaining separate tools.

Evolution of Apache Spark

Spark is one of Hadoop’s sub project developed in 2009 in UC Berkeley’s AMPLab by Matei
Zaharia. It was Open Sourced in 2010 under a BSD license. It was donated to Apache software
foundation in 2013, and now Apache Spark has become a top level Apache project from Feb-
2014.
Features of Apache Spark

Apache Spark has following features.


• Speed: Spark helps to run an application in Hadoop cluster, up to 100 times faster in
memory, and 10 times faster when running on disk. This is possible by reducing number
of read/write operations to disk. It stores the intermediate processing data in memory.

• Supports multiple languages: Spark provides built-in APIs in Java, Scala, or Python.
Therefore, you can write applications in different languages. Spark comes up with 80
high-level operators for interactive querying.

• Advanced Analytics: Spark not only supports ‘Map’ and ‘reduce’. It also supports SQL
queries, Streaming data, Machine learning (ML), and Graph algorithms.

Spark Built on Hadoop

The following diagram shows three ways of how Spark can be built with Hadoop components.

There are three ways of Spark deployment as explained below.


• Standalone: Spark Standalone deployment means Spark occupies the place on top of
HDFS(Hadoop Distributed File System) and space is allocated for HDFS, explicitly.
Here, Spark and MapReduce will run side by side to cover all spark jobs on cluster.

• Hadoop Yarn: Hadoop Yarn deployment means, simply, spark runs on Yarn without
any pre-installation or root access required. It helps to integrate Spark into Hadoop
ecosystem or Hadoop stack. It allows other components to run on top of stack.
• Spark in MapReduce (SIMR): Spark in MapReduce is used to launch spark job in
addition to standalone deployment. With SIMR, user can start Spark and uses its shell
without any administrative access.

Components of Spark

The following illustration depicts the different components of Spark.

Apache Spark Core


Spark Core is the underlying general execution engine for spark platform that all other
functionality is built upon. It provides In-Memory computing and referencing datasets in
external storage systems.

Spark SQL
Spark SQL is a component on top of Spark Core that introduces a new data abstraction called
SchemaRDD, which provides support for structured and semi-structured data.
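A minimal sketch of using this component from the spark-shell, assuming a Spark 1.x setup and a JSON file named people.json (one record per line, with a name field) that exists only for illustration:

scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
scala> val people = sqlContext.jsonFile("people.json")        // infers a schema from the JSON records
scala> people.registerTempTable("people")                     // expose the structured data to SQL
scala> sqlContext.sql("SELECT name FROM people").show()       // query it with a SQL statement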

Spark Streaming
Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming
analytics. It ingests data in mini-batches and performs RDD (Resilient Distributed Datasets)
transformations on those mini-batches of data.
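The mini-batch model can be sketched in a few lines of spark-shell code; the hostname, port, and the 10-second batch interval below are illustrative assumptions, not part of the original text:

scala> import org.apache.spark.streaming.{Seconds, StreamingContext}
scala> val ssc = new StreamingContext(sc, Seconds(10))                       // mini-batches of 10 seconds
scala> val lines = ssc.socketTextStream("localhost", 9999)                   // ingest a text stream
scala> lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()    // RDD-style transformations on each batch
scala> ssc.start()                                                           // start receiving and processing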

MLlib (Machine Learning Library)


MLlib is a distributed machine learning framework above Spark because of the distributed
memory-based Spark architecture. According to benchmarks done by the MLlib developers
against the Alternating Least Squares (ALS) implementations, Spark MLlib is nine times as
fast as the Hadoop disk-based version of Apache Mahout (before Mahout gained a Spark
interface).

GraphX
GraphX is a distributed graph-processing framework on top of Spark. It provides an API for
expressing graph computation that can model the user-defined graphs by using Pregel
abstraction API. It also provides an optimized runtime for this abstraction.

2. SPARK – RDD

Resilient Distributed Datasets

Resilient Distributed Datasets (RDD) is a fundamental data structure of Spark. It is an


immutable distributed collection of objects. Each dataset in RDD is divided into logical
partitions, which may be computed on different nodes of the cluster. RDDs can contain any
type of Python, Java, or Scala objects, including user-defined classes.
Formally, an RDD is a read-only, partitioned collection of records. RDDs can be created
through deterministic operations on either data on stable storage or other RDDs. RDD is a fault-
tolerant collection of elements that can be operated on in parallel.
There are two ways to create RDDs: parallelizing an existing collection in your driver
program, or referencing a dataset in an external storage system, such as a shared file system,
HDFS, HBase, or any data source offering a Hadoop Input Format.
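As a quick sketch of these two creation approaches in the spark-shell (the file name sample.txt is just an assumption for illustration):

scala> val fromCollection = sc.parallelize(Seq(1, 2, 3, 4, 5))   // RDD by parallelizing an existing collection
scala> val fromFile = sc.textFile("sample.txt")                  // RDD by referencing an external text file
scala> fromCollection.count()                                    // simple action on the parallelized RDD, returns 5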
Spark makes use of the concept of RDD to achieve faster and efficient MapReduce operations.
Let us first discuss how MapReduce operations take place and why they are not so efficient.

Data Sharing is Slow in MapReduce

MapReduce is widely adopted for processing and generating large datasets with a parallel,
distributed algorithm on a cluster. It allows users to write parallel computations, using a set of
high-level operators, without having to worry about work distribution and fault tolerance.
Unfortunately, in most current frameworks, the only way to reuse data between computations
(Ex: between two MapReduce jobs) is to write it to an external stable storage system (Ex:
HDFS). Although this framework provides numerous abstractions for accessing a cluster’s
computational resources, users still want more.
Both Iterative and Interactive applications require faster data sharing across parallel jobs.
Data sharing is slow in MapReduce due to replication, serialization, and disk I/O. Regarding
the storage system, most Hadoop applications spend more than 90% of their time doing
HDFS read-write operations.

Iterative Operations on MapReduce

In multi-stage applications, intermediate results are reused across multiple computations. The
following illustration explains how the current framework works while doing iterative
operations on MapReduce. This incurs substantial overheads due to data replication, disk I/O,
and serialization, which makes the system slow.

Figure: Iterative operations on MapReduce

Interactive Operations on MapReduce

Users run ad-hoc queries on the same subset of data. Each query will do disk I/O on the stable
storage, which can dominate application execution time.
The following illustration explains how the current framework works while doing the interactive
queries on MapReduce.

Figure: Interactive operations on MapReduce


Data Sharing using Spark RDD

Data sharing is slow in MapReduce due to replication, serialization, and disk I/O. Most
Hadoop applications spend more than 90% of their time doing HDFS read-write operations.

Recognizing this problem, researchers developed a specialized framework called Apache


Spark. The key idea of Spark is Resilient Distributed Datasets (RDD); it supports in-memory
processing computation. This means it stores the state of memory as an object across the jobs,
and the object is sharable between those jobs. Data sharing in memory is 10 to 100 times faster
than network and disk.
Let us now try to find out how iterative and interactive operations take place in Spark RDD.

Iterative Operations on Spark RDD

The illustration given below shows the iterative operations on Spark RDD. It will store
intermediate results in a distributed memory instead of Stable storage (Disk) and make the
system faster.
Note: If the distributed memory (RAM) is not sufficient to store intermediate results (state of the
JOB), then it will store those results on the disk.
Figure: Iterative operations on Spark RDD
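To make the idea of reusing intermediate results concrete, here is a minimal spark-shell sketch (the data and the storage level are illustrative, not part of the original example):

scala> import org.apache.spark.storage.StorageLevel
scala> val data = sc.parallelize(1 to 1000)
scala> val squared = data.map(x => x * x).persist(StorageLevel.MEMORY_ONLY)   // keep the intermediate RDD in distributed memory
scala> squared.count()    // first action computes the RDD and caches its partitions
scala> squared.sum()      // second action reuses the cached partitions instead of recomputing them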

Interactive Operations on Spark RDD

This illustration shows interactive operations on Spark RDD. If different queries are run on the
same set of data repeatedly, this particular data can be kept in memory for better execution
times.

Figure: Interactive operations on Spark RDD

By default, each transformed RDD may be recomputed each time you run an action on it.
However, you may also persist an RDD in memory, in which case Spark will keep the elements
around on the cluster for much faster access, the next time you query it. There is also support
for persisting RDDs on disk, or replicated across multiple nodes.
3. SPARK – INSTALLATION

Spark is Hadoop’s sub-project. Therefore, it is better to install Spark into a Linux based system.
The following steps show how to install Apache Spark.

Step 1: Verifying Java Installation

Java installation is one of the mandatory things in installing Spark. Try the following command
to verify the JAVA version.

$java -version

If Java is already installed on your system, you get to see the following response –

java version "1.7.0_71"


Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)

In case you do not have Java installed on your system, then Install Java before proceeding to
next step.

Step 2: Verifying Scala installation

You should use the Scala language to implement Spark. So let us verify the Scala installation using
the following command.

$scala -version

If Scala is already installed on your system, you get to see the following response –

Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL

In case you don’t have Scala installed on your system, then proceed to next step for Scala
installation.

Step 3: Downloading Scala

Download the latest version of Scala by visiting the following link: Download Scala. For this
tutorial, we are using scala-2.11.6 version. After downloading, you will find the Scala tar file
in the download folder.
Step 4: Installing Scala

Follow the below given steps for installing Scala.

Extract the Scala tar file


Type the following command for extracting the Scala tar file.

$ tar xvf scala-2.11.6.tgz

Move Scala software files


Use the following commands for moving the Scala software files, to respective directory
(/usr/local/scala).

$ su -
Password:
# cd /home/Hadoop/Downloads/
# mv scala-2.11.6 /usr/local/scala
# exit

Set PATH for Scala


Use the following command for setting PATH for Scala.

$ export PATH=$PATH:/usr/local/scala/bin

Verifying Scala Installation


After installation, it is better to verify it. Use the following command for verifying Scala
installation.

$scala -version

If Scala is already installed on your system, you get to see the following response –

Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL

Step 5: Downloading Apache Spark


Download the latest version of Spark by visiting the following link Download Spark. For this
tutorial, we are using spark-1.3.1-bin-hadoop2.6 version. After downloading it, you will find
the Spark tar file in the download folder.


Step 6: Installing Spark

Follow the steps given below for installing Spark.

Extracting Spark tar


Use the following command to extract the Spark tar file.

$ tar xvf spark-1.3.1-bin-hadoop2.6.tgz

Moving Spark software files


Use the following commands to move the Spark software files to the respective directory
(/usr/local/spark).

$ su -
Password:

# cd /home/Hadoop/Downloads/
# mv spark-1.3.1-bin-hadoop2.6 /usr/local/spark
# exit

Setting up the environment for Spark


Add the following line to the ~/.bashrc file. This adds the location where the Spark software
files are located to the PATH variable.

export PATH=$PATH:/usr/local/spark/bin

Use the following command for sourcing the ~/.bashrc file.

$ source ~/.bashrc

Step 7: Verifying the Spark Installation

Write the following command for opening Spark shell.


$spark-shell

If spark is installed successfully then you will find the following output.

Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/06/04 15:25:22 INFO SecurityManager: Changing view acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: Changing modify acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: SecurityManager: authentication disabled; ui
acls disabled; users with view permissions: Set(hadoop); users with modify permissions:
Set(hadoop)
15/06/04 15:25:22 INFO HttpServer: Starting HTTP Server
15/06/04 15:25:23 INFO Utils: Successfully started service 'HTTP class server' on port
43292.
Welcome to

      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.4.0
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
Spark context available as sc
scala>

4. SPARK – CORE PROGRAMMING

Spark Core is the base of the whole project. It provides distributed task dispatching, scheduling,
and basic I/O functionalities. Spark uses a specialized fundamental data structure known as
RDD (Resilient Distributed Datasets) that is a logical collection of data partitioned across
machines. RDDs can be created in two ways; one is by referencing datasets in external storage
systems and second is by applying transformations (e.g. map, filter, reducer, join) on existing
RDDs.
The RDD abstraction is exposed through a language-integrated API. This simplifies
programming complexity because the way applications manipulate RDDs is similar to
manipulating local collections of data.

Spark Shell

Spark provides an interactive shell: a powerful tool to analyze data interactively. It is available
in either Scala or Python language. Spark’s primary abstraction is a distributed collection of
items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop Input
Formats (such as HDFS files) or by transforming other RDDs.

Open Spark Shell


The following command is used to open Spark shell.

$ spark-shell

Create simple RDD


Let us create a simple RDD from the text file. Use the following command to create a simple
RDD.

scala> val inputfile = sc.textFile(“input.txt”)

The output for the above command is

inputfile: org.apache.spark.rdd.RDD[String] = input.txt MappedRDD[1] at textFile at <console>:12

The Spark RDD API introduces few Transformations and few Actions to manipulate RDD.

RDD Transformations

RDD transformations return a pointer to a new RDD and allow you to create dependencies
between RDDs. Each RDD in the dependency chain (string of dependencies) has a function for
calculating its data and has a pointer (dependency) to its parent RDD.
Spark is lazy, so nothing will be executed unless you call an action that triggers job creation
and execution. Look at the following snippet of the word count example.
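A minimal sketch of this laziness in the spark-shell (assuming a local file named input.txt, as used later in this chapter): the flatMap, map, and reduceByKey calls only build the lineage, and nothing runs until the collect() action is invoked.

scala> val lines = sc.textFile("input.txt")                                       // transformation only, nothing executes yet
scala> val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)    // still lazy, just a chain of transformations
scala> counts.collect()                                                           // action: triggers job creation and execution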
Therefore, RDD transformation is not a set of data but is a step in a program (might be the only
step) telling Spark how to get data and what to do with it.
Given below is a list of RDD transformations.

S. No Transformations & Meaning

map(func)

1 Returns a new distributed dataset, formed by passing each element of the source
through a function func.

filter(func)

2 Returns a new dataset formed by selecting those elements of the source on which
func returns true.

flatMap(func)

3 Similar to map, but each input item can be mapped to 0 or more output items (so
func should return a Seq rather than a single item).

mapPartitions(func)

4 Similar to map, but runs separately on each partition (block) of the RDD, so func
must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.

mapPartitionsWithIndex(func)

5 Similar to mapPartitions, but also provides func with an integer value
representing the index of the partition, so func must be of type (Int, Iterator<T>)
=> Iterator<U> when running on an RDD of type T.
sample(withReplacement, fraction, seed)

6 Sample a fraction of the data, with or without replacement, using a given random
number generator seed.

union(otherDataset)

7 Returns a new dataset that contains the union of the elements in the source
dataset and the argument.
intersection(otherDataset)

8 Returns a new RDD that contains the intersection of elements in the source dataset
and the argument.
distinct([numTasks]))

9 Returns a new dataset that contains the distinct elements of the source dataset.
groupByKey([numTasks])

When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable&lt;V&gt;) pairs.
10 Note: If you are grouping in order to perform an aggregation (such as a sum or
average) over each key, using reduceByKey or aggregateByKey will yield much
better performance.

reduceByKey(func, [numTasks])

11 When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where
the values for each key are aggregated using the given reduce function func,
which must be of type (V, V) => V. Like in groupByKey, the number of reduce
tasks is configurable through an optional second argument.

aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])

12 When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where
the values for each key are aggregated using the given combine functions and a
neutral "zero" value. Allows an aggregated value type that is different from the
input value type, while avoiding unnecessary allocations. Like in groupByKey,
the number of reduce tasks is configurable through an optional second argument.

sortByKey([ascending], [numTasks])

13 When called on a dataset of (K, V) pairs where K implements Ordered, returns a


dataset of (K, V) pairs sorted by keys in ascending or descending order, as
specified in the Boolean ascending argument.

14 join(otherDataset, [numTasks])

When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V,
W)) pairs with all pairs of elements for each key. Outer joins are supported
through leftOuterJoin, rightOuterJoin, and fullOuterJoin.

cogroup(otherDataset, [numTasks])

15 When called on datasets of type (K, V) and (K, W), returns a dataset of (K,
(Iterable&lt;V&gt;, Iterable&lt;W&gt;)) tuples. This operation is also called groupWith.

cartesian(otherDataset)

16 When called on datasets of types T and U, returns a dataset of (T, U) pairs (all
pairs of elements).

pipe(command, [envVars])

17 Pipe each partition of the RDD through a shell command, e.g. a Perl or bash
script. RDD elements are written to the process's stdin and lines output to its
stdout are returned as an RDD of strings.

coalesce(numPartitions)

18 Decrease the number of partitions in the RDD to numPartitions. Useful for


running operations more efficiently after filtering down a large dataset.

repartition(numPartitions)

19 Reshuffle the data in the RDD randomly to create either more or fewer partitions
and balance it across them. This always shuffles all data over the network.

repartitionAndSortWithinPartitions(partitioner)

20 Repartition the RDD according to the given partitioner and, within each resulting
partition, sort records by their keys. This is more efficient than calling repartition
and then sorting within each partition because it can push the sorting down into
the shuffle machinery.

Actions

The following table gives a list of Actions, which return values.

S.No Action & Meaning

reduce(func)

1 Aggregate the elements of the dataset using a function func (which takes two
arguments and returns one). The function should be commutative and associative
so that it can be computed correctly in parallel.

collect()

2 Returns all the elements of the dataset as an array at the driver program. This is
usually useful after a filter or other operation that returns a sufficiently small subset
of the data.

count()

3 Returns the number of elements in the dataset.

first()

4 Returns the first element of the dataset (similar to take (1)).

take(n)

5 Returns an array with the first n elements of the dataset.

takeSample (withReplacement,num, [seed])

6 Returns an array with a random sample of num elements of the dataset, with or
without replacement, optionally pre-specifying a random number generator seed.

takeOrdered(n, [ordering])

7 Returns the first n elements of the RDD using either their natural order or a custom
comparator.

saveAsTextFile(path)

8 Writes the elements of the dataset as a text file (or set of text files) in a given
directory in the local filesystem, HDFS or any other Hadoop-supported file system.
Spark calls toString on each element to convert it to a line of text
in the file.

saveAsSequenceFile(path) (Java and Scala)


Writes the elements of the dataset as a Hadoop SequenceFile in a given path in the
local filesystem, HDFS or any other Hadoop-supported file system. This is
9 available on RDDs of key-value pairs that implement Hadoop's Writable interface.
In Scala, it is also available on types that are implicitly convertible to Writable
(Spark includes conversions for basic types like Int, Double, String, etc).

saveAsObjectFile(path) (Java and Scala)

10 Writes the elements of the dataset in a simple format using Java serialization,
which can then be loaded using SparkContext.objectFile().

countByKey()

11 Only available on RDDs of type (K, V). Returns a hashmap of (K, Int) pairs with
the count of each key.

foreach(func)

Runs a function func on each element of the dataset. This is usually done for side
effects such as updating an Accumulator or interacting with external storage
12 systems.

Note: modifying variables other than Accumulators outside


of the foreach() may result in undefined behavior. See Understanding
closures for more details.
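A small spark-shell sketch of the closure pitfall mentioned in the note above (illustrative only): updating a plain driver-side variable inside foreach is unreliable, because each task works on its own copy of the closure; an Accumulator is the supported way to do this.

scala> var counter = 0
scala> sc.parallelize(1 to 10).foreach(x => counter += x)   // each task updates its own copy of counter
scala> counter                                              // may still be 0 in cluster mode; not a reliable sum
scala> val acc = sc.accumulator(0)
scala> sc.parallelize(1 to 10).foreach(x => acc += x)       // accumulator updates are merged back to the driver
scala> acc.value                                            // 55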


Programming with RDD

Let us see the implementations of few RDD transformations and actions in RDD programming
with the help of an example.

Example
Consider a word count example: it counts each word appearing in a document. Consider the
following text as an input, saved as an input.txt file in a home directory.
input.txt (input file):
people are not as beautiful as they look, as they walk or as they talk.
they are only as beautiful as they love,
as they care as they share.

Follow the procedure given below to execute the given example.

Open Spark-Shell
The following command is used to open spark shell. Generally, spark is built using Scala.
Therefore, a Spark program runs on Scala environment.

$ spark-shell

If the Spark shell opens successfully then you will find the following output. The last line of
the output, "Spark context available as sc", means the Spark shell has automatically created a
SparkContext object with the name sc. Before starting the first step of a program, the
SparkContext object should be created.

Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/06/04 15:25:22 INFO SecurityManager: Changing view acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: Changing modify acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: SecurityManager: authentication disabled; ui
acls disabled; users with view permissions: Set(hadoop); users with modify permissions:
Set(hadoop)
15/06/04 15:25:22 INFO HttpServer: Starting HTTP Server
15/06/04 15:25:23 INFO Utils: Successfully started service 'HTTP class server' on port
43292.
Welcome to

      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.2.0
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
Spark context available as sc
scala>

Create an RDD
First, we have to read the input file using Spark-Scala API and create an RDD.
The following command is used for reading a file from given location. Here, new RDD is
created with the name of inputfile. The String which is given as an argument in the textFile(“”)
method is absolute path for the input file name. However, if only the file name is given, then it
means that the input file is in the current location.

scala> val inputfile = sc.textFile("input.txt")

Execute Word count Transformation


Our aim is to count the words in a file. Create a flat map for splitting each line into words
(flatMap(line => line.split(“ ”)).
Next, read each word as a key with a value ‘1’ (<key, value> = <word,1>)using map function
(map(word => (word, 1)).
Finally, reduce those keys by adding values of similar keys (reduceByKey(_+_)).
The following command is used for executing the word count logic. After executing this, you will
not find any output because this is not an action, it is a transformation: it points to a new RDD
and tells Spark what to do with the given data.

scala> val counts = inputfile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_+_);

Current RDD
While working with the RDD, if you want to know about current RDD, then use the following
command. It will show you the description about current RDD and its dependencies for
debugging.

scala> counts.toDebugString

Caching the Transformations


You can mark an RDD to be persisted using the persist() or cache() methods on it. The first
time it is computed in an action, it will be kept in memory on the nodes. Use the following
command to store the intermediate transformations in memory.
scala> counts.cache()

Applying the Action


Applying an action, like saveAsTextFile, stores the result of all the transformations into a text file.
The String argument for the saveAsTextFile(" ") method is the absolute path of the output folder.
Try the following command to save the output in a text file. In the following example, the 'output'
folder is in the current location.

scala> counts.saveAsTextFile("output")

Checking the Output


Open another terminal to go to home directory (where spark is executed in the other terminal).
Use the following commands for checking output directory.

[hadoop@localhost ~]$ cd output/
[hadoop@localhost output]$ ls -1
part-00000
part-00001
_SUCCESS

The following command is used to see output from Part-00000 files.

[hadoop@localhost output]$ cat part-00000

Output
(people,1)
(are,2)
(not,1)
(as,8)
(beautiful,2)
(they, 7)
(look,1)

The following command is used to see output from Part-00001 files.

[hadoop@localhost output]$ cat part-00001


Output

(walk, 1)
(or, 1)
(talk, 1)
(only, 1)
(love, 1)
(care, 1)
(share, 1)

Un-persist the Storage

Before UN-persisting, if you want to see the storage space that is used for this application, then
use the following URL in your browser.

http://localhost:4040

You will see the following screen, which shows the storage space used for the application,
which are running on the Spark shell.

If you want to UN-persist the storage space of particular RDD, then use the following
command.

Scala> counts.unpersist()

You will see the output as follows:

15/06/27 00:57:33 INFO ShuffledRDD: Removing RDD 9 from persistence list


15/06/27 00:57:33 INFO BlockManager: Removing RDD 9
15/06/27 00:57:33 INFO BlockManager: Removing block rdd_9_1
15/06/27 00:57:33 INFO MemoryStore: Block rdd_9_1 of size 480 dropped from memory
(free 280061810)
15/06/27 00:57:33 INFO BlockManager: Removing block rdd_9_0
15/06/27 00:57:33 INFO MemoryStore: Block rdd_9_0 of size 296 dropped from memory
(free 280062106)
res7: cou.type = ShuffledRDD[9] at reduceByKey at <console>:14

For verifying the storage space in the browser, use the following URL.
http://localhost:4040

You will see the following screen. It shows the storage space used for the application, which
are running on the Spark shell.
5. SPARK – DEPLOYMENT

spark-submit is a shell command used to deploy a Spark application on a cluster. It uses all
respective cluster managers through a uniform interface. Therefore, you do not have to
configure your application for each one.

Example
Let us take the same example of word count, we used before, using shell commands. Here, we
consider the same example as a spark application.

Sample Input
The following text is the input data and the file is named in.txt.

people are not as beautiful as they look, as they walk or as they talk.
they are only as beautiful as they love, as they care as they share.

Look at the following program:

SparkWordCount.scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark._

object SparkWordCount {
def main(args: Array[String]) {

val sc = new SparkContext("local", "Word Count", "/usr/local/spark", Nil, Map(), Map())

/* local = master URL; Word Count = application name; */

/* /usr/local/spark = Spark Home; Nil = jars; Map = environment */

/* Map = variables to work nodes */


/* creating an inputRDD to read text file (in.txt) through Spark context */
val input = sc.textFile("in.txt")

/* Transform the inputRDD into countRDD */
val count = input.flatMap(line => line.split(" "))
            .map(word => (word, 1))
            .reduceByKey(_ + _)

/* saveAsTextFile method is an action that acts on the RDD */
count.saveAsTextFile("outfile")
System.out.println("OK");
}
}
Save the above program into a file named SparkWordCount.scala and place it in a user-
defined directory named spark-application.

Note: While transforming the inputRDD into countRDD, we are using flatMap() for tokenizing
the lines (from text file) into words, map() method for counting the word frequency and
reduceByKey() method for counting each word repetition.
Use the following steps to submit this application. Execute all steps in the spark-application
directory through the terminal.
Step 1: Download Spark Jar
Spark core jar is required for compilation, therefore, download spark-core_2.10-1.3.0.jar from
the following link Spark core jar and move the jar file from download directory to spark-
application directory.

Step 2: Compile program


Compile the above program using the command given below. This command should be
executed from the spark-application directory. Here, /usr/local/spark/lib/spark-assembly-1.4.0-hadoop2.6.0.jar
is a Hadoop support jar taken from the Spark library.

$ scalac -classpath "spark-core_2.10-1.3.0.jar:/usr/local/spark/lib/spark-assembly-1.4.0-hadoop2.6.0.jar" SparkWordCount.scala

Step 3: Create a JAR


Create a jar file of the spark application using the following command. Here, wordcount is the
file name for jar file.

jar -cvf wordcount.jar SparkWordCount*.class spark-core_2.10-1.3.0.jar /usr/local/spark/lib/spark-assembly-1.4.0-hadoop2.6.0.jar

Step 4: Submit spark application


Submit the spark application using the following command:

spark-submit --class SparkWordCount --master local wordcount.jar

If it is executed successfully, then you will find the output given below. The OK in the
following output is for user identification; it is printed by the last line of the program. If you carefully
read the following output, you will find different things, such as:
• successfully started service 'sparkDriver' on port 42954
• MemoryStore started with capacity 267.3 MB
• Started SparkUI at http://192.168.1.217:4040
• Added JAR file:/home/hadoop/piapplication/count.jar
• ResultStage 1 (saveAsTextFile at SparkPi.scala:11) finished in 0.566 s
• Stopped Spark web UI at http://192.168.1.217:4040
• MemoryStore cleared
15/07/08 13:56:04 INFO Slf4jLogger: Slf4jLogger started
15/07/08 13:56:04 INFO Utils: Successfully started service 'sparkDriver' on port 42954.
15/07/08 13:56:04 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://[email protected]:42954]
15/07/08 13:56:04 INFO MemoryStore: MemoryStore started with capacity 267.3 MB
15/07/08 13:56:05 INFO HttpServer: Starting HTTP Server
15/07/08 13:56:05 INFO Utils: Successfully started service 'HTTP file server' on port
56707.
15/07/08 13:56:06 INFO SparkUI: Started SparkUI at http://192.168.1.217:4040
15/07/08 13:56:07 INFO SparkContext: Added JAR
file:/home/hadoop/piapplication/count.jar at
http://192.168.1.217:56707/jars/count.jar with timestamp 1436343967029
15/07/08 13:56:11 INFO Executor: Adding file:/tmp/spark-45a07b83-42ed-42b3b2c2-
823d8d99c5af/userFiles-df4f4c20-a368-4cdd-a2a7-39ed45eb30cf/count.jar to class loader
15/07/08 13:56:11 INFO HadoopRDD: Input split:
file:/home/hadoop/piapplication/in.txt:0+54
15/07/08 13:56:12 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2001 bytes result
sent to driver
(MapPartitionsRDD[5] at saveAsTextFile at SparkPi.scala:11), which is now runnable
15/07/08 13:56:12 INFO DAGScheduler: Submitting 1 missing tasks from
ResultStage 1 (MapPartitionsRDD[5] at saveAsTextFile at SparkPi.scala:11)
15/07/08 13:56:13 INFO DAGScheduler: ResultStage 1 (saveAsTextFile at
SparkPi.scala:11) finished in 0.566 s
15/07/08 13:56:13 INFO DAGScheduler: Job 0 finished: saveAsTextFile at
SparkPi.scala:11, took 2.892996 s
OK
15/07/08 13:56:13 INFO SparkContext: Invoking stop() from shutdown hook
15/07/08 13:56:13 INFO SparkUI: Stopped Spark web UI at http://192.168.1.217:4040
15/07/08 13:56:13 INFO DAGScheduler: Stopping DAGScheduler
15/07/08 13:56:14 INFO MapOutputTrackerMasterEndpoint:
MapOutputTrackerMasterEndpoint stopped!
15/07/08 13:56:14 INFO Utils: path = /tmp/spark-45a07b83-42ed-42b3-
b2c2823d8d99c5af/blockmgr-ccdda9e3-24f6-491b-b509-3d15a9e05818, already present as
root for deletion.
15/07/08 13:56:14 INFO MemoryStore: MemoryStore cleared
15/07/08 13:56:14 INFO BlockManager: BlockManager stopped
15/07/08 13:56:14 INFO BlockManagerMaster: BlockManagerMaster stopped
15/07/08 13:56:14 INFO SparkContext: Successfully stopped SparkContext
15/07/08 13:56:14 INFO Utils: Shutdown hook called
15/07/08 13:56:14 INFO Utils: Deleting directory /tmp/spark-45a07b83-42ed-42b3b2c2-
823d8d99c5af
15/07/08 13:56:14 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:
OutputCommitCoordinator stopped!

Step 5: Checking output


After successful execution of the program, you will find the directory named outfile in the
spark-application directory.
The following commands are used for opening and checking the list of files in the outfile
directory.

$ cd outfile
$ ls
Part-00000 part-00001 _SUCCESS

The commands for checking output in part-00000 file are:

$ cat part-00000
(people,1)
(are,2)
(not,1)
(as,8)
(beautiful,2)
(they, 7)
(look,1)

The commands for checking output in part-00001 file are:


$ cat part-00001
(walk, 1)
(or, 1)
(talk, 1)
(only, 1)
(love, 1)
(care, 1)
(share, 1)
Go through the following section to know more about the ‘spark-submit’ command.

Spark-submit Syntax

spark-submit [options] <app jar | python file> [app arguments]

Options
The table given below describes a list of options:-
S.No Option Description

1 --master spark://host:port, mesos://host:port, yarn, or local.

2 --deploy-mode Whether to launch the driver program locally ("client") or


on one of the worker machines inside the cluster
("cluster") (Default: client).

3 --class Your application's main class (for Java / Scala apps).

4 --name A name of your application.

5 --jars Comma-separated list of local jars to include on the driver


and executor classpaths.

6 --packages Comma-separated list of maven coordinates of jars to


include on the driver and executor classpaths.

7 --repositories Comma-separated list of additional remote repositories to


search for the maven coordinates given with --packages.
Apache Spark

8 --py-files Comma-separated list of .zip, .egg, or .py files to place on


the PYTHON PATH for Python apps.

9 --files Comma-separated list of files to be placed in the working


directory of each executor.

10 --conf (prop=val) Arbitrary Spark configuration property.

11 --properties-file Path to a file from which to load extra properties. If not


specified, this will look for conf/spark-defaults.

12 --driver-memory Memory for driver (e.g. 1000M, 2G) (Default: 512M).

13 --driver-java-options Extra Java options to pass to the driver.

14 --driver-library-path Extra library path entries to pass to the driver.

15 --driver-class-path Extra class path entries to pass to the driver.


Note that jars added with --jars are automatically included
in the classpath.
16 --executor-memory Memory per executor (e.g. 1000M, 2G) (Default: 1G).

17 --proxy-user User to impersonate when submitting the application.

18 --help, -h Show this help message and exit.

19 --verbose, -v Print additional debug output.

20 --version Print the version of current Spark.

21 --driver-cores NUM Cores for driver (Default: 1).

22 --supervise If given, restarts the driver on failure.

23 --kill If given, kills the driver specified.

24 --status If given, requests the status of the driver specified.


Apache Spark

25 --total-executor-cores Total cores for all executors.

26 --executor-cores Number of cores per executor. (Default: 1 in YARN


mode, or all available cores on the worker in standalone
mode).
6. ADVANCED SPARK PROGRAMMING

Spark contains two different types of shared variables- one is broadcast variables and
second is accumulators.
• Broadcast variables: used to efficiently, distribute large values.
• Accumulators: used to aggregate the information of particular collection.

Broadcast Variables

Broadcast variables allow the programmer to keep a read-only variable cached on each
machine rather than shipping a copy of it with tasks. They can be used, for example, to
give every node, a copy of a large input dataset, in an efficient manner. Spark also
attempts to distribute broadcast variables using efficient broadcast algorithms to reduce
communication cost.
Spark actions are executed through a set of stages, separated by distributed “shuffle”
operations. Spark automatically broadcasts the common data needed by tasks within
each stage.
The data broadcasted this way is cached in serialized form and is deserialized before
running each task. This means that explicitly creating broadcast variables, is only
useful when tasks across multiple stages need the same data or when caching the data
in deserialized form is important.
Broadcast variables are created from a variable v by calling SparkContext.broadcast(v).
The broadcast variable is a wrapper around v, and its value can be accessed by calling the
value method. The code given below shows this:

scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))

Output:

broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)

After the broadcast variable is created, it should be used instead of the value v in any
functions run on the cluster, so that v is not shipped to the nodes more than once. In
addition, the object v should not be modified after its broadcast, in order to ensure that
all nodes get the same value of the broadcast variable.
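To make the point about using .value inside functions run on the cluster concrete, here is a minimal spark-shell sketch (the RDD contents and the expected result are illustrative):

scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
scala> val rdd = sc.parallelize(Seq(10, 20, 30))
scala> rdd.map(x => x + broadcastVar.value.sum).collect()   // each task reads the cached copy via .value
// expected result: Array(16, 26, 36)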

Accumulators
Accumulators are variables that are only “added” to through an associative operation
and can therefore, be efficiently supported in parallel. They can be used to implement
counters (as in MapReduce) or sums. Spark natively supports accumulators of numeric
types, and programmers can add support for new types. If accumulators are created
with a name, they will be displayed in Spark’s UI. This can be useful for understanding
the progress of running stages (NOTE: this is not yet supported in Python).

An accumulator is created from an initial value v by calling SparkContext.accumulator(v).
Tasks running on the cluster can then add to it using
the add method or the += operator (in Scala and Python). However, they cannot read
its value. Only the driver program can read the accumulator’s value, using its value
method.
The code given below shows an accumulator being used to add up the elements of an
array:

scala> val accum = sc.accumulator(0)

scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)

If you want to see the output of above code then use the following command:

scala> accum.value

Output

res2: Int = 10

Numeric RDD Operations

Spark allows you to do different operations on numeric data, using one of the
predefined API methods. Spark’s numeric operations are implemented with a
streaming algorithm that allows building the model, one element at a time.
These operations are computed and returned as a StatCounter object by calling the
stats() method.
The following is a list of numeric methods available in StatCounter.
S.No Method & Meaning
1 count()
Number of elements in the RDD.

2 mean()
Average of the elements in the RDD.

3 sum()
Total value of the elements in the RDD.

4 max()
Maximum value among all elements in the RDD.

5 min()
Minimum value among all elements in the RDD.

6 variance()
Variance of the elements.

7 stdev()
Standard deviation.

If you want to use only one of these methods, you can call the corresponding method
directly on RDD.
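A minimal spark-shell sketch of these numeric operations (the values are illustrative):

scala> val nums = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0))
scala> val stat = nums.stats()      // one pass over the data, returns a StatCounter
scala> stat.mean                    // 2.5
scala> nums.mean()                  // or call a single method directly on the RDD
scala> nums.stdev()                 // population standard deviation of the elements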

Experiment Number 7
Visualization using Tableau:
Tableau: Tool Overview, Importing Data, Analyzing with Charts

Analysis operations
Use Data Set: Global Super store

Q1. Find the customer with the highest overall profit. What is his/her profit
ratio?
Ans:
Step 1: Open the superstoreus2015 excel data set

Step 2: Drag Orders sheet to sheet area


Step 3: Go to sheet 1 and add Customer name as rows and profit as
column

Step 4: Sort the data by clicking on Profit label on bottom



Step 5: To calculate Profit Ratio:
Profit Ratio = (Sum([Profit])/Sum([Sales]))
This formula needs to be entered as a tooltip or label.
Click on Analysis > Create Calculated Field, enter the formula, then click Apply and OK.

You can see Calculation1 in measures.



Drag it to the Marks area.



Final answer is:



Q2. Which state has the highest Sales (Sum)? What is the total Sales for that
state?
Solution -

Steps – Drag and drop state as row and sales as column


Sort the data by clicking on sales at bottom

Q3. Which customer segment has both the highest order quantity and average
discount rate? What is the order quantity and average discount rate for that
segment?
Solution -

Step 1 – drag and drop segment as column


Step 2 – drag and drop measures value in marks

Step 3 – click on Measure Values -> Edit Filter. In the filter, select Discount and
Quantity -> click Apply and OK.

Step 4 – drag and drop measures value as row



Q4. Which Product Category has the highest total Sales? Which Product
Category has the worst Profit? Name the Product Category and $ amount for
each.

Ans:
a. Bar Chart displaying total Sales for each Product Category

b. Add a color scale indicating Profit



c. Each Product Category labeled with total Sales and Each Product
Category labeled with Profit

Q5. Use the same visualization created for Question #4. What was the Profit on
Technology (Product Category) in Boca Raton (City)?

Add Filter City



Select the city to show a single bar for the city



Apply a filter for Technology

Final output:

Q6. Which Product Department has the highest Shipping Costs? Name the
Department and cost.
Solution -

Steps – drag and drop in marks area


a. Product category and set it as label
b. Sales and set it as size
c. Product category and set it as color
d. Shipping cost and set it as tooltip

Final output:

Q7. Use the same visualization created for Question #6. What was the shipping
cost of Office Supplies for Xerox 1905 in the Home Customer Segment in
Cambridge?
Solution -

Step 1- add filter category to filter area (select office supplies)



Step 2 – drag and drop city in marks area

Step 3 – add filter city to filters area (select Cambridge)



Step 4 – add filter segment to filters area (select home office)



Step 5 – drag and drop segment(select home office) and product name(select
Xerox 1936) in marks area

Final output -

Preparing Maps

Q1. Prepare a Geographic map to show sales in each state.


Solution –

Step1: open global superstore 2016 dataset

Step 2 – join the sheets (Orders and People). First drag and drop the Orders table, then
click on the arrow of Orders and select Open. Then drag and drop the People table.
Step 3 - Create a Geographic Hierarchy
1. In the Data pane, right-click the geographic field, Country, and
then select Hierarchy > Create Hierarchy.

2. In the Create Hierarchy dialog box that opens, give the


hierarchy a name, such as Mapping Items, and then click OK.
At the bottom of the Dimensions section, the Mapping Items
hierarchy is created with the Country field.

3. In the Data pane, drag the State field to the hierarchy and place
it below the Country field.

4. Repeat step 3 for the City and Postal Code fields.

Step 4 - Build a basic map


In the Data pane, double-click Country.

Step 5 - On the Marks card, click the + icon on the Country field.
Step 6 – To Add color - From Measures, drag Sales to Color on the Marks
card.

Step 7 - Add labels


1. From Measures, drag Sales to Label on the Marks card.
Each state is labelled with a sum of sales. The numbers need a little bit of
formatting, however.
2. In the Data pane, right-click Sales and select Default Properties > Number
Format.
3. In the Default Number Format dialog box that opens, select Number
(Custom), and then do the following:
o For Decimal Places, enter 0.
o For Units, select Thousands (K).
o Click OK.

Final output -

Q2. Show Profit Ratio of each state as tooltip on map

Steps –
1) Use same visualization created for q1
2)To calculate profit Ratio
Profit Ratio= (Sum([Profit])/Sum([Sales]))
Click on Analysis>Create Calculated Field and enter the formula, click
apply and ok.
Drag and drop this calculation on label in marks area

Final output -

Q3. Show Profit ratio for Grip Envelop products


Solution –

Steps –
1) use the same visualization created in q2
2) drag and drop product name in marks area
3) add filter product name in filters area (select grip seal Envelope
product name in opened window and click ok)

Final output –

Q4. In the Technology product category, which unprofitable state is surrounded by only profitable states?
Solution –

Step 1 - In the Data pane, double-click Country.


On the Marks card, click the + icon on the Country field.
Step 2- Drag the product category on the filter shelf and select in technology.

Step 3 - Now Drag the profit measure to the color mark in the marks area.
Also drag profit to label in the marks area.

Final output -

Q5. Which state has the worst Gross Profit Ratio on Envelopes in the
Corporate Customer Segment that were Shipped in 2015?
Solution -

Step 1 - In the Data pane, double-click Country. On the Marks card, click the
+ icon on the Country field.
Drag sub-category, segment, order date to marks area
Drag calculated profit ratio to label and tooltip area

Step 2 - Drag the order date on the filter shelf and select the year 2015.
Step 3 - Drag the segment on the filter shelf and select Corporate.
Step 4 - Drag the category on the filter shelf and select Envelopes.

Final output -

Preparing Reports

1) Prepare a report showing product category wise sales


Solution –

Steps – Drag and drop sub-category as rows.


Drag and drop sales in marks area and make it as text

Final output -
2) Report showing regionwise, product wise sales
Solution –

Steps – Drag and drop region as rows and sub category as columns
Drag and drop sales in marks area and make it as text

Final output -

3) Report showing state wise sales


Solution –

Steps – Drag and drop state as rows.


Drag and drop sales in marks area and make it as text
Final output -

4) What is the percent of total Sales for the ‘Home Office’ Customer
Segment in July of 2014?
Solution –
Step 1 –
Drag and segment as rows.
Drag and drop sales in marks area and make it as text

Step 2 – add an Order Date filter. In the window that appears, select Month / Year -> Next -> select
July 2014.

Step 3 – click on arrow on sales in marks area -> quick table calculation ->
percentage of total

Final Output -
5) Find the top 10 Product Names by Sales within each region. Which
product is ranked #2 in both the Central & West regions in 2015?
Solution –

Step 1 -Drag “Product Name” dimension from data pane window to Row
Shelf, region to column shelf and then add an “order Date” on Filter shelf and
select “Year” of Order date as 2015

Step 2 - put region on Filter shelf and select “Central” and “West” checkbox

Step 3 - Drag a Sales measure to the text label in the marks area.
Add the “Product name” on the Filter shelf. Once the Filter Pop up is open,
Select “TOP” tab >By Field > Top 10 by Sum (Sales).

Right click on the aggregated Sales measure and click on the arrow sign then
select Quick Table Calculation > Rank.
As the default addressing is Table across, please change it into Table Down
(Compute using -> Table Down).
Final output -
