C21053 Jay Vijay Karwatkar-Big Data Analytics & Visualization
HDFS: List of Commands (mkdir, touchz, copyFromLocal/put, copyToLocal/get, moveFromLocal, cp, rm -r,
du, dus, stat)
Hadoop Practical
To use the HDFS commands, first start the Hadoop services with this
command:
start-all.sh
To check that the Hadoop services are up and running, use this command: jps
Commands:
2. ls : This command lists all the files in a directory. It is useful when we want to see the hierarchy
of a folder.
Syntax : hdfs (space) dfs (space) -ls (space) /<path>
It prints all the directories present at the given HDFS path. The bin directory contains the executables,
so bin/hdfs means we want the hdfs executable, specifically its dfs (Distributed
File System) commands.
6. copyToLocal (or) get: To copy files/folders from HDFS storage to the local file system.
Syntax: hdfs (space) dfs (space) -copyToLocal or -get (space) <source file (on
hdfs)> (space) <local file destination>
Copying the mrudula.txt file from the Hadoop file system to the local system,
from the C21004_Hadoop directory.
10. rm -r: This command deletes a file or directory from HDFS recursively. It is a very useful
command when you need to delete a non-empty directory.
Syntax : hdfs (space) dfs (space) -rm (space) -r (space) <file name / directory
name>
Creating another directory C21004_Hadoop2, listing the directory names using the ls
command, and then deleting the directory which was just created
(C21004_Hadoop2).
12. -du -s: This command gives the total size of a directory/file.
Syntax: hdfs (space) dfs (space) -du (space) -s (space) <file / directory name>
13. stat: It gives the last modified time of a directory or path. In short, it gives the
stats of the directory or file.
Syntax : hdfs (space) dfs (space) -stat (space) <HDFS file name or directory name>
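The same operations are also available programmatically through the Hadoop Java FileSystem API. The sketch below is only illustrative; the fs.defaultFS value and all paths are assumptions and not part of the practical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCommandsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");          // assumed NameNode URI
        FileSystem fs = FileSystem.get(conf);
        fs.mkdirs(new Path("/C21053_Hadoop"));                      // hdfs dfs -mkdir
        fs.copyFromLocalFile(new Path("/home/cloudera/sample.txt"), // hdfs dfs -put
                new Path("/C21053_Hadoop/sample.txt"));
        fs.copyToLocalFile(new Path("/C21053_Hadoop/sample.txt"),   // hdfs dfs -get
                new Path("/home/cloudera/sample_copy.txt"));
        // hdfs dfs -ls and -stat equivalents
        for (FileStatus status : fs.listStatus(new Path("/C21053_Hadoop"))) {
            System.out.println(status.getPath() + "  " + status.getLen()
                    + " bytes, modified " + status.getModificationTime());
        }
        fs.delete(new Path("/C21053_Hadoop"), true);                // hdfs dfs -rm -r
        fs.close();
    }
}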
Experiment Number 2
Map Reduce:
1. Write a program in Map Reduce for WordCount operation.
2. Write a program in Map Reduce for Union operation.
3. Write a program in Map Reduce for Intersection operation.
4. Write a program in Map Reduce for Grouping and Aggregation.
5. Write a program in Map Reduce for Matrix Multiplication
1. Write the following Map Reduce program to understand Map Reduce Paradigm.
(Create your own .txt file)
a. WordCount
PROGRAM:
WordDriver.java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordDriver {
    public static void main(String[] args) throws Exception {
        // Create the job and name it
        Job job = Job.getInstance(new Configuration(), "Word Count");
        job.setJarByClass(WordDriver.class);
        // Input and output paths are taken from the command line
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Mapper, reducer and key/value types
        job.setMapperClass(WordMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
WordMapper.java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
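The mapper body is not reproduced above; a minimal sketch of the usual WordCount mapper follows (tokenization by whitespace is an assumption):

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit (word, 1) for every whitespace-separated token in the input line
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, one);
            }
        }
    }
}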
SumReducer.java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
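The reducer body is likewise omitted above; a minimal sketch that sums the counts emitted for each word:

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Add up all the 1s emitted by the mapper for this word
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}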
PROGRAM:
WordDriver.java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(WordDriver.class);
        job.setJobName("Word Count");
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(WordMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(FloatWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FloatWritable.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
WordMapper.java
import java.io.IOException;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
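The mapper source for this program is not included above. A possible version is sketched below; the assumption that each input line holds a key and a numeric value separated by whitespace is mine, not taken from the original:

public class WordMapper extends Mapper<LongWritable, Text, Text, FloatWritable> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assumed input format: "<key> <numeric value>" per line
        String[] parts = value.toString().trim().split("\\s+");
        if (parts.length == 2) {
            context.write(new Text(parts[0]), new FloatWritable(Float.parseFloat(parts[1])));
        }
    }
}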
SumReducer.java
import java.io.IOException;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class SumReducer extends Reducer<Text, FloatWritable, Text, FloatWritable> {
    private final FloatWritable result = new FloatWritable();
    @Override
    public void reduce(Text key, Iterable<FloatWritable> values, Context context)
            throws IOException, InterruptedException {
        float sum = 0; int count = 0;
        for (FloatWritable value : values) { sum += value.get(); count += 1; }
        float average = sum / count;
        result.set(average);
        context.write(key, result);
    }
}
PROGRAM:
ValueSortExp.java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class ValueSortExp {
    public static void main(String[] args) throws Exception {
        // Create configuration
        Configuration conf = new Configuration(true);
        // Create job
        Job job = new Job(conf, "Test HIVE Command");
        job.setJarByClass(ValueSortExp.class);
        // Setup MapReduce: a single reducer so the output is globally sorted
        job.setMapperClass(ValueSortExp.MapTask.class);
        job.setReducerClass(ValueSortExp.ReduceTask.class);
        job.setNumReduceTasks(1);
        // Key/value types (assumed: the map output key is the numeric value being sorted)
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input
        FileInputFormat.addInputPath(job, new Path(args[0]));
        job.setInputFormatClass(TextInputFormat.class);
        // Output
        Path outputDir = new Path(args[1]);
        FileOutputFormat.setOutputPath(job, outputDir);
        job.setOutputFormatClass(TextOutputFormat.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
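The MapTask and ReduceTask inner classes are not reproduced above. The sketch below shows one possible pair, to be placed inside the ValueSortExp class; it assumes input lines of the form key,intValue, relies on the shuffle sorting the map output keys, and matches the key/value type setters shown in the driver. An import of java.io.IOException would also be required.

public static class MapTask extends Mapper<LongWritable, Text, IntWritable, Text> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit (value, key) so that the framework sorts records by the numeric value
        String[] parts = value.toString().split(",");
        context.write(new IntWritable(Integer.parseInt(parts[1])), new Text(parts[0]));
    }
}
public static class ReduceTask extends Reducer<IntWritable, Text, Text, IntWritable> {
    @Override
    public void reduce(IntWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Write every original key back out with its (now globally sorted) value
        for (Text v : values) {
            context.write(v, key);
        }
    }
}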
OUTPUT:
3. Implement matrix multiplication with Hadoop Map Reduce
PROGRAM:
MatrixMultiply.java
import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class MatrixMultiply {
    @SuppressWarnings("deprecation")
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Matrix dimensions read by the mapper; the values below are placeholders
        conf.set("m", "2");  // rows of A
        conf.set("n", "2");  // columns of A = rows of B
        conf.set("p", "2");  // columns of B
        Job job = new Job(conf, "MatrixMultiply");
        job.setJarByClass(MatrixMultiply.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}
Map.java
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
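The mapper body is not shown above; a sketch follows. The input line format A,row,col,value / B,row,col,value and the dimension properties m and p read from the Configuration are assumptions (they match the placeholder conf.set() calls in the driver above):

public class Map extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        int m = Integer.parseInt(conf.get("m"));   // rows of A
        int p = Integer.parseInt(conf.get("p"));   // columns of B
        String[] t = value.toString().split(",");  // assumed: matrix,row,col,value
        if (t[0].equals("A")) {
            int i = Integer.parseInt(t[1]);
            int k = Integer.parseInt(t[2]);
            // Element A[i][k] contributes to every cell (i, j) of the result
            for (int j = 0; j < p; j++) {
                context.write(new Text(i + "," + j), new Text("A," + k + "," + t[3]));
            }
        } else {
            int k = Integer.parseInt(t[1]);
            int j = Integer.parseInt(t[2]);
            // Element B[k][j] contributes to every cell (i, j) of the result
            for (int i = 0; i < m; i++) {
                context.write(new Text(i + "," + j), new Text("B," + k + "," + t[3]));
            }
        }
    }
}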
Reduce.java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
import java.util.HashMap;
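The reducer body is also omitted; a sketch that multiplies matching A and B entries and sums over the shared index k (writing only non-zero cells is an assumption):

public class Reduce extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Collect the A and B entries for this output cell, indexed by k
        HashMap<Integer, Float> a = new HashMap<Integer, Float>();
        HashMap<Integer, Float> b = new HashMap<Integer, Float>();
        for (Text v : values) {
            String[] t = v.toString().split(",");
            if (t[0].equals("A")) {
                a.put(Integer.parseInt(t[1]), Float.parseFloat(t[2]));
            } else {
                b.put(Integer.parseInt(t[1]), Float.parseFloat(t[2]));
            }
        }
        // result(i,j) = sum over k of A[i][k] * B[k][j]
        float result = 0.0f;
        for (Integer k : a.keySet()) {
            Float bVal = b.get(k);
            if (bVal != null) {
                result += a.get(k) * bVal;
            }
        }
        if (result != 0.0f) {
            context.write(key, new Text(Float.toString(result)));
        }
    }
}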
Experiment Number 3
MongoDB: Installation and Basic Queries
1. Download the MongoDB Community Server MSI installer from the MongoDB download center.
2. Click on Setup. Once the download is complete, open the MSI file. Click Next on the start-up screen.
3. Accept the End-User License Agreement and click Next.
4. Click on the "complete" button to install all of the components. The custom option can be used
to install selective components or if you want to change the location of the installation.
5. Service Configuration: Select "Run service as Network Service user" and click Next.
Q1. Create Database with name College using MongoDB command prompt.
Solution: The use command is used to create a database. The command will create
a new database if it doesn't exist; otherwise, it will return the existing database.
Q2. Create MongoDB collections employee and department under the database college.
Employee:
empcode INT, empfname STRING, emplname STRING, job STRING, manager STRING, hiredate STRING, salary INT, commission INT, deptcode INT
Dept table:
deptcode INT, deptname STRING, location STRING
Solution:
The createCollection() Method:
MongoDB's db.createCollection(name, options) is used to create a collection.
Syntax:
The basic syntax of the createCollection() command is as follows:
db.createCollection(name, options)
In the command, name is the name of the collection to be created. options is a document and is used to
specify the configuration of the collection.
Parameter   Type       Description
name        String     Name of the collection to be created
options     Document   (Optional) Specify options about memory size and indexing
The options parameter is optional, so you need to specify only the name of the collection.
Following is the list of options you can use:
Field         Type      Description
capped        Boolean   (Optional) If true, enables a capped collection. A capped collection is a fixed-size collection that automatically overwrites its oldest entries when it reaches its maximum size. If you specify true, you need to specify the size parameter as well.
autoIndexId   Boolean   (Optional) If true, automatically creates an index on the _id field. Default value is false.
size          number    (Optional) Specifies a maximum size in bytes for a capped collection. If capped is true, then you need to specify this field as well.
max           number    (Optional) Specifies the maximum number of documents allowed in the capped collection.
While inserting a document, MongoDB first checks the size field of the capped collection, then it checks the max field.
Note: In MongoDB, you don't need to create a collection explicitly. MongoDB creates the collection
automatically when you insert a document.
{
"acknowledged" : true,
"insertedIds" : [
ObjectId("6354e0c02e4df4ccf26c3ead"),
ObjectId("6354e0c02e4df4ccf26c3eae"),
ObjectId("6354e0c02e4df4ccf26c3eaf"),
ObjectId("6354e0c02e4df4ccf26c3eb0")
]
}
Employee Data:
db.employee.insert({"empcode":9369,"empfname":"Tony","emplname":"Stark","job":"Software Engineer","manager":7902,"hiredate":"1980-12-17","salary":2800,"commission":0,"deptcode":20})
db.employee.insertOne({"empcode":9499,"empfname":"Tim","emplname":"Adolf","job":"Salesman","manager":7698,"hiredate":"1981-02-20","salary":1600,"commission":300,"deptcode":30})
db.employee.insertMany([
{"empcode":9566,"empfname":"Kim","emplname":"Jarvis","job":"Manager","manager":7839,"hiredate":"1981-04-02","salary":3570,"commission":0,"deptcode":20},
{"empcode":9654,"empfname":"Sam","emplname":"Miles","job":"Salesman","manager":7698,"hiredate":"1981-09-28","salary":1250,"commission":1400,"deptcode":10},
{"empcode":9782,"empfname":"Kevin","emplname":"Hill","job":"Manager","manager":7839,"hiredate":"1981-06-09","salary":2940,"commission":0,"deptcode":10},
{"empcode":9788,"empfname":"Connie","emplname":"Smith","job":"Analyst","manager":7566,"hiredate":"1982-12-09","salary":3000,"commission":0,"deptcode":20}
])
db.employee.insertMany([
{"empcode":9839,"empfname":"Alfred","emplname":"Kinsley","job":"President","manager":7566,"hiredate":"1981-11-17","salary":5000,"commission":0,"deptcode":10},
{"empcode":9844,"empfname":"Paul","emplname":"Timothy","job":"Salesman","manager":7698,"hiredate":"1981-09-08","salary":1500,"commission":0,"deptcode":30},
{"empcode":9876,"empfname":"John","emplname":"Asghar","job":"Software Engineer","manager":7788,"hiredate":"1983-01-12","salary":3100,"commission":0,"deptcode":20},
{"empcode":9900,"empfname":"Rose","emplname":"Summers","job":"Technical Lead","manager":7698,"hiredate":"1981-12-03","salary":3300,"commission":0,"deptcode":20}
])
db.employee.insertMany([
{"empcode":9902,"empfname":"Andrew","emplname":"Faulkner","job":"Analyst","manager":7566,"hiredate":"1981-12-03","salary":3000,"commission":0,"deptcode":10},
{"empcode":9934,"empfname":"Karen","emplname":"Matthews","job":"Software Engineer","manager":7782,"hiredate":"1982-01-23","salary":3300,"commission":1400,"deptcode":20},
{"empcode":9591,"empfname":"Wendy","emplname":"Shawn","job":"Salesman","manager":7698,"hiredate":"1981-02-22","salary":500,"commission":0,"deptcode":30},
{"empcode":9698,"empfname":"Bella","emplname":"Swan","job":"Manager","manager":7839,"hiredate":"1982-05-01","salary":3420,"commission":0,"deptcode":30}
])
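For reference only, the same documents can also be inserted from a Java program using the MongoDB Java sync driver; this is a minimal sketch, and the connection string, client setup and class name are assumptions rather than part of the practical:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;

public class InsertEmployee {
    public static void main(String[] args) {
        // Connection string assumes a local server on the default port
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase college = client.getDatabase("college");            // "use college"
            MongoCollection<Document> employee = college.getCollection("employee");
            // Equivalent of db.employee.insert({...}) for a single document
            employee.insertOne(new Document("empcode", 9369)
                    .append("empfname", "Tony").append("emplname", "Stark")
                    .append("job", "Software Engineer").append("manager", 7902)
                    .append("hiredate", "1980-12-17").append("salary", 2800)
                    .append("commission", 0).append("deptcode", 20));
        }
    }
}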
Department Data:
Query: db.department.insertMany([
... {"deptcode":10,"deptname":"Finance","location":"Edinburgh"},
... {"deptcode":20,"deptname":"Software","location":"Paddington"},
... {"deptcode":30,"deptname":"Sales","location":"Maidstone"},
... {"deptcode":40,"deptname":"Marketing","location":"Darlington"},
... {"deptcode":50,"deptname":"Admin","location":"Birmingham"}
... ])
Output:
{
"acknowledged" : true,
"insertedIds" : [
ObjectId("6354e3672e4df4ccf26c3eb1"),
ObjectId("6354e3672e4df4ccf26c3eb2"),
ObjectId("6354e3672e4df4ccf26c3eb3"),
ObjectId("6354e3672e4df4ccf26c3eb4"),
ObjectId("6354e3672e4df4ccf26c3eb5")
]
}
Q4. List all the documents inside both the employee and department collections.
Solution:
The find() method: The find() method will display all the documents in a nonstructured way.
Syntax: db.collection_name.find()
The pretty() method: To display the results in a formatted way, you can use pretty() method.
Syntax: db.collection_name.find().pretty()
Displaying employee details:
Query:
db.employee.find().pretty()
Output:
{
"_id" : ObjectId("6354c7f22e4df4ccf26c3ea3"),
"empcode" : 9369,
"empfname" : "Tony",
"emplname" : "Stark",
"job" : "Software Engineer",
"manager" : 7902,
"hiredate" : "1980-12-17",
"salary" : 2800,
"commission" : 0,
"deptcode" : 20
}
{
"_id" : ObjectId("6354cb482e4df4ccf26c3ea4"),
"empcode" : 9499,
"empfname" : "Tim",
"emplname" : "Adolf",
"job" : "Salesman",
"manager" : 7698,
"hiredate" : "1981-02-20",
"salary" : 1600,
"commission" : 300,
"deptcode" : 30
}
{
"_id" : ObjectId("6354d1ce2e4df4ccf26c3ea5"),
"empcode" : 9566,
"empfname" : "Kim",
"emplname" : "Jarvis",
"job" : "Manager",
"manager" : 7839,
"hiredate" : "1981-04-02",
"salary" : 3570,
"commission" : 0,
"deptcode" : 20
}
{
"_id" : ObjectId("6354d1ce2e4df4ccf26c3ea6"),
"empcode" : 9654,
"empfname" : "Sam",
"emplname" : "Miles",
"job" : "Salesman",
"manager" : 7698,
"hiredate" : "1981-09-28",
"salary" : 1250,
"commission" : 1400,
"deptcode" : 10
}
{
"_id" : ObjectId("6354d1ce2e4df4ccf26c3ea7"),
"empcode" : 9782,
"empfname" : "Kevin",
"emplname" : "Hill",
"job" : "Manager",
"manager" : 7839,
"hiredate" : "1981-06-09",
"salary" : 2940,
"commission" : 0,
"deptcode" : 10
}
{
"_id" : ObjectId("6354d1ce2e4df4ccf26c3ea8"),
"empcode" : 9788,
"empfname" : "Connie",
"emplname" : "Smith",
"job" : "Analyst",
"manager" : 7566,
"hiredate" : "1982-12-09",
"salary" : 3000,
"commission" : 0,
"deptcode" : 20
}
{
"_id" : ObjectId("6354e0bf2e4df4ccf26c3ea9"),
"empcode" : 9839,
"empfname" : "Alfred",
"emplname" : "Kinsley",
"job" : "President",
"manager" : 7566,
"hiredate" : "1981-11-17",
"salary" : 5000,
"commission" : 0,
"deptcode" : 10
}
{
"_id" : ObjectId("6354e0bf2e4df4ccf26c3eaa"),
"empcode" : 9844,
"empfname" : "Paul",
"emplname" : "Timothy",
"job" : "Salesman",
"manager" : 7698,
"hiredate" : "1981-09-08",
"salary" : 1500,
"commission" : 0,
"deptcode" : 30
}
{
"_id" : ObjectId("6354e0bf2e4df4ccf26c3eab"),
"empcode" : 9876,
"empfname" : "John",
"emplname" : "Asghar",
"job" : "Software Engineer",
"manager" : 7788,
"hiredate" : "1983-01-12",
"salary" : 3100,
"commission" : 0,
"deptcode" : 20
}
{
"_id" : ObjectId("6354e0bf2e4df4ccf26c3eac"),
"empcode" : 9900,
"empfname" : "Rose",
"emplname" : "Summers",
"job" : "Technical Lead",
"manager" : 7698,
"hiredate" : "1981-12-03",
"salary" : 3300,
"commission" : 0,
"deptcode" : 20
}
{
"_id" : ObjectId("6354e0c02e4df4ccf26c3ead"),
"empcode" : 9902,
"empfname" : "Andrew",
"emplname" : "Faulkner",
"job" : "Analyst",
"manager" : 7566,
"hiredate" : "1981-12-03",
"salary" : 3000,
"commission" : 0,
"deptcode" : 10
}
{
"_id" : ObjectId("6354e0c02e4df4ccf26c3eae"),
"empcode" : 9934,
"empfname" : "Karen",
"emplname" : "Matthews",
"job" : "Software Engineer",
"manager" : 7782,
"hiredate" : "1982-01-23",
"salary" : 3300,
"commission" : 1400,
"deptcode" : 20
}
{
"_id" : ObjectId("6354e0c02e4df4ccf26c3eaf"),
"empcode" : 9591,
"empfname" : "Wendy",
"emplname" : "Shawn",
"job" : "Salesman",
"manager" : 7698,
"hiredate" : "1981-02-22",
"salary" : 500,
"commission" : 0,
"deptcode" : 30
}
{
"_id" : ObjectId("6354e0c02e4df4ccf26c3eb0"),
"empcode" : 9698,
"empfname" : "Bella",
"emplname" : "Swan",
"job" : "Manager",
"manager" : 7839,
"hiredate" : "1982-05-01",
"salary" : 3420,
"commission" : 0,
"deptcode" : 30
}
Displaying Department Details:
Query:
db.department.find().pretty()
Output:
{
"_id" : ObjectId("6354e3672e4df4ccf26c3eb1"),
"deptcode" : 10,
"deptname" : "Finance",
"location" : "Edinburgh"
}
{
"_id" : ObjectId("6354e3672e4df4ccf26c3eb2"),
"deptcode" : 20,
"deptname" : "Software",
"location" : "Paddington"
}
{
"_id" : ObjectId("6354e3672e4df4ccf26c3eb3"),
"deptcode" : 30,
"deptname" : "Sales",
"location" : "Maidstone"
}
{
"_id" : ObjectId("6354e3672e4df4ccf26c3eb4"),
"deptcode" : 40,
"deptname" : "Marketing",
"location" : "Darlington"
}
{
"_id" : ObjectId("6354e3672e4df4ccf26c3eb5"),
"deptcode" : 50,
"deptname" : "Admin",
"location" : "Birmingham"
}
Q5. List all the documents of employee collection WHERE job in ("Salesman ",
"Manager"):
Solution:
The $in operator is used to select documents in which the field's value equals any of the given
values in the array.
Syntax: { field: { $in: [<value 1>, <value 2>, ... <value N> ] } }
Query:
db.employee.find({job: {$in: ["Salesman", "Manager"]}}).pretty()
Output:
{
"_id" : ObjectId("6354cb482e4df4ccf26c3ea4"),
"empcode" : 9499,
"empfname" : "Tim",
"emplname" : "Adolf",
"job" : "Salesman",
"manager" : 7698,
"hiredate" : "1981-02-20",
"salary" : 1600,
"commission" : 300,
"deptcode" : 30
}
{
"_id" : ObjectId("6354d1ce2e4df4ccf26c3ea5"),
"empcode" : 9566,
"empfname" : "Kim",
"emplname" : "Jarvis",
"job" : "Manager",
"manager" : 7839,
"hiredate" : "1981-04-02",
"salary" : 3570,
"commission" : 0,
"deptcode" : 20
}
{
"_id" : ObjectId("6354d1ce2e4df4ccf26c3ea6"),
"empcode" : 9654,
"empfname" : "Sam",
"emplname" : "Miles",
"job" : "Salesman",
"manager" : 7698,
"hiredate" : "1981-09-28",
"salary" : 1250,
"commission" : 1400,
"deptcode" : 10
}
{
"_id" : ObjectId("6354d1ce2e4df4ccf26c3ea7"),
"empcode" : 9782,
"empfname" : "Kevin",
"emplname" : "Hill",
"job" : "Manager",
"manager" : 7839,
"hiredate" : "1981-06-09",
"salary" : 2940,
"commission" : 0,
"deptcode" : 10
}
{
"_id" : ObjectId("6354e0bf2e4df4ccf26c3eaa"),
"empcode" : 9844,
"empfname" : "Paul",
"emplname" : "Timothy",
"job" : "Salesman",
"manager" : 7698,
"hiredate" : "1981-09-08",
"salary" : 1500,
"commission" : 0,
"deptcode" : 30
}
{
"_id" : ObjectId("6354e0c02e4df4ccf26c3eaf"),
"empcode" : 9591,
"empfname" : "Wendy",
"emplname" : "Shawn",
"job" : "Salesman",
"manager" : 7698,
"hiredate" : "1981-02-22",
"salary" : 500,
"commission" : 0,
"deptcode" : 30
}
{
"_id" : ObjectId("6354e0c02e4df4ccf26c3eb0"),
"empcode" : 9698,
"empfname" : "Bella",
"emplname" : "Swan",
"job" : "Manager",
"manager" : 7839,
"hiredate" : "1982-05-01",
"salary" : 3420,
"commission" : 0,
"deptcode" : 30
}
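For reference only, the same $in filter can be issued from Java with the MongoDB Java sync driver; a minimal sketch (the connection string and class name are assumptions):

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.Document;

public class FindByJob {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> employee =
                    client.getDatabase("college").getCollection("employee");
            // Filters.in builds the same { job: { $in: [...] } } query document
            for (Document d : employee.find(Filters.in("job", "Salesman", "Manager"))) {
                System.out.println(d.toJson());
            }
        }
    }
}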
Q6. List all the documents of employee collection WHERE Job = " Analyst" And
Salary < 1500:
Solution:
To query documents based on the AND condition, you need to use the $and keyword. Following
is the basic syntax of AND: db.collection_name.find({ $and: [ {<key1>:<value1>}, { <key2>:<value2>} ] })
Query:
db.employee.find({$and:[{job:"Analyst"},{salary:{$lt:1500}}]}).pretty()
No records found.
Q7.List all the documents of employee collection WHERE job = " ANALYST" or
SALARY < 1500:
Solution:
To query documents based on the OR condition, you need to use the $or keyword.
Following is the basic syntax of OR:
db.collection_name.find(
{
$or: [{key1: value1}, {key2: value2}]
}).pretty()
Query:
db.employee.find({$or:[{job:"Analyst"},{salary:{$lt:1500}}]}).pretty()
Output:
{
"_id" : ObjectId("6354d1ce2e4df4ccf26c3ea6"),
"empcode" : 9654,
"empfname" : "Sam",
"emplname" : "Miles",
"job" : "Salesman",
"manager" : 7698,
"hiredate" : "1981-09-28",
"salary" : 1250,
"commission" : 1400,
"deptcode" : 10
}
{
"_id" : ObjectId("6354d1ce2e4df4ccf26c3ea8"),
"empcode" : 9788,
"empfname" : "Connie",
"emplname" : "Smith",
"job" : "Analyst",
"manager" : 7566,
"hiredate" : "1982-12-09",
"salary" : 3000,
"commission" : 0,
"deptcode" : 20
}
{
"_id" : ObjectId("6354e0c02e4df4ccf26c3ead"),
"empcode" : 9902,
"empfname" : "Andrew",
"emplname" : "Faulkner",
"job" : "Analyst",
"manager" : 7566,
"hiredate" : "1981-12-03",
"salary" : 3000,
"commission" : 0,
"deptcode" : 10
}
{
"_id" : ObjectId("6354e0c02e4df4ccf26c3eaf"),
"empcode" : 9591,
"empfname" : "Wendy",
"emplname" : "Shawn",
"job" : "Salesman",
"manager" : 7698,
"hiredate" : "1981-02-22",
"salary" : 500,
"commission" : 0,
"deptcode" : 30
}
Q8. List all the documents of employee collection WHERE job = " ANALYST" AND
(SALARY < 1500 OR empfname LIKE "T%")
Solution:
Query:
db.employee.find({job:"Analyst",$or:[{salary:{$lt:1500}},{empfname:/^T/}]}).pretty()
No records found.
Query:
db.employee.find({job:"Analyst",$or:[{salary:{$lt:1500}},{empfname:/^A/}]}).pretty()
Output:
{
"_id" : ObjectId("6354e0c02e4df4ccf26c3ead"),
"empcode" : 9902,
"empfname" : "Andrew",
"emplname" : "Faulkner",
"job" : "Analyst",
"manager" : 7566,
"hiredate" : "1981-12-03",
"salary" : 3000,
"commission" : 0,
"deptcode" : 10
}
Q9. List all the documents of employee collection in ascending order of job type.
Solution:
The sort() Method: To sort documents in MongoDB, you need to use the sort() method. The
method accepts a document containing a list of fields along with their sorting order. To
specify the sorting order, 1 and -1 are used: 1 is used for ascending order, while -1 is used
for descending order.
Syntax:
The basic syntax of the sort() method is as follows: db.collection_name.find().sort({KEY:1})
Query:
db.employee.find().pretty().sort({"empfname":1})
Output:
{
"_id" : ObjectId("6354e0bf2e4df4ccf26c3ea9"),
"empcode" : 9839,
"empfname" : "Alfred",
"emplname" : "Kinsley",
"job" : "President",
"manager" : 7566,
"hiredate" : "1981-11-17",
"salary" : 5000,
"commission" : 0,
"deptcode" : 10
}
{
"_id" : ObjectId("6354e0c02e4df4ccf26c3ead"),
"empcode" : 9902,
"empfname" : "Andrew",
"emplname" : "Faulkner",
"job" : "Analyst",
"manager" : 7566,
"hiredate" : "1981-12-03",
"salary" : 3000,
"commission" : 0,
"deptcode" : 10
}
{
"_id" : ObjectId("6354e0c02e4df4ccf26c3eb0"),
"empcode" : 9698,
"empfname" : "Bella",
"emplname" : "Swan",
"job" : "Manager",
"manager" : 7839,
"hiredate" : "1982-05-01",
"salary" : 3420,
"commission" : 0,
"deptcode" : 30
}
{
"_id" : ObjectId("6354d1ce2e4df4ccf26c3ea8"),
"empcode" : 9788,
"empfname" : "Connie",
"emplname" : "Smith",
"job" : "Analyst",
"manager" : 7566,
"hiredate" : "1982-12-09",
"salary" : 3000,
"commission" : 0,
"deptcode" : 20
}
{
"_id" : ObjectId("6354e0bf2e4df4ccf26c3eab"),
"empcode" : 9876,
"empfname" : "John",
"emplname" : "Asghar",
"job" : "Software Engineer",
"manager" : 7788,
"hiredate" : "1983-01-12",
"salary" : 3100,
"commission" : 0,
"deptcode" : 20
}
{
"_id" : ObjectId("6354e0c02e4df4ccf26c3eae"),
"empcode" : 9934,
"empfname" : "Karen",
"emplname" : "Matthews",
"job" : "Software Engineer",
"manager" : 7782,
"hiredate" : "1982-01-23",
"salary" : 3300,
"commission" : 1400,
"deptcode" : 20
}
{
"_id" : ObjectId("6354d1ce2e4df4ccf26c3ea7"),
"empcode" : 9782,
"empfname" : "Kevin",
"emplname" : "Hill",
"job" : "Manager",
"manager" : 7839,
"hiredate" : "1981-06-09",
"salary" : 2940,
"commission" : 0,
"deptcode" : 10
}
{
"_id" : ObjectId("6354d1ce2e4df4ccf26c3ea5"),
"empcode" : 9566,
"empfname" : "Kim",
"emplname" : "Jarvis",
"job" : "Manager",
"manager" : 7839,
"hiredate" : "1981-04-02",
"salary" : 3570,
"commission" : 0,
"deptcode" : 20
}
{
"_id" : ObjectId("6354e0bf2e4df4ccf26c3eaa"),
"empcode" : 9844,
"empfname" : "Paul",
"emplname" : "Timothy",
"job" : "Salesman",
"manager" : 7698,
"hiredate" : "1981-09-08",
"salary" : 1500,
"commission" : 0,
"deptcode" : 30
}
{
"_id" : ObjectId("6354e0bf2e4df4ccf26c3eac"),
"empcode" : 9900,
"empfname" : "Rose",
"emplname" : "Summers",
"job" : "Technical Lead",
"manager" : 7698,
"hiredate" : "1981-12-03",
"salary" : 3300,
"commission" : 0,
"deptcode" : 20
}
{
"_id" : ObjectId("6354d1ce2e4df4ccf26c3ea6"),
"empcode" : 9654,
"empfname" : "Sam",
"emplname" : "Miles",
"job" : "Salesman",
"manager" : 7698,
"hiredate" : "1981-09-28",
"salary" : 1250,
"commission" : 1400,
"deptcode" : 10
}
{
"_id" : ObjectId("6354cb482e4df4ccf26c3ea4"),
"empcode" : 9499,
"empfname" : "Tim",
"emplname" : "Adolf",
"job" : "Salesman",
"manager" : 7698,
"hiredate" : "1981-02-20",
"salary" : 1600,
"commission" : 300,
"deptcode" : 30
}
{
"_id" : ObjectId("6354c7f22e4df4ccf26c3ea3"),
"empcode" : 9369,
"empfname" : "Tony",
"emplname" : "Stark",
"job" : "Software Engineer",
"manager" : 7902,
"hiredate" : "1980-12-17",
"salary" : 2800,
"commission" : 0,
"deptcode" : 20
}
{
"_id" : ObjectId("6354e0c02e4df4ccf26c3eaf"),
"empcode" : 9591,
"empfname" : "Wendy",
"emplname" : "Shawn",
"job" : "Salesman",
"manager" : 7698,
"hiredate" : "1981-02-22",
"salary" : 500,
"commission" : 0,
"deptcode" : 30
}
Note: If you don't specify the sorting preference, the sort() method will display the
documents in ascending order.
db.employee.find().pretty().sort({"empfname":1})
{
"_id" : ObjectId("6354e0bf2e4df4ccf26c3ea9"),
"empcode" : 9839,
"empfname" : "Alfred",
"emplname" : "Kinsley",
"job" : "President",
"manager" : 7566,
"hiredate" : "1981-11-17",
"salary" : 5000,
"commission" : 0,
"deptcode" : 10
}
{
"_id" : ObjectId("6354e0c02e4df4ccf26c3ead"),
"empcode" : 9902,
"empfname" : "Andrew",
"emplname" : "Faulkner",
"job" : "Analyst",
"manager" : 7566,
"hiredate" : "1981-12-03",
"salary" : 3000,
"commission" : 0,
"deptcode" : 10
}
{
"_id" : ObjectId("6354e0c02e4df4ccf26c3eb0"),
"empcode" : 9698,
"empfname" : "Bella",
"emplname" : "Swan",
"job" : "Manager",
"manager" : 7839,
"hiredate" : "1982-05-01",
"salary" : 3420,
"commission" : 0,
"deptcode" : 30
}
{
"_id" : ObjectId("6354d1ce2e4df4ccf26c3ea8"),
"empcode" : 9788,
"empfname" : "Connie",
"emplname" : "Smith",
"job" : "Analyst",
"manager" : 7566,
"hiredate" : "1982-12-09",
"salary" : 3000,
"commission" : 0,
"deptcode" : 20
}
{
"_id" : ObjectId("6354e0bf2e4df4ccf26c3eab"),
"empcode" : 9876,
"empfname" : "John",
"emplname" : "Asghar",
"job" : "Software Engineer",
"manager" : 7788,
"hiredate" : "1983-01-12",
"salary" : 3100,
"commission" : 0,
"deptcode" : 20
}
{
"_id" : ObjectId("6354e0c02e4df4ccf26c3eae"),
"empcode" : 9934,
"empfname" : "Karen",
"emplname" : "Matthews",
"job" : "Software Engineer",
"manager" : 7782,
"hiredate" : "1982-01-23",
"salary" : 3300,
"commission" : 1400,
"deptcode" : 20
}
{
"_id" : ObjectId("6354d1ce2e4df4ccf26c3ea6"),
"empcode" : 9654,
"empfname" : "Sam",
"emplname" : "Miles",
"job" : "Salesman",
"manager" : 7698,
"hiredate" : "1981-09-28",
"salary" : 1250,
"commission" : 1400,
"deptcode" : 10
}
{
"_id" : ObjectId("6354cb482e4df4ccf26c3ea4"),
"empcode" : 9499,
"empfname" : "Tim",
"emplname" : "Adolf",
"job" : "Salesman",
"manager" : 7698,
"hiredate" : "1981-02-20",
"salary" : 1600,
"commission" : 300,
"deptcode" : 30
}
{
"_id" : ObjectId("6354c7f22e4df4ccf26c3ea3"),
"empcode" : 9369,
"empfname" : "Tony",
"emplname" : "Stark",
"job" : "Software Engineer",
"manager" : 7902,
"hiredate" : "1980-12-17",
"salary" : 2800,
"commission" : 0,
"deptcode" : 20
}
{
"_id" : ObjectId("6354e0c02e4df4ccf26c3eaf"),
"empcode" : 9591,
"empfname" : "Wendy",
"emplname" : "Shawn",
"job" : "Salesman",
"manager" : 7698,
"hiredate" : "1981-02-22",
"salary" : 500,
"commission" : 0,
"deptcode" : 30
}
> db.employee.find().pretty().sort({"empfname":-1})
{
"_id" : ObjectId("6354e0c02e4df4ccf26c3eaf"),
"empcode" : 9591,
"empfname" : "Wendy",
"emplname" : "Shawn",
"job" : "Salesman",
"manager" : 7698,
"hiredate" : "1981-02-22",
"salary" : 500,
"commission" : 0,
"deptcode" : 30
}
{
"_id" : ObjectId("6354c7f22e4df4ccf26c3ea3"),
"empcode" : 9369,
"empfname" : "Tony",
"emplname" : "Stark",
"job" : "Software Engineer",
"manager" : 7902,
"hiredate" : "1980-12-17",
"salary" : 2800,
"commission" : 0,
"deptcode" : 20
}
{
"_id" : ObjectId("6354cb482e4df4ccf26c3ea4"),
"empcode" : 9499,
"empfname" : "Tim",
"emplname" : "Adolf",
"job" : "Salesman",
"manager" : 7698,
"hiredate" : "1981-02-20",
"salary" : 1600,
"commission" : 300,
"deptcode" : 30
}
{
"_id" : ObjectId("6354d1ce2e4df4ccf26c3ea6"),
"empcode" : 9654,
"empfname" : "Sam",
"emplname" : "Miles",
"job" : "Salesman",
"manager" : 7698,
"hiredate" : "1981-09-28",
"salary" : 1250,
"commission" : 1400,
"deptcode" : 10
}
{
"_id" : ObjectId("6354e0bf2e4df4ccf26c3eac"),
"empcode" : 9900,
"empfname" : "Rose",
"emplname" : "Summers",
"job" : "Technical Lead",
"manager" : 7698,
"hiredate" : "1981-12-03",
"salary" : 3300,
"commission" : 0,
"deptcode" : 20
}
{
"_id" : ObjectId("6354e0bf2e4df4ccf26c3eaa"),
"empcode" : 9844,
"empfname" : "Paul",
"emplname" : "Timothy",
"job" : "Salesman",
"manager" : 7698,
"hiredate" : "1981-09-08",
"salary" : 1500,
"commission" : 0,
"deptcode" : 30
}
{
"_id" : ObjectId("6354d1ce2e4df4ccf26c3ea5"),
"empcode" : 9566,
"empfname" : "Kim",
"emplname" : "Jarvis",
"job" : "Manager",
"manager" : 7839,
"hiredate" : "1981-04-02",
"salary" : 3570,
"commission" : 0,
"deptcode" : 20
}
{
"_id" : ObjectId("6354d1ce2e4df4ccf26c3ea7"),
"empcode" : 9782,
"empfname" : "Kevin",
"emplname" : "Hill",
"job" : "Manager",
"manager" : 7839,
"hiredate" : "1981-06-09",
"salary" : 2940,
"commission" : 0,
"deptcode" : 10
}
{
"_id" : ObjectId("6354e0c02e4df4ccf26c3eae"),
"empcode" : 9934,
"empfname" : "Karen",
"emplname" : "Matthews",
"job" : "Software Engineer",
"manager" : 7782,
"hiredate" : "1982-01-23",
"salary" : 3300,
"commission" : 1400,
"deptcode" : 20
}
{
"_id" : ObjectId("6354e0bf2e4df4ccf26c3eab"),
"empcode" : 9876,
"empfname" : "John",
"emplname" : "Asghar",
"job" : "Software Engineer",
"manager" : 7788,
"hiredate" : "1983-01-12",
"salary" : 3100,
"commission" : 0,
"deptcode" : 20
}
{
"_id" : ObjectId("6354d1ce2e4df4ccf26c3ea8"),
"empcode" : 9788,
"empfname" : "Connie",
"emplname" : "Smith",
"job" : "Analyst",
"manager" : 7566,
"hiredate" : "1982-12-09",
"salary" : 3000,
"commission" : 0,
"deptcode" : 20
}
{
"_id" : ObjectId("6354e0c02e4df4ccf26c3eb0"),
"empcode" : 9698,
"empfname" : "Bella",
"emplname" : "Swan",
"job" : "Manager",
"manager" : 7839,
"hiredate" : "1982-05-01",
"salary" : 3420,
"commission" : 0,
"deptcode" : 30
}
{
"_id" : ObjectId("6354e0c02e4df4ccf26c3ead"),
"empcode" : 9902,
"empfname" : "Andrew",
"emplname" : "Faulkner",
"job" : "Analyst",
"manager" : 7566,
"hiredate" : "1981-12-03",
"salary" : 3000,
"commission" : 0,
"deptcode" : 10
}
{
"_id" : ObjectId("6354e0bf2e4df4ccf26c3ea9"),
"empcode" : 9839,
"empfname" : "Alfred",
"emplname" : "Kinsley",
"job" : "President",
"manager" : 7566,
"hiredate" : "1981-11-17",
"salary" : 5000,
"commission" : 0,
"deptcode" : 10
}
Q11. Create multiple index on empcode and hiredate fields of employee collection.
Solution:
Query:
db.employee.createIndex({empcode:1,hiredate:-1})
Output:
{
"createdCollectionAutomatically" : false,
"numIndexesBefore" : 2,
"numIndexesAfter" : 3,
"ok" : 1
}
Q12.Delete first record of employee collection where job is SALESMAN.
Solution:
The remove() Method: MongoDB's remove() method is used to remove a document from the
collection. The remove() method accepts two parameters: one is the deletion criteria and the second is
the justOne flag.
deletion criteria − (Optional) deletion criteria according to which documents will be removed.
justOne − (Optional) if set to true or 1, then only one document is removed.
Syntax: db.collection_name.remove(deletion_criteria)
Remove Only One: If there are multiple records and you want to delete only the
first record, then set the justOne parameter in the remove() method.
Syntax: db.collection_name.remove(deletion_criteria, 1)
Query:
> db.employee.remove({"job" : "Salesman"}, 1)
> db.employee.find().pretty()
{
"_id" : ObjectId("6354d1ce2e4df4ccf26c3ea5"),
"empcode" : 9566,
"empfname" : "Kim",
"emplname" : "Jarvis",
"job" : "Manager",
"manager" : 7839,
"hiredate" : "1981-04-02",
"salary" : 3570,
"commission" : 0,
"deptcode" : 20
}
{
"_id" : ObjectId("6354d1ce2e4df4ccf26c3ea6"),
"empcode" : 9654,
"empfname" : "Sam",
"emplname" : "Miles",
"job" : "Salesman",
"manager" : 7698,
"hiredate" : "1981-09-28",
"salary" : 1250,
"commission" : 1400,
"deptcode" : 10
}
{
"_id" : ObjectId("6354d1ce2e4df4ccf26c3ea7"),
"empcode" : 9782,
"empfname" : "Kevin",
"emplname" : "Hill",
"job" : "Manager",
"manager" : 7839,
"hiredate" : "1981-06-09",
"salary" : 2940,
"commission" : 0,
"deptcode" : 10
}
{
"_id" : ObjectId("6354d1ce2e4df4ccf26c3ea8"),
"empcode" : 9788,
"empfname" : "Connie",
"emplname" : "Smith",
"job" : "Analyst",
"manager" : 7566,
"hiredate" : "1982-12-09",
"salary" : 3000,
"commission" : 0,
"deptcode" : 20
}
{
"_id" : ObjectId("6354e0bf2e4df4ccf26c3ea9"),
"empcode" : 9839,
"empfname" : "Alfred",
"emplname" : "Kinsley",
"job" : "President",
"manager" : 7566,
"hiredate" : "1981-11-17",
"salary" : 5000,
"commission" : 0,
"deptcode" : 10
}
{
"_id" : ObjectId("6354e0bf2e4df4ccf26c3eaa"),
"empcode" : 9844,
"empfname" : "Paul",
"emplname" : "Timothy",
"job" : "Salesman",
"manager" : 7698,
"hiredate" : "1981-09-08",
"salary" : 1500,
"commission" : 0,
"deptcode" : 30
}
{
"_id" : ObjectId("6354e0bf2e4df4ccf26c3eab"),
"empcode" : 9876,
"empfname" : "John",
"emplname" : "Asghar",
"job" : "Software Engineer",
"manager" : 7788,
"hiredate" : "1983-01-12",
"salary" : 3100,
"commission" : 0,
"deptcode" : 20
}
{
"_id" : ObjectId("6354e0bf2e4df4ccf26c3eac"),
"empcode" : 9900,
"empfname" : "Rose",
"emplname" : "Summers",
"job" : "Technical Lead",
"manager" : 7698,
"hiredate" : "1981-12-03",
"salary" : 3300,
"commission" : 0,
"deptcode" : 20
}
{
"_id" : ObjectId("6354e0c02e4df4ccf26c3ead"),
"empcode" : 9902,
"empfname" : "Andrew",
"emplname" : "Faulkner",
"job" : "Analyst",
"manager" : 7566,
"hiredate" : "1981-12-03",
"salary" : 3000,
"commission" : 0,
"deptcode" : 10
}
{
"_id" : ObjectId("6354e0c02e4df4ccf26c3eae"),
"empcode" : 9934,
"empfname" : "Karen",
"emplname" : "Matthews",
"job" : "Software Engineer",
"manager" : 7782,
"hiredate" : "1982-01-23",
"salary" : 3300,
"commission" : 1400,
"deptcode" : 20
}
{
"_id" : ObjectId("6354e0c02e4df4ccf26c3eaf"),
"empcode" : 9591,
"empfname" : "Wendy",
"emplname" : "Shawn",
"job" : "Salesman",
"manager" : 7698,
"hiredate" : "1981-02-22",
"salary" : 500,
"commission" : 0,
"deptcode" : 30
}
{
"_id" : ObjectId("6354e0c02e4df4ccf26c3eb0"),
"empcode" : 9698,
"empfname" : "Bella",
"emplname" : "Swan",
"job" : "Manager",
"manager" : 7839,
"hiredate" : "1982-05-01",
"salary" : 3420,
"commission" : 0,
"deptcode" : 30
}
{
"_id" : ObjectId("6354d1ce2e4df4ccf26c3ea5"),
"empcode" : 9566,
"empfname" : "Kim",
"emplname" : "Jarvis",
"job" : "Manager",
"manager" : 7839,
"hiredate" : "1981-04-02",
"salary" : 3570,
"commission" : 0,
"deptcode" : 20
}
{
"_id" : ObjectId("6354d1ce2e4df4ccf26c3ea7"),
"empcode" : 9782,
"empfname" : "Kevin",
"emplname" : "Hill",
"job" : "Manager",
"manager" : 7839,
"hiredate" : "1981-06-09",
"salary" : 2940,
"commission" : 0,
"deptcode" : 10
}
{
"_id" : ObjectId("6354d1ce2e4df4ccf26c3ea8"),
"empcode" : 9788,
"empfname" : "Connie",
"emplname" : "Smith",
"job" : "Analyst",
"manager" : 7566,
"hiredate" : "1982-12-09",
"salary" : 3000,
"commission" : 0,
"deptcode" : 20
}
{
"_id" : ObjectId("6354e0bf2e4df4ccf26c3ea9"),
"empcode" : 9839,
"empfname" : "Alfred",
"emplname" : "Kinsley",
"job" : "President",
"manager" : 7566,
"hiredate" : "1981-11-17",
"salary" : 5000,
"commission" : 0,
"deptcode" : 10
}
{
"_id" : ObjectId("6354e0bf2e4df4ccf26c3eab"),
"empcode" : 9876,
"empfname" : "John",
"emplname" : "Asghar",
"job" : "Software Engineer",
"manager" : 7788,
"hiredate" : "1983-01-12",
"salary" : 3100,
"commission" : 0,
"deptcode" : 20
}
{
"_id" : ObjectId("6354e0bf2e4df4ccf26c3eac"),
"empcode" : 9900,
"empfname" : "Rose",
"emplname" : "Summers",
"job" : "Technical Lead",
"manager" : 7698,
"hiredate" : "1981-12-03",
"salary" : 3300,
"commission" : 0,
"deptcode" : 20
}
{
"_id" : ObjectId("6354e0c02e4df4ccf26c3ead"),
"empcode" : 9902,
"empfname" : "Andrew",
"emplname" : "Faulkner",
"job" : "Analyst",
"manager" : 7566,
"hiredate" : "1981-12-03",
"salary" : 3000,
"commission" : 0,
"deptcode" : 10
}
{
"_id" : ObjectId("6354e0c02e4df4ccf26c3eae"),
"empcode" : 9934,
"empfname" : "Karen",
"emplname" : "Matthews",
"job" : "Software Engineer",
"manager" : 7782,
"hiredate" : "1982-01-23",
"salary" : 3300,
"commission" : 1400,
"deptcode" : 20
}
{
"_id" : ObjectId("6354e0c02e4df4ccf26c3eb0"),
"empcode" : 9698,
"empfname" : "Bella",
"emplname" : "Swan",
"job" : "Manager",
"manager" : 7839,
"hiredate" : "1982-05-01",
"salary" : 3420,
"commission" : 0,
"deptcode" : 30
}
Query:
> db.employee.remove({})
WriteResult({ "nRemoved" : 10 })
> db.employee.find().pretty()
No records found.
Experiment Number 4
Hive:
1. Hive Data Types
2. Create Database & Table in Hive
3. Hive Partitioning
4. Hive Built-In Operators
5. Hive Built-In Functions
6. Hive Views and Indexes
7. HiveQL : Select Where, Select OrderBy, Select GroupBy, Select Joins
1. Create Database with name College
Query: create database timscdr;
Query:
> create table timscdr.employee (empcode int, empfname string, emplname string, job string,
> manager string, hiredate string, salary int, commission int, deptcode int)
> row format delimited
> fields terminated by ',' ;
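The same DDL can also be run from a Java program through the Hive JDBC driver. The sketch below is illustrative only; the HiveServer2 URL, port and user are assumptions for a default single-node setup:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveDdlExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // HiveServer2 endpoint; host, port and user are assumptions
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "cloudera", "");
        Statement stmt = con.createStatement();
        stmt.execute("create database if not exists timscdr");
        stmt.execute("create table if not exists timscdr.employee ("
                + "empcode int, empfname string, emplname string, job string, "
                + "manager string, hiredate string, salary int, commission int, deptcode int) "
                + "row format delimited fields terminated by ','");
        // Simple sanity check: list the tables in the database
        ResultSet rs = stmt.executeQuery("show tables in timscdr");
        while (rs.next()) {
            System.out.println(rs.getString(1));
        }
        con.close();
    }
}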
Dept table:
deptcode INT, deptname STRING, location STRING
3. Load data in the tables employee and dept using .csv files
Steps:
First create an empty document in Cloudera, paste the data into the file, and save it with a .csv
extension.
Emp_data
Command: load data local inpath '/home/cloudera/emp_data.csv' into table timscdr.employee;
Dept_data
5. Get the table structure
Timscdr.employee
Timscdr.dept
b. Display all the MANAGERS in order where the one who served the most
appears first.
Query:
> SELECT empfname, emplname, job, hiredate
> FROM timscdr.employee
> WHERE job = 'MANAGER' order by hiredate asc;
c. Display the FULL NAME and SALARY drawn by an analyst working in
dept no 20
Query:
select empfname, emplname, salary, deptcode
> from timscdr.employee
> where job = 'ANALYST' and deptcode = 20;
7. Perform following joins between employee and dept table using key
deptcode.
a. Left Join
Query:
select e.empfname, e.deptcode, d.deptname, d.deptcode from timscdr.employee e left outer
join timscdr.dept d on e.deptcode = d.deptcode;
b. Right Join
Query:
select e.empfname, e.deptcode, d.deptname, d.deptcode from timscdr.employee e right outer
join timscdr.dept d on e.deptcode = d.deptcode;
c. Full Join
Query:
Select e.empfname, e.deptcode, d.deptname, d.deptcode from timscdr.employee e full outer
join timscdr.dept d on e.deptcode = d.deptcode;
d. Inner Join
Query:
select e.empfname, e.deptcode, d.deptname, d.deptcode from timscdr.employee e join
timscdr.dept d on e.deptcode = d.deptcode;
8. Create view on employee table for employee whose salary greater
than 2000 and then drop that view.
Create View:
> create view timscdr.employee_vw as select * from timscdr.employee where salary > 2000;
Drop View
> drop view timscdr.employee_vw;
>create table student_part (roll int, stfname string, stlname string) partitioned by
(course string) row format delimited fields terminated by ',';
Experiment Number 5
Pig: Installation of Apache Pig
Prerequisites:
It is essential that you have Hadoop and Java installed on your system before you go for Apache
Pig. Therefore, prior to installing Apache Pig, install Hadoop and Java by following the steps
given in the Hadoop installation document.
Step 2: On clicking the specified link, you will be redirected to the Apache Pig Releases page.
Click on Download release now.
Step 3: Choose and click any one of these mirrors as shown below.
Click on the above link
Step 4: These mirrors will take you to the Pig Releases page. This page contains various versions
of Apache Pig. Click the latest version among them.
Step 5: Within these folders, you will have the source and binary files of Apache Pig in various
distributions. Download the tar files of the source and binary files of Apache Pig 0.17,
pig-0.17.0-src.tar.gz and pig-0.17.0.tar.gz.
Step 6: Download the pig-0.17.0.tar.gz version of Pig, which is the latest and about 220 MB in
size. Now move the downloaded Pig tar file to your desired location. In this case, we move it to
the /Documents folder.
Using cd command: cd Documents/
Step 7: Extract this tar file with the help of below command (make sure to check your tar
filename):
tar -xvf pig-0.17.0.tar.gz
Step 8: Once it is installed it’s time for us to switch to our Hadoop user. In my case it is hadoop
and password is admin.
Command: su – hadoop
Step 9: Need to move this extracted folder to the hadoopusr user. For that, use the below
command (make sure name of your extracted folder is pig-0.17.0 otherwise change it
accordingly).
Command: sudo mv pig-0.17.0 /usr/local/
Step 10: Now once we moved it we need to change the environment variable for Pig’s location.
For that open the bashrc file with below command.
Command: sudo gedit ~/.bashrc
Once the file open save the below path inside this bashrc file.
#Pig location
export PIG_INSTALL=/usr/local/pig-0.17.0
export PATH=$PATH:/usr/local/pig-0.17.0/bin
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
Step 11: Then reload the bashrc file so that the changes take effect, using the below command:
source ~/.bashrc
Step 12: Pig is now successfully installed on our Hadoop single-node setup; start Pig with the below
command.
Command: pig -x local
Experiment Number 6
Spark:
1. SPARK – INTRODUCTION
Industries are using Hadoop extensively to analyze their data sets. The reason is that the Hadoop
framework is based on a simple programming model (MapReduce) and it enables a computing
solution that is scalable, flexible, fault-tolerant and cost effective. Here, the main concern is to
maintain speed in processing large datasets in terms of waiting time between queries and
waiting time to run the program.
Spark was introduced by Apache Software Foundation for speeding up the Hadoop
computational computing software process.
As against a common belief, Spark is not a modified version of Hadoop and is not, really,
dependent on Hadoop because it has its own cluster management. Hadoop is just one of the
ways to implement Spark.
Spark uses Hadoop in two ways – one is storage and second is processing. Since Spark has its
own cluster management computation, it uses Hadoop for storage purpose only.
Apache Spark
Apache Spark is a lightning-fast cluster computing technology, designed for fast computation.
It is based on Hadoop MapReduce and it extends the MapReduce model to efficiently use it for
more types of computations, which includes interactive queries and stream processing. The
main feature of Spark is its in-memory cluster computing that increases the processing speed
of an application.
Spark is designed to cover a wide range of workloads such as batch applications, iterative
algorithms, interactive queries and streaming. Apart from supporting all these workload in a
respective system, it reduces the management burden of maintaining separate tools.
Spark is one of Hadoop’s sub project developed in 2009 in UC Berkeley’s AMPLab by Matei
Zaharia. It was Open Sourced in 2010 under a BSD license. It was donated to Apache software
foundation in 2013, and now Apache Spark has become a top level Apache project from Feb-
2014.
Features of Apache Spark
• Supports multiple languages: Spark provides built-in APIs in Java, Scala, or Python.
Therefore, you can write applications in different languages. Spark comes up with 80
high-level operators for interactive querying.
• Advanced Analytics: Spark not only supports ‘Map’ and ‘reduce’. It also supports SQL
queries, Streaming data, Machine learning (ML), and Graph algorithms.
The following diagram shows three ways of how Spark can be built with Hadoop components.
• Standalone: Spark Standalone deployment means Spark occupies the place on top of HDFS and
space is allocated for HDFS explicitly. Here, Spark and MapReduce run side by side to cover all
Spark jobs on the cluster.
• Hadoop Yarn: Hadoop Yarn deployment means, simply, spark runs on Yarn without
any pre-installation or root access required. It helps to integrate Spark into Hadoop
ecosystem or Hadoop stack. It allows other components to run on top of stack.
• Spark in MapReduce (SIMR): Spark in MapReduce is used to launch Spark jobs in
addition to standalone deployment. With SIMR, the user can start Spark and use its shell
without any administrative access.
Components of Spark
Spark SQL
Spark SQL is a component on top of Spark Core that introduces a new data abstraction called
SchemaRDD, which provides support for structured and semi-structured data.
Spark Streaming
Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming
analytics. It ingests data in mini-batches and performs RDD (Resilient Distributed Datasets)
transformations on those mini-batches of data.
GraphX
GraphX is a distributed graph-processing framework on top of Spark. It provides an API for
expressing graph computation that can model the user-defined graphs by using the Pregel
abstraction API. It also provides an optimized runtime for this abstraction.
2. SPARK – RDD
MapReduce is widely adopted for processing and generating large datasets with a parallel,
distributed algorithm on a cluster. It allows users to write parallel computations, using a set of
high-level operators, without having to worry about work distribution and fault tolerance.
Unfortunately, in most current frameworks, the only way to reuse data between computations
(Ex: between two MapReduce jobs) is to write it to an external stable storage system (Ex:
HDFS). Although this framework provides numerous abstractions for accessing a cluster’s
computational resources, users still want more.
Both Iterative and Interactive applications require faster data sharing across parallel jobs.
Data sharing is slow in MapReduce due to replication, serialization, and disk IO. Regarding
storage system, most of the Hadoop applications, they spend more than 90% of the time doing
HDFS read-write operations.
User runs ad-hoc queries on the same subset of data. Each query will do the disk I/O on the stable
storage, which can dominates application execution time.
The following illustration explains how the current framework works while doing the interactive
queries on MapReduce.
Figure: Interactive operations on MapReduce
Data sharing is slow in MapReduce due to replication, serialization, and disk IO. Most of the
Hadoop applications, they spend more than 90% of the time doing HDFS readwrite operations.
The illustration given below shows the iterative operations on Spark RDD. It will store
intermediate results in a distributed memory instead of Stable storage (Disk) and make the
system faster.
Note: If the Distributed memory (RAM) is not sufficient to store intermediate results (state of the
JOB), then it will store those results on the disk.
Figure: Iterative operations on Spark RDD
This illustration shows interactive operations on Spark RDD. If different queries are run on the
same set of data repeatedly, this particular data can be kept in memory for better execution
times.
By default, each transformed RDD may be recomputed each time you run an action on it.
However, you may also persist an RDD in memory, in which case Spark will keep the elements
around on the cluster for much faster access, the next time you query it. There is also support
for persisting RDDs on disk, or replicated across multiple nodes.
3. SPARK – INSTALLATION
Spark is Hadoop’s sub-project. Therefore, it is better to install Spark into a Linux based system.
The following steps show how to install Apache Spark.
Java installation is one of the mandatory things in installing Spark. Try the following command
to verify the JAVA version.
$java -version
If Java is already installed on your system, you get to see the following response –
In case you do not have Java installed on your system, then Install Java before proceeding to
next step.
You should have the Scala language installed to implement Spark. So let us verify the Scala
installation using the following command.
$scala -version
If Scala is already installed on your system, you get to see the following response –
In case you don’t have Scala installed on your system, then proceed to next step for Scala
installation.
Download the latest version of Scala by visiting the following link: Download Scala. For this
tutorial, we are using the scala-2.11.6 version. After downloading, you will find the Scala tar file
in the download folder.
Step 4: Installing Scala
$ su – Password:
# cd /home/Hadoop/Downloads/
# mv scala-2.11.6 /usr/local/scala
# exit
$scala -version
If Scala is already installed on your system, you get to see the following response –
$ su – Password:
# cd /home/Hadoop/Downloads/
# mv spark-1.3.1-bin-hadoop2.6 /usr/local/spark
# exit
$ source ~/.bashrc
If spark is installed successfully then you will find the following output.
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/06/04 15:25:22 INFO SecurityManager: Changing view acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: Changing modify acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: SecurityManager: authentication disabled; ui
acls disabled; users with view permissions: Set(hadoop); users with modify permissions:
Set(hadoop)
15/06/04 15:25:22 INFO HttpServer: Starting HTTP Server
15/06/04 15:25:23 INFO Utils: Successfully started service 'HTTP class server' on port
43292.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.4.0
      /_/
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
Spark context available as sc
scala>
4. SPARK – CORE PROGRAMMING
Spark Core is the base of the whole project. It provides distributed task dispatching, scheduling,
and basic I/O functionalities. Spark uses a specialized fundamental data structure known as
RDD (Resilient Distributed Datasets) that is a logical collection of data partitioned across
machines. RDDs can be created in two ways; one is by referencing datasets in external storage
systems and the second is by applying transformations (e.g. map, filter, reduce, join) on existing
RDDs.
The RDD abstraction is exposed through a language-integrated API. This simplifies
programming complexity because the way applications manipulate RDDs is similar to
manipulating local collections of data.
Spark Shell
Spark provides an interactive shell: a powerful tool to analyze data interactively. It is available
in either Scala or Python language. Spark’s primary abstraction is a distributed collection of
items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop Input
Formats (such as HDFS files) or by transforming other RDDs.
$ spark-shell
The Spark RDD API introduces a few Transformations and a few Actions to manipulate RDDs.
RDD Transformations
RDD transformations return a pointer to a new RDD and allow you to create dependencies
between RDDs. Each RDD in the dependency chain (string of dependencies) has a function for
calculating its data and a pointer (dependency) to its parent RDD.
Spark is lazy, so nothing will be executed unless you call some transformation or action that
will trigger job creation and execution. Look at the following snippet of the wordcount
example.
Therefore, RDD transformation is not a set of data but is a step in a program (might be the only
step) telling Spark how to get data and what to do with it.
Given below is a list of RDD transformations.
map(func)
1 Returns a new distributed dataset, formed by passing each element of the source
through a function func.
filter(func)
2 Returns a new dataset formed by selecting those elements of the source on which
func returns true.
flatMap(func)
3 Similar to map, but each input item can be mapped to 0 or more output items (so
func should return a Seq rather than a single item).
mapPartitions(func)
4 Similar to map, but runs separately on each partition (block) of the RDD, so func
must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.
mapPartitionsWithIndex(func)
5 Similar to map Partitions, but also provides func with an integer value
representing the index of the partition, so func must be of type (Int, Iterator<T>)
=> Iterator<U> when running on an RDD of type T.
sample(withReplacement, fraction, seed)
6 Sample a fraction of the data, with or without replacement, using a given random
number generator seed.
union(otherDataset)
7 Returns a new dataset that contains the union of the elements in the source
dataset and the argument.
intersection(otherDataset)
8 Returns a new RDD that contains the intersection of elements in the source dataset
and the argument.
distinct([numTasks]))
9 Returns a new dataset that contains the distinct elements of the source dataset.
groupByKey([numTasks])
When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.
10 Note: If you are grouping in order to perform an aggregation (such as a sum or
average) over each key, using reduceByKey or aggregateByKey will yield much
better performance.
reduceByKey(func, [numTasks])
11 When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where
the values for each key are aggregated using the given reduce function func,
which must be of type (V, V) => V. Like in groupByKey, the number of reduce
tasks is configurable through an optional second argument.
aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])
12 When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where
the values for each key are aggregated using the given combine functions and a
neutral "zero" value. Allows an aggregated value type that is different from the
input value type, while avoiding unnecessary allocations. Like in groupByKey,
the number of reduce tasks is configurable through an optional second argument.
sortByKey([ascending], [numTasks])
13 When called on a dataset of (K, V) pairs where K implements Ordered, returns a
dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified
in the boolean ascending argument.
14 join(otherDataset, [numTasks])
When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V,
W)) pairs with all pairs of elements for each key. Outer joins are supported
through leftOuterJoin, rightOuterJoin, and fullOuterJoin.
cogroup(otherDataset, [numTasks])
15 When called on datasets of type (K, V) and (K, W), returns a dataset of (K,
(Iterable<V>, Iterable<W>)) tuples. This operation is also called group With.
cartesian(otherDataset)
16 When called on datasets of types T and U, returns a dataset of (T, U) pairs (all
pairs of elements).
pipe(command, [envVars])
17 Pipe each partition of the RDD through a shell command, e.g. a Perl or bash
script. RDD elements are written to the process's stdin and lines output to its
stdout are returned as an RDD of strings.
coalesce(numPartitions)
18 Decreases the number of partitions in the RDD to numPartitions. Useful for running
operations more efficiently after filtering down a large dataset.
repartition(numPartitions)
19 Reshuffle the data in the RDD randomly to create either more or fewer partitions
and balance it across them. This always shuffles all data over the network.
repartitionAndSortWithinPartitions(partitioner)
20 Repartition the RDD according to the given partitioner and, within each resulting
partition, sort records by their keys. This is more efficient than calling repartition
and then sorting within each partition because it can push the sorting down into
the shuffle machinery.
reduce(func)
1 Aggregate the elements of the dataset using a function func (which takes two
arguments and returns one). The function should be commutative and associative
so that it can be computed correctly in parallel.
collect()
2 Returns all the elements of the dataset as an array at the driver program. This is
usually useful after a filter or other operation that returns a sufficiently small subset
of the data.
count()
3 Returns the number of elements in the dataset.
first()
4 Returns the first element of the dataset (similar to take(1)).
take(n)
5 Returns an array with the first n elements of the dataset.
takeSample(withReplacement, num, [seed])
6 Returns an array with a random sample of num elements of the dataset, with or
without replacement, optionally pre-specifying a random number generator seed.
without replacement, optionally pre-specifying a random number generator seed.
takeOrdered(n, [ordering])
7 Returns the first n elements of the RDD using either their natural order or a custom
comparator.
saveAsTextFile(path)
8 Writes the elements of the dataset as a text file (or set of text files) in a given
directory in the local filesystem, HDFS or any other Hadoop-supported file system.
Spark calls toString on each element to convert it to a line of text
in the file.
saveAsObjectFile(path)
10 Writes the elements of the dataset in a simple format using Java serialization,
which can then be loaded using SparkContext.objectFile().
countByKey()
11 Only available on RDDs of type (K, V). Returns a hashmap of (K, Int) pairs with
the count of each key.
foreach(func)
Runs a function func on each element of the dataset. This is usually, done for side
effects such as updating an Accumulator or interacting with external storage
12 systems.
Let us see the implementations of few RDD transformations and actions in RDD programming
with the help of an example.
Example
Consider a word count example: it counts each word appearing in a document. Consider the
following text as the input, saved as an input.txt file in the home directory. input.txt: input
file.
people are not as beautiful as they look, as they walk or as they talk.
they are only as beautiful as they love,
as they care as they share.
Open Spark-Shell
The following command is used to open spark shell. Generally, spark is built using Scala.
Therefore, a Spark program runs on Scala environment.
$ spark-shell
If the Spark shell opens successfully, you will find the following output. The last line of the
output, "Spark context available as sc", means that the Spark shell has automatically created the
SparkContext object with the name sc. Before starting the first step of a program, the
SparkContext object should be created.
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/06/04 15:25:22 INFO SecurityManager: Changing view acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: Changing modify acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: SecurityManager: authentication disabled; ui
acls disabled; users with view permissions: Set(hadoop); users with modify permissions:
Set(hadoop)
15/06/04 15:25:22 INFO HttpServer: Starting HTTP Server
15/06/04 15:25:23 INFO Utils: Successfully started service 'HTTP class server' on port
43292.
Welcome to Spark version 1.2.0 (ASCII art banner)
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
Spark context available as sc
scala>
Create an RDD
First, we have to read the input file using the Spark-Scala API and create an RDD.
The following command is used for reading a file from a given location. Here, a new RDD is created with the name inputfile. The String given as an argument to the textFile("") method is the absolute path of the input file. However, if only the file name is given, the input file is taken from the current location.
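The commands themselves are missing from this copy. A minimal sketch, assuming the input file is named input.txt and that the counts RDD used below is built with the usual flatMap / map / reduceByKey word-count pipeline:
scala> val inputfile = sc.textFile("input.txt")
scala> val counts = inputfile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)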
Current RDD
While working with the RDD, if you want to know about the current RDD, use the following command. It will show a description of the current RDD and its dependencies, which is useful for debugging.
scala> counts.toDebugString
To apply the action and store the result, use the saveAsTextFile method, which writes the word counts to a directory named "output" in the current location:
scala> counts.saveAsTextFile("output")
Output
(people,1)
(are,2)
(not,1)
(as,8)
(beautiful,2)
(they,7)
(look,1)
(walk,1)
(or,1)
(talk,1)
(only,1)
(love,1)
(care,1)
(share,1)
Before un-persisting, if you want to see the storage space used by this application, open the following URL in your browser.
http://localhost:4040
You will see a screen that shows the storage space used by the application running on the Spark shell.
If you want to un-persist the storage space of a particular RDD, use the following command.
scala> counts.unpersist()
To verify the storage space in the browser, use the following URL.
http://localhost:4040
You will see a screen that shows the storage space used by the application running on the Spark shell.
5. SPARK – DEPLOYMENT
spark-submit is a shell command used to deploy a Spark application on a cluster. It works with all the respective cluster managers through a uniform interface, so you do not have to configure your application separately for each one.
Example
Let us take the same word count example we used before with shell commands; here, we treat it as a Spark application.
Sample Input
The input data is the same text used in the earlier example, saved in a file named in.txt.
SparkWordCount.scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark._

object SparkWordCount {
def main(args: Array[String]) {
Note: While transforming the inputRDD into countRDD, we use flatMap() to tokenize the lines (from the text file) into words, map() to convert each word into a (word, 1) pair, and reduceByKey() to count the repetitions of each word.
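The body of the program is not reproduced in this copy. A minimal sketch of what SparkWordCount.scala could look like, consistent with the note above (reading in.txt, writing to the outfile directory, and printing OK as the last statement); the SparkContext constructor arguments ("local" master and the application name) are assumptions:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark._

object SparkWordCount {
  def main(args: Array[String]) {
    // Create the SparkContext ("local" master and the app name are assumptions)
    val sc = new SparkContext("local", "Word Count")

    // Read the input file into inputRDD
    val inputRDD = sc.textFile("in.txt")

    // flatMap tokenizes lines into words, map pairs each word with 1,
    // and reduceByKey sums the counts of each word
    val countRDD = inputRDD.flatMap(line => line.split(" "))
                           .map(word => (word, 1))
                           .reduceByKey(_ + _)

    // Save the result to the outfile directory; printing OK is the last statement
    countRDD.saveAsTextFile("outfile")
    println("OK")
  }
}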
Use the following steps to submit this application. Execute all the steps in the spark-application directory through the terminal.
Step 1: Download Spark Jar
The Spark core jar is required for compilation; therefore, download spark-core_2.10-1.3.0.jar from the following link: Spark core jar, and move the jar file from the download directory to the spark-application directory.
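The remaining submission steps are not reproduced in this copy. A minimal sketch of what they could look like, assuming the jar is named count.jar (the name that appears in the log below) and that scalac and spark-submit are on the PATH; the classpath is an assumption:

Step 2: Compile SparkWordCount.scala against the downloaded Spark core jar
$ scalac -classpath "spark-core_2.10-1.3.0.jar" SparkWordCount.scala

Step 3: Package the compiled classes into a jar
$ jar -cvf count.jar SparkWordCount*.class

Step 4: Submit the application
$ spark-submit --class SparkWordCount --master local count.jar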
If it is executed successfully, you will find the output given below. The OK line in the following output comes from the last statement of the program and is printed for user identification. If you read the following output carefully, you will find different things, such as:
• successfully started service 'sparkDriver' on port 42954
• MemoryStore started with capacity 267.3 MB
• Started SparkUI at http://192.168.1.217:4040
• Added JAR file:/home/hadoop/piapplication/count.jar
• ResultStage 1 (saveAsTextFile at SparkPi.scala:11) finished in 0.566 s
• Stopped Spark web UI at http://192.168.1.217:4040
• MemoryStore cleared
15/07/08 13:56:04 INFO Slf4jLogger: Slf4jLogger started
15/07/08 13:56:04 INFO Utils: Successfully started service 'sparkDriver' on port 42954.
15/07/08 13:56:04 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://[email protected]:42954]
15/07/08 13:56:04 INFO MemoryStore: MemoryStore started with capacity 267.3 MB
15/07/08 13:56:05 INFO HttpServer: Starting HTTP Server
15/07/08 13:56:05 INFO Utils: Successfully started service 'HTTP file server' on port
56707.
15/07/08 13:56:06 INFO SparkUI: Started SparkUI at http://192.168.1.217:4040
15/07/08 13:56:07 INFO SparkContext: Added JAR
file:/home/hadoop/piapplication/count.jar at
http://192.168.1.217:56707/jars/count.jar with timestamp 1436343967029
15/07/08 13:56:11 INFO Executor: Adding file:/tmp/spark-45a07b83-42ed-42b3b2c2-
823d8d99c5af/userFiles-df4f4c20-a368-4cdd-a2a7-39ed45eb30cf/count.jar to class loader
15/07/08 13:56:11 INFO HadoopRDD: Input split:
file:/home/hadoop/piapplication/in.txt:0+54
15/07/08 13:56:12 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2001 bytes result
sent to driver
(MapPartitionsRDD[5] at saveAsTextFile at SparkPi.scala:11), which is now runnable
15/07/08 13:56:12 INFO DAGScheduler: Submitting 1 missing tasks from
ResultStage 1 (MapPartitionsRDD[5] at saveAsTextFile at SparkPi.scala:11)
15/07/08 13:56:13 INFO DAGScheduler: ResultStage 1 (saveAsTextFile at
SparkPi.scala:11) finished in 0.566 s
15/07/08 13:56:13 INFO DAGScheduler: Job 0 finished: saveAsTextFile at
SparkPi.scala:11, took 2.892996 s
OK
15/07/08 13:56:13 INFO SparkContext: Invoking stop() from shutdown hook
15/07/08 13:56:13 INFO SparkUI: Stopped Spark web UI at http://192.168.1.217:4040
15/07/08 13:56:13 INFO DAGScheduler: Stopping DAGScheduler
15/07/08 13:56:14 INFO MapOutputTrackerMasterEndpoint:
MapOutputTrackerMasterEndpoint stopped!
15/07/08 13:56:14 INFO Utils: path = /tmp/spark-45a07b83-42ed-42b3-
b2c2823d8d99c5af/blockmgr-ccdda9e3-24f6-491b-b509-3d15a9e05818, already present as
root for deletion.
15/07/08 13:56:14 INFO MemoryStore: MemoryStore cleared
15/07/08 13:56:14 INFO BlockManager: BlockManager stopped
15/07/08 13:56:14 INFO BlockManagerMaster: BlockManagerMaster stopped
15/07/08 13:56:14 INFO SparkContext: Successfully stopped SparkContext
15/07/08 13:56:14 INFO Utils: Shutdown hook called
15/07/08 13:56:14 INFO Utils: Deleting directory /tmp/spark-45a07b83-42ed-42b3b2c2-
823d8d99c5af
15/07/08 13:56:14 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:
OutputCommitCoordinator stopped!
To check the output, move into the output directory (outfile) and list the part files:
$ cd outfile
$ ls
part-00000 part-00001 _SUCCESS
$ cat part-00000
(people,1)
(are,2)
(not,1)
(as,8)
(beautiful,2)
(they,7)
(look,1)
$ cat part-00001
(walk,1)
(or,1)
(talk,1)
(only,1)
(love,1)
(care,1)
(share,1)
Go through the following section to know more about the ‘spark-submit’ command.
Spark-submit Syntax
Options
The table given below describes a list of options:
S.No Option Description
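The syntax line and the option rows themselves are missing from this copy. As a sketch, the general form of the command and a few commonly used options (taken from the standard spark-submit help, not from the original table) are:

$ spark-submit [options] <app jar | python file> [app arguments]

--master MASTER_URL : spark://host:port, mesos://host:port, yarn, or local
--deploy-mode DEPLOY_MODE : launch the driver locally ("client") or on one of the worker machines ("cluster")
--class CLASS_NAME : the main class of the application (for Java / Scala apps)
--name NAME : a name for the application
--jars JARS : comma-separated list of local jars to include on the driver and executor classpaths
--driver-memory MEM : memory for the driver (e.g. 1000M, 2G)
--executor-memory MEM : memory per executor (e.g. 1000M, 2G)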
Spark provides two types of shared variables: broadcast variables and accumulators.
• Broadcast variables: used to efficiently distribute large values.
• Accumulators: used to aggregate information from a particular collection.
Broadcast Variables
Broadcast variables allow the programmer to keep a read-only variable cached on each
machine rather than shipping a copy of it with tasks. They can be used, for example, to
give every node a copy of a large input dataset in an efficient manner. Spark also
attempts to distribute broadcast variables using efficient broadcast algorithms to reduce
communication cost.
Spark actions are executed through a set of stages, separated by distributed “shuffle”
operations. Spark automatically broadcasts the common data needed by tasks within
each stage.
The data broadcasted this way is cached in serialized form and is deserialized before
running each task. This means that explicitly creating broadcast variables is only
useful when tasks across multiple stages need the same data or when caching the data
in deserialized form is important.
Broadcast variables are created from a variable v by
calling
SparkContext.broadcast(v). The broadcast variable is a wrapper around v, and its
value can be accessed by calling the value method. The code given below shows this:
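The code itself is missing from this copy; a minimal sketch based on the standard Spark broadcast example (the Array(1, 2, 3) value is an illustrative assumption):

scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
scala> broadcastVar.value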
Output:
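(Expected shell output for the sketch above; the broadcast id and result number may differ.)
broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)
res0: Array[Int] = Array(1, 2, 3)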
After the broadcast variable is created, it should be used instead of the value v in any
functions run on the cluster, so that v is not shipped to the nodes more than once. In
addition, the object v should not be modified after its broadcast, in order to ensure that
all nodes get the same value of the broadcast variable.
Accumulators
Accumulators are variables that are only "added" to through an associative operation and can therefore be efficiently supported in parallel. They can be used to implement counters (as in MapReduce) or sums. Spark natively supports accumulators of numeric types, and programmers can add support for new types. If accumulators are created with a name, they will be displayed in Spark's UI. This can be useful for understanding the progress of running stages (NOTE: this is not yet supported in Python).
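The accumulator code referred to below is missing from this copy. A minimal sketch consistent with the result shown (1 + 2 + 3 + 4 = 10), using the Spark 1.x accumulator API:

scala> val accum = sc.accumulator(0)
scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)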
If you want to see the output of the above code, use the following command:
scala> accum.value
Output
res2: Int = 10
Spark allows you to perform different operations on numeric data using predefined API methods. Spark's numeric operations are implemented with a streaming algorithm that builds the model one element at a time.
These operations are computed and returned as a StatCounter object by calling the stats() method.
The following is a list of the numeric methods available in StatCounter.
S.No Method & Meaning
1. count(): Number of elements in the RDD.
2. mean(): Average of the elements in the RDD.
3. sum(): Total value of the elements in the RDD.
4. max(): Maximum value among all elements in the RDD.
5. min(): Minimum value among all elements in the RDD.
6. variance(): Variance of the elements.
7. stdev(): Standard deviation of the elements.
If you want to use only one of these methods, you can call the corresponding method directly on the RDD.
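As an illustration (the numeric values are an assumption), stats() and a single method can be called as follows:

scala> val nums = sc.parallelize(Array(1.0, 2.0, 3.0, 4.0))
scala> nums.stats()   // returns a StatCounter with count, mean, stdev, max and min
scala> nums.mean()    // a single numeric method called directly on the RDD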
Experiment Number 7
Visualization using Tableau:
Tableau: Tool Overview, Importing Data, Analyzing with Charts
Analysis operations
Use Data Set: Global Superstore
Q1. Find the customer with the highest overall profit. What is his/her profit
ratio?
Ans:
Step 1: Open the superstoreus2015 Excel data set.
Q2. Which state has the highest Sales (Sum)? What is the total Sales for that
state?
Solution -
Q3. Which customer segment has both the highest order quantity and average discount rate? What is the order quantity and average discount rate for that segment?
Solution -
Step 3 – Click on Measure Values -> Edit Filter. In the filter, select Discount and Quantity -> click Apply and OK.
Q4. Which Product Category has the highest total Sales? Which Product
Category has the worst Profit? Name the Product Category and $ amount for
each.
Ans:
a. Bar Chart displaying total Sales for each Product Category
c. Each Product Category labeled with total Sales and Each Product
Category labeled with Profit
Q5. Use the same visualization created for Question #4. What was the Profit on Technology (Product Category) in Boca Raton (City)?
Final output:
Q6. Which Product Department has the highest Shipping Costs? Name the
Department and cost.
Solution -
Final output:
Q7. Use the same visualization created for Question #6. What was the shipping
cost of Office Supplies for Xerox 1905 in the Home Customer Segment in
Cambridge?
Solution -
Step 5 – Drag and drop Segment (select Home Office) and Product Name (select Xerox 1936) into the Marks area.
Final output -
Preparing Maps
Step 2 – Join the sheets (Orders and People). First drag and drop the Orders table, then click the arrow on Orders and select Open, then drag and drop the People table.
Step 3 - Create a Geographic Hierarchy
1. In the Data pane, right-click the geographic field, Country, and
then select Hierarchy > Create Hierarchy.
3. In the Data pane, drag the State field to the hierarchy and place
it below the Country field.
Step 5 - On the Marks card, click the + icon on the Country field.
Step 6 – To Add color - From Measures, drag Sales to Color on the Marks
card.
Final output -
Steps –
1) Use the same visualization created for Q1.
2) To calculate Profit Ratio, use the formula:
Profit Ratio = Sum([Profit]) / Sum([Sales])
Click Analysis > Create Calculated Field, enter the formula, and click Apply and OK.
Drag and drop this calculation onto Label in the Marks area.
Final output -
Steps –
1) Use the same visualization created in Q2.
2) Drag and drop Product Name into the Marks area.
3) Add a Product Name filter in the Filters area (select the Grip Seal Envelope product name in the window that opens and click OK).
Final output –
Step 3 - Now Drag the profit measure to the color mark in the marks area.
Also drag profit to label in the marks area.
Final output -
Q5. Which state has the worst Gross Profit Ratio on Envelopes in the
Corporate Customer Segment that were Shipped in 2015?
Solution -
Step 1 - In the Data pane, double-click Country. On the Marks card, click the
+ icon on the Country field.
Drag sub-category, segment, order date to marks area
Drag calculated profit ratio to label and tooltip area
Step 2 - Drag Order Date onto the Filter shelf and select the year 2015.
Step 3 - Drag Segment onto the Filter shelf and select Corporate.
Step 4 - Drag Category onto the Filter shelf and select Envelopes.
Final output -
Preparing Reports
Final output -
2) Report showing region-wise, product-wise sales
Solution –
Steps – Drag and drop Region as rows and Sub-Category as columns.
Drag and drop Sales into the Marks area and display it as text.
Final output -
4) What is the percent of total Sales for the ‘Home Office’ Customer
Segment in July of 2014?
Solution –
Step 1 –
Drag and drop Segment as rows.
Drag and drop Sales into the Marks area and display it as text.
Step 2 – Add an Order Date filter. In the window that appears, select Year -> Next -> select July 2014.
Step 3 – Click the arrow on Sales in the Marks area -> Quick Table Calculation -> Percent of Total.
Final Output -
5) Find the top 10 Product Names by Sales within each region. Which
product is ranked #2 in both the Central & West regions in 2015?
Solution –
Step 1 - Drag the "Product Name" dimension from the Data pane to the Rows shelf and Region to the Columns shelf, then add "Order Date" on the Filter shelf and select the "Year" of Order Date as 2015.
Step 2 - Put Region on the Filter shelf and select the "Central" and "West" checkboxes.
Step 3 - Drag the Sales measure to the Text label in the Marks area.
Add "Product Name" on the Filter shelf. Once the filter pop-up is open, select the "Top" tab > By Field > Top 10 by Sum(Sales).
Right-click the aggregated Sales measure, click the arrow sign, and select Quick Table Calculation > Rank.
As the default addressing is Table (Across), change it to Table (Down) (Compute Using -> Table (Down)).
Final output -