Big Data Analytics and Visualization Lab
MCA
(TWO YEARS PATTERN)
SEMESTER - III (CBCS)
Published by : Director,
Institute of Distance and Open Learning,
University of Mumbai,
Vidyanagari,Mumbai - 400 098.
Pig Shell, Pig Data Types, Creating a Pig Data Model, Reading and Storing Data, Pig Operations
Self-Learning Topics:
6 Spark: RDD, Actions and Transformation on RDD, Ways to Create - file, data in memory, other RDD. Lazy Execution, Persisting RDD    2
Self-Learning Topics: Machine Learning Algorithms like
K-Means using Spark.
7 Visualization: Connect to data, Build Charts and Analyze Data, Create Dashboard, Create Stories using Tableau    6
Self-Learning Topics: Tableau using web.
1
SET UP AND CONFIGURATION HADOOP
USING CLOUDERA CREATING A HDFS
SYSTEM WITH MINIMUM 1 NAME NODE
AND 1 DATA NODES HDFS COMMANDS
Unit Structure :
1.1 Objectives
1.2 Prerequisite
1.3 GUI Configuration
1.4 Command Line Configuration
1.5 Summary
1.6 Sample Questions
1.7 References
1.1 OBJECTIVES
The Hadoop Distributed File System (HDFS) stores data in multiple copies and is a cost-effective solution for any business that needs to store its data efficiently. HDFS operations are the key to making that stored data available from remote locations. This chapter describes how to set up HDFS and edit its deployment configuration files.
1.2 PREREQUISITE
Java (JDK 1.8) must be installed on the system. Verify the installation with:
java -version
1.3 GUI CONFIGURATION
Now we need to set Hadoop bin directory and Java bin directory path in
system variable path.
Edit Path in system variable
Click on New and add the bin directory path of Hadoop and Java in it.
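As an alternative to the GUI steps, the same environment variables can be set from an administrator Command Prompt (the install paths below are only examples; adjust them to your machine):
:: HADOOP_HOME follows the extraction path used later in this chapter
setx HADOOP_HOME "C:\Users\hp\Downloads\hadoop-3.1.0\hadoop-3.1.0" /M
:: JAVA_HOME must point to your JDK 1.8 installation (example path)
setx JAVA_HOME "C:\Program Files\Java\jdk1.8.0_202" /M
:: The "Edit Path" step above adds %HADOOP_HOME%\bin and %JAVA_HOME%\bin to the system Path.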
Now we need to edit some files located in the etc/hadoop directory of the folder where we installed Hadoop. The files that need to be edited are listed below.
1. Edit the file core-site.xml in the hadoop directory. Copy this xml property into the configuration in the file:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
2. Edit mapred-site.xml and copy this property in the configuration
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
3. Create a folder 'data' in the hadoop directory.
4. Create two folders named 'namenode' and 'datanode' inside the data folder.
5. Edit the file hdfs-site.xml and add the below property in the configuration.
Note: The paths given in the values should be the paths of the namenode and datanode folders you just created.
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>C:\Users\hp\Downloads\hadoop-3.1.0\hadoop-3.1.0\data\namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>C:\Users\hp\Downloads\hadoop-3.1.0\hadoop-3.1.0\data\datanode</value>
</property>
</configuration>
6. Edit the file yarn-site.xml and add below property in the configuration
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
7. Edit hadoop-env.cmd and replace %JAVA_HOME% with the path of
the java folder where your jdk 1.8 is installed
8. Hadoop needs Windows OS specific files which do not come with the default download of Hadoop.
To include those files, replace the bin folder in hadoop directory with the
bin folder provided in this github link.
https://github.com/s911415/apache-hadoop-3.1.0-winutils
Download it as a zip file. Extract it and copy the bin folder inside it. If you want to keep the old bin folder, rename it to something like bin_old and paste the copied bin folder in that directory.
1.4 COMMAND LINE CONFIGURATION
Starting HDFS
Format the configured HDFS file system and then open the namenode
(HDFS server) and execute the following command.
$ hadoop namenode -format
Start the distributed file system and follow the command listed below to
start the namenode as well as the data nodes in cluster.
$ start-dfs.sh
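If YARN was also configured (see the mapred-site.xml and yarn-site.xml edits earlier), its daemons are started with a separate script, and the jps command can be used to confirm that the daemons are running:
$ start-yarn.sh
$ jps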
Read & Write Operations in HDFS
You can execute almost all operations on Hadoop Distributed File
Systems that can be executed on the local file system. You can execute
various reading, writing operations such as creating a directory, providing
permissions, copying files, updating files, deleting, etc. You can add access
rights and browse the file system to get the cluster information like the
number of dead nodes, live nodes, spaces used, etc.
HDFS Operations to Read the file
To read any file from the HDFS, you have to interact with the NameNode
as it stores the metadata about the DataNodes. The user gets a token from
the NameNode and that specifies the address where the data is stored.
You can put a read request to NameNode for a particular block location
through distributed file systems. The NameNode will then check your
privilege to access the DataNode and allows you to read the address block
if the access is valid.
$ hadoop fs -cat <file>
HDFS Operations to Write a File
Similar to the read operation, the HDFS write operation is used to write a file at a particular address through the NameNode. The NameNode provides the address of the slave (DataNode) where the client can write or add data. After writing on the block location, the slave replicates that block to other slave locations according to the replication factor. An acknowledgement is then sent back to the client.
The process for accessing the NameNode is similar to that of a read operation. Below is the HDFS write command:
$ hadoop fs -put <local file> <HDFS path>
The contents of a directory in HDFS can then be listed with:
bin/hdfs dfs -ls <path>
Listing Files in HDFS
You can find the list of files in a directory and the status of a file using the 'ls' command in the terminal. 'ls' can be passed a directory or a filename as an argument, and the entries are displayed as follows:
$ $HADOOP_HOME/bin/hadoop fs -ls <args>
Inserting Data into HDFS
Below mentioned steps are followed to insert the required file in the Hadoop
file system.
Step1: Create an input directory
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/input
Step 2: Use the put command to transfer and store the data file from the local system to HDFS using the following command in the terminal.
$ $HADOOP_HOME/bin/hadoop fs -put /home/intellipaat.txt /user/input
Step3: Verify the file using ls command.
$ $HADOOP_HOME/bin/hadoop fs -ls /user/input
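To read the stored file back, or to copy it from HDFS to the local file system, the cat and get commands can be used (the local destination path here is only an example):
$ $HADOOP_HOME/bin/hadoop fs -cat /user/input/intellipaat.txt
$ $HADOOP_HOME/bin/hadoop fs -get /user/input/intellipaat.txt /home/hadoop_files/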
Installation of Hadoop
Hadoop should be downloaded on the master server using the following procedure.
# mkdir /opt/hadoop
# cd /opt/hadoop/
# wget http://apache.mesi.com.ar/hadoop/common/hadoop-1.2.1/hadoop-1.2.0.tar.gz
# tar -xzf hadoop-1.2.0.tar.gz
# mv hadoop-1.2.0 hadoop
# chown -R hadoop /opt/hadoop
# cd /opt/hadoop/hadoop/
Configuring Hadoop
The core-site.xml file on the Hadoop server must be configured and edited wherever required, as shown below.
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://hadoop-master:9000/</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
</configuration>
The hdfs-site.xml file should be edited as shown below.
<configuration>
<property>
<name>dfs.data.dir</name>
<value>/opt/hadoop/hadoop/dfs/name/data</value>
<final>true</final>
</property>
<property>
<name>dfs.name.dir</name>
<value>/opt/hadoop/hadoop/dfs/name</value>
<final>true</final>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
$ cd /opt/hadoop/hadoop
Master Node Configuration
$ vi etc/hadoop/masters
hadoop-master
Slave Node Configuration
$ vi etc/hadoop/slaves
hadoop-slave-1
hadoop-slave-2
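After the masters and slaves files are configured, the HDFS file system is typically formatted and the Hadoop 1.x daemons started on the master; a representative sequence (paths follow the /opt/hadoop/hadoop layout used above):
$ cd /opt/hadoop/hadoop
$ bin/hadoop namenode -format
$ bin/start-all.sh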
Start the DataNode on New Node
The DataNode daemon should be started manually using the $HADOOP_HOME/bin/hadoop-daemon.sh script. The new node will contact the master (NameNode) automatically and join the cluster. The new node should also be added to the slaves file in the configuration on the master server, so that it is recognised by the script-based commands.
Login to new node
su hadoop or ssh -X [email protected]
Start HDFS on the newly added slave node using the following command.
./bin/hadoop-daemon.sh start datanode
jps command output must be checked on a new node.
$ jps
7141 DataNode
10312 Jps
Removing a DataNode
A node can be removed from a cluster while it is running, without any worries of data loss. HDFS provides a decommissioning feature which ensures that removing a node is performed safely.
Step 1
Log in to the Hadoop user account on the master machine where Hadoop is installed.
$ su hadoop
Step 2
Before starting the cluster, an exclude file must be configured: add a key named dfs.hosts.exclude to our $HADOOP_HOME/etc/hadoop/hdfs-site.xml file. The value associated with this key is the full path to a file on the NameNode's local file system which contains the list of machines that are not permitted to connect to HDFS, as follows.
<property>
<name>dfs.hosts.exclude</name>
<value>/home/hadoop/hadoop-1.2.1/hdfs_exclude.txt</value>
<description>DFS exclude</description>
</property>
Step 3
Determine the hosts to decommission.
Add each machine to be decommissioned to the hdfs_exclude.txt file, one domain name per line; this will prevent them from connecting to the NameNode. For example:
slave2.in
Step 4
Force a configuration reload.
Run the command "$HADOOP_HOME/bin/hadoop dfsadmin -refreshNodes".
$ $HADOOP_HOME/bin/hadoop dfsadmin -refreshNodes
This forces the NameNode to re-read its configuration, including the newly updated excludes file. Nodes will be decommissioned over a period of time, allowing each node's blocks to be replicated onto machines which are scheduled to remain active. Check the jps command output on slave2.in; once the work is done, the DataNode process will shut down automatically.
Step 5
Shutdown nodes.
The decommissioned hardware can be carefully shut down for maintenance
purpose after the decommission process has been finished.
$ $HADOOP_HOME/bin/hadoop dfsadmin -report
Step 6
Edit the excludes file again.
Once the machines have been decommissioned, they can be removed from the 'excludes' file. Running "$HADOOP_HOME/bin/hadoop dfsadmin -refreshNodes" again will read the excludes file back into the NameNode, allowing the DataNodes to rejoin the cluster after the maintenance has been completed, or whenever additional capacity is needed in the cluster.
To run/shutdown tasktracker
$ $HADOOP_HOME/bin/hadoop-daemon.sh stop tasktracker
$ $HADOOP_HOME/bin/hadoop-daemon.sh start tasktracker
Add a new node with the following steps
1) Take a new system and create a new username and password on it.
2) Install SSH and set up SSH connections with the master node.
3) Add the SSH public RSA key (id_rsa.pub) of the new node to the authorized_keys file.
4) Add the new DataNode's hostname, IP address and other details in /etc/hosts and in the slaves file, e.g. 192.168.1.102 slave3.in slave3
5) Start the DataNode on the New Node.
6) Login to the new node with a command like su hadoop or ssh -X [email protected]
7) Start HDFS on the newly added slave node by using the following command: ./bin/hadoop-daemon.sh start datanode
8) Check the output of jps command on a new node.
1.5 SUMMARY
1.7 REFERENCES
2
EXPERIMENT 1
AIM
Write a simple program for Word Count Using Map Reduce Programming
OBJECTIVE
THEORY
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
}
}
The above program consists of three classes:
● Driver class (the public static void main method; this is the entry point).
● The Map class which extends the public class
Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT> and
implements the Map function.
● The Reduce class which extends the public class
Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT> and
implements the Reduce function.
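For reference, a minimal sketch of such a word-count program is given below (a generic listing with illustrative class names, not necessarily identical to the jar built in the next step):
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map class: emits (word, 1) for every word in each input line
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce class: sums the counts received for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver class: configures and submits the job; args[0] is the input path, args[1] the output path
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}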
6. Make a jar file
Right Click on Project> Export> Select export destination as Jar
File > next> Finish.
To move this into Hadoop directly, open the terminal and enter the
following commands:
[training@localhost ~]$ hadoop fs -put wordcountFile wordCountFile
9. Open the result:
[training@localhost ~]$ hadoop fs -ls MRDir1
Found 3 items
-rw-r--r-- 1 training supergroup 0 2022-02-23 03:36
/user/training/MRDir1/_SUCCESS
drwxr-xr-x - training supergroup 0 2022-02-23 03:36
/user/training/MRDir1/_logs
-rw-r--r-- 1 training supergroup 20 2022-02-23 03:36
/user/training/MRDir1/part-r-00000
[training@localhost ~]$ hadoop fs -cat MRDir1/part-r-00000
BUS 7
CAR 4
TRAIN 6
EXPERIMENT 2
AIM :
OBJECTIVE
THEORY:
import java.io.IOException;
job.setReducerClass(Reducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, input);
FileOutputFormat.setOutputPath(job, output);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
EXPERIMENT 3
AIM :
OBJECTIVE
THEORY
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
EXPERIMENT 4
AIM :
OBJECTIVE
THEORY:
job.setJarByClass(GroupSum.class);
job.setMapperClass(MyMapper.class);
job.setCombinerClass(MyReducer.class);
job.setReducerClass(MyReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, input);
FileOutputFormat.setOutputPath(job, output);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
EXPERIMENT 5
AIM
OBJECTIVE
THEORY
The mapper and reducer code for the matrix multiplication job is given below:
public class Map
extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text,
Text, Text> {
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
Configuration conf = context.getConfiguration();
int m = Integer.parseInt(conf.get("m"));
int p = Integer.parseInt(conf.get("p"));
String line = value.toString();
// (M, i, j, Mij);
String[] indicesAndValue = line.split(",");
Text outputKey = new Text();
Text outputValue = new Text();
if (indicesAndValue[0].equals("M")) {
for (int k = 0; k < p; k++) {
outputKey.set(indicesAndValue[1] + "," + k);
// outputKey.set(i,k);
outputValue.set(indicesAndValue[0] + "," +
indicesAndValue[2]
+ "," + indicesAndValue[3]);
// outputValue.set(M,j,Mij);
context.write(outputKey, outputValue);
}
} else {
// (N, j, k, Njk);
for (int i = 0; i < m; i++) {
outputKey.set(i + "," + indicesAndValue[2]);
outputValue.set("N," + indicesAndValue[1] + ","
+ indicesAndValue[3]);
context.write(outputKey, outputValue);
}
}
    }
}
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.waitForCompletion(true);
}
}
Run the above code in the same way as the first experiment.
www.geeksforgeeks.org
https://github.com/wosiu/Hadoop-map-reduce-relational-algebra/blob/master/src/main/java/WordCount.java
3
MONGO DB
Unit Structure :
3.0 Objectives
3.1 Introduction
3.1.1 How it Works
3.1.2 How MongoDB is different from RDBMS
3.1.3 Features of MongoDB
3.1.4 Advantages of MongoDB
3.1.5 Disadvantages of MongoDB
3.2 MongoDB Environment
3.2.1 Install MongoDB in Windows
3.3 Creating a Database Using MongoDB
3.4 Creating Collections in MongoDB
3.5 CRUD Document
3.5.1 Create Operations
3.5.2 Read Operations
3.5.3 Update Operations
3.5.4 Delete Operations
3.6 Let’s Sum up
3.7 Website references
3.0 OBJECTIVES
3.1 INTRODUCTION
● We now have documents within the collection. These documents hold the data we want to store in the MongoDB database, and a single collection can contain several documents. Collections are schema-less, which means that each document does not have to be identical to the others.
● The fields are used to build the documents. Fields in documents are
key-value pairs, similar to columns in a relational database. The fields'
values can be of any BSON data type, such as double, string, Boolean,
and so on.
● The data in MongoDB is saved in the form of BSON documents.
BSON stands for Binary representation of JSON documents in this context. To put it another way, the MongoDB server translates JSON data into a binary format called BSON, which can then be stored and searched more effectively.
Mongo DB : RDBMS
Database : Database
Collection : Table
Document : Row / Tuple
Field : Column
Embedded documents : Table joins
3.1.5 Disadvantages of MongoDB
● You may not keep more than 16MB of data in your documents.
● The nesting of data in BSON is likewise limited; you can only nest
data up to 100 levels.
3.2 MONGODB ENVIRONMENT
To get started with MongoDB, you have to install it in your system. You need to find and download the latest version of MongoDB, which is compatible with your computer system. You can use this link (http://www.mongodb.org/downloads) and follow the instructions to install MongoDB on your PC.
The process of setting up MongoDB in different operating systems is also
different, here various installation steps have been mentioned and according
to your convenience, you can select it and follow it.
3.2.1 Install Mongo DB in Windows
The website of MongoDB provides all the installation instructions, and
MongoDB is supported by Windows, Linux as well as Mac OS.
It is to be noted that, MongoDB will not run in Windows XP; so you need
to install higher versions of windows to use this database.
Once you visit the link (http://www.mongodb.org/downloads), click the
download button.
1. Once the download is complete, double click this setup file to install
it. Follow the steps:
3. Now, choose Complete to install MongoDB completely.
5. The setup system will also prompt you to install MongoDB Compass,
which is MongoDB official graphical user interface (GUI). You can
tick the checkbox to install that as well.
Once the installation is done completely, you need to start MongoDB and
to do so follow the process:
1. Open Command Prompt.
2. Navigate to the MongoDB bin directory: cd C:\Program Files\MongoDB\Server\4.0\bin
3. Now simply type the command mongod to run the server.
In this way, you can start your MongoDB database. Now, to run the MongoDB command-line client, you have to use the command:
C:\Program Files\MongoDB\Server\4.0\bin>mongo.exe
3.3 CREATING A DATABASE USING MONGODB
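A database is created, or switched to, with the use command. For example, to create a database named mydb:
>use mydb
switched to db mydb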
If you want to check your databases list, use the command show dbs.
>show dbs
local 0.78125GB
test 0.23012GB
Your created database (mydb) is not present in the list. To display a database, you need to insert at least one document into it:
>db.movie.insert({"name":"tutorials point"})
>show dbs
local 0.78125GB
mydb 0.23012GB
test 0.23012GB
3.4 CREATING COLLECTIONS IN MONGODB
Collections are created with the createCollection() method:
db.createCollection(name, options)
The options parameter is optional, so you need to specify only the name of the collection. Following is the list of options you can use –
While inserting the document, MongoDB first checks size field of capped
collection, then it checks max field.
Examples
Basic syntax of createCollection() method without options is as follows −
>use test
switched to db test
>db.createCollection("mycollection")
{ "ok" : 1 }
>
You can check the created collection by using the command show
collections.
>show collections
mycollection
system.indexes
The following example shows the syntax of createCollection() method
with few important options −
> db.createCollection("mycol", { capped : true, autoIndexID : true, size : 6142800, max : 10000 } )
{
"ok" : 0,
"errmsg" : "BSON field 'create.autoIndexID' is an unknown field.",
"code" : 40415,
"codeName" : "Location40415"
}
>
In MongoDB, you don't need to create collection. MongoDB creates
collection automatically, when you insert some document.
>db.tutorialspoint.insert({"name" : "tutorialspoint"})
WriteResult({ "nInserted" : 1 })
>show collections
mycol
mycollection
system.indexes
tutorialspoint
>
The drop() Method
MongoDB's db.collection.drop() is used to drop a collection from the
database.
Syntax
Basic syntax of drop() command is as follows −
db.COLLECTION_NAME.drop()
Example
First, check the available collections into your database mydb.
>use mydb
switched to db mydb
>show collections
mycol
mycollection
system.indexes
tutorialspoint
>
Now drop the collection with the name mycollection.
>db.mycollection.drop()
true
>
Again check the list of collections into database.
>show collections
mycol
system.indexes
tutorialspoint
>
drop() method will return true, if the selected collection is dropped
successfully, otherwise it will return false.
3.5 CRUD DOCUMENT
As we know, we can use MongoDB for a variety of purposes such as
building an application (including web and mobile), data analysis, or as an
administrator of a MongoDB database. In all of these cases, we must
interact with the MongoDB server to perform specific operations such as
entering new data into the application, updating data in the application,
deleting data from the application, and reading the data of the application.
MongoDB provides a set of simple yet fundamental operations known as
CRUD operations that will allow you to quickly interact with the MongoDB
server.
3.5.1 Create Operations — these operations are used to insert or add new
documents to the collection. If a collection does not exist, a new collection
will be created in the database. MongoDB provides the following methods
for performing and creating operations:
Method Description
insertOne()
As the name suggests, insertOne() allows you to insert one document into the
collection. For this example, we're going to work with a collection called RecordsDB. We can insert a single entry into our collection by calling the insertOne() method on RecordsDB. We then provide the information we want to insert in the form of key-value pairs, establishing the schema.
Example:
db.RecordsDB.insertOne({
name: "Marsh",
age: "6 years",
species: "Dog",
ownerAddress: "380 W. Fir Ave",
chipped: true
})
> db.RecordsDB.insertOne({
... name: "Marsh",
... age: "6 years",
... species: "Dog",
... ownerAddress: "380 W. Fir Ave",
... chipped: true
... })
{
"acknowledged" : true,
"insertedId" : ObjectId("5fd989674e6b9ceb8665c57d")
}
insertMany()
It’s possible to insert multiple items at one time by calling
the insertMany() method on the desired collection. In this case, we pass
multiple items into our chosen collection (RecordsDB) and separate them
by commas. Within the parentheses, we use brackets to indicate that we
are passing in a list of multiple entries. This is commonly referred to as a
nested method.
Example:
db.RecordsDB.insertMany([{
name: "Marsh",
age: "6 years",
species: "Dog",
ownerAddress: "380 W. Fir Ave",
chipped: true},
{name: "Kitana",
age: "4 years",
species: "Cat",
ownerAddress: "521 E. Cortland",
chipped: true}])
3.5.2 Read Operations
find()
In order to get all the documents from a collection, we can simply use the
find() method on our chosen collection. Executing just the find() method
with no arguments will return all records currently in the collection.
db.RecordsDB.find()
findOne()
In order to get one document that satisfies the search criteria, we can simply
use the findOne() method on our chosen collection. If multiple documents
satisfy the query, this method returns the first document according to the
natural order which reflects the order of documents on the disk. If no
documents satisfy the search criteria, the function returns null. The function
takes the following form of syntax.
db.{collection}.findOne({query}, {projection})
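For example, to fetch a single document from the RecordsDB collection used earlier (the filter value follows the insertOne() example above):
db.RecordsDB.findOne({ name: "Marsh" })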
3.5.3 Update Operations
Update operations, like create operations, work on a single collection and
are atomic at the document level. Filters and criteria are used to choose the
documents to be updated during an update procedure.
You should be cautious while altering documents since alterations are
permanent and cannot be reversed. This also applies to remove operations.
There are three techniques for updating documents in MongoDB CRUD:
● db.collection.updateOne()
● db.collection.updateMany()
● db.collection.replaceOne()
updateOne()
With an update procedure, we may edit a single document and update an
existing record. To do this, we use the updateOne() function on a specified
collection, in this case "RecordsDB." To update a document, we pass two
arguments to the method: an update filter and an update action.
The update filter specifies which items should be updated, and the update
action specifies how those items should be updated. We start with the update
filter. Then we utilise the "$set" key and supply the values for the fields we
wish to change. This function updates the first record that matches the
specified filter.
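A minimal sketch using the RecordsDB collection from the earlier examples (the new address value is purely illustrative):
db.RecordsDB.updateOne(
    { name: "Marsh" },
    { $set: { ownerAddress: "451 W. Coffee St." } }
)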
updateMany()
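updateMany() applies the same update to every document that matches the filter. A minimal sketch, again using the RecordsDB collection (the filter and field values are illustrative):
db.RecordsDB.updateMany(
    { species: "Dog" },
    { $set: { chipped: true } }
)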
3.6 LET'S SUM UP
● The data model that MongoDB follows is a highly elastic one that lets you combine and store data of multivariate types without having to compromise on the powerful indexing options, data access, and validation rules.
● A group of database documents can be called a collection. The RDBMS equivalent to a collection is a table. The entire collection exists within a single database. There are no schemas when it comes to collections. Inside the collection, various documents can have varied fields, but mostly the documents within a collection are meant for the same purpose or for serving the same end goal.
3.7 WEBSITE REFERENCES
1. https://www.mongodb.com
2. https://www.tutorialspoint.com
3. https://www.w3schools.com
4. https://www.educba.com
4
HIVE
Unit Structure :
4.0 Objectives
4.1 Introduction
4.2 Summary
4.3 References
4.4 Unit End Exercises
4.0 OBJECTIVE
Hive allows users to read, write, and manage petabytes of data using
SQL. Hive is built on top of Apache Hadoop, which is an open-source
framework used to efficiently store and process large datasets. As a result,
Hive is closely integrated with Hadoop, and is designed to work quickly
on petabytes of data.
4.1 INTRODUCTION
Here, IF NOT EXISTS is an optional clause, which notifies the user that a
database with the same name already exists. We can use SCHEMA in place
of DATABASE in this command. The following query is executed to create
a database named userdb:
hive> CREATE DATABASE IF NOT EXISTS userdb;
or
hive> CREATE SCHEMA userdb;
The following query is used to verify the list of databases:
hive> SHOW DATABASES;
default
userdb
JDBC Program
The JDBC program to create a database is given below.
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveCreateDb {
   // Hive JDBC driver class (the legacy HiveServer1 driver is assumed here)
   private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";

   public static void main(String[] args) throws SQLException, ClassNotFoundException {
      // Register the driver
      Class.forName(driverName);
      // get connection
      Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
      Statement stmt = con.createStatement();
      stmt.execute("CREATE DATABASE userdb");
      System.out.println("Database userdb created successfully.");
      con.close();
   }
}
Save the program in a file named HiveCreateDb.java. The following commands are used to compile and execute this program.
$ javac HiveCreateDb.java
$ java HiveCreateDb
Output:
Database userdb created successfully.
Hive - Partitioning
Hive organizes tables into partitions. It is a way of dividing a table into
related parts based on the values of partitioned columns such as date, city,
and department. Using partition, it is easy to query a portion of the data.
Tables or partitions are sub-divided into buckets, to provide extra structure
to the data that may be used for more efficient querying. Bucketing works
based on the value of hash function of some column of a table.
For example, a table named Tab1 contains employee data such as id, name,
dept, and yoj (i.e., year of joining). Suppose you need to retrieve the details
of all employees who joined in 2012. A query searches the whole table for
the required information. However, if you partition the employee data with
the year and store it in a separate file, it reduces the query processing time.
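In HiveQL, such a table is declared with a PARTITIONED BY clause; a minimal sketch using the columns of Tab1 (the storage and row-format clauses are omitted):
CREATE TABLE tab1 (id INT, name STRING, dept STRING)
PARTITIONED BY (yoj INT);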
The following example shows how to partition a file and its data:
The following file contains employeedata table.
/tab1/employeedata/file1
id, name, dept, yoj
1, gopal, TP, 2012
2, kiran, HR, 2012
3, kaleel,SC, 2013
4, Prasanth, SC, 2013
The above data is partitioned into two files using year.
/tab1/employeedata/2012/file2
1, gopal, TP, 2012
2, kiran, HR, 2012
/tab1/employeedata/2013/file3
3, kaleel,SC, 2013
4, Prasanth, SC, 2013
Adding a Partition
We can add partitions to a table by altering the table. Let us assume we have
a table called employee with fields such as Id, Name, Salary, Designation,
Dept, and yoj.
Syntax:
ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec [LOCATION 'location1'] partition_spec [LOCATION 'location2'] ...;
partition_spec:
: (p_column = p_col_value, p_column = p_col_value, ...)
Renaming a Partition
The syntax of this command is as follows.
ALTER TABLE table_name PARTITION partition_spec RENAME TO PARTITION partition_spec;
Dropping a Partition
The following syntax is used to drop a partition:
ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec, PARTITION partition_spec, ...;
4.2 SUMMARY
Built-in Functions
ucase(string A) : string. Same as upper(string A); returns the string resulting from converting all characters of A to upper case.
Example
The following queries demonstrate some built-in functions:
round() function
hive> SELECT round(2.6) from temp;
On successful execution of the query, you get to see the following response:
3.0
ceil() function
hive> SELECT ceil(2.6) from temp;
On successful execution of the query, you get to see the following response:
3.0
Aggregate Functions
Hive supports the following built-in aggregate functions. The usage of
these functions is as same as the SQL aggregate functions.
Relational Operators
These operators are used to compare two operands. The following table
describes the relational operators available in Hive:
Let us assume the employee table is composed of fields named Id, Name,
Salary, Designation, and Dept as shown below. Generate a query to retrieve
the employee details whose Id is 1205.
+-----+--------------+--------+---------------------------+------+
| Id | Name | Salary | Designation | Dept |
+-----+--------------+--------+---------------------------+------+
|1201 | Gopal | 45000 | Technical manager | TP |
|1202 | Manisha | 45000 | Proofreader | PR |
|1203 | Masthanvali | 40000 | Technical writer | TP |
|1204 | Krian | 40000 | Hr Admin | HR |
|1205 | Kranthi | 30000 | Op Admin | Admin|
+-----+--------------+--------+---------------------------+------+
The following query is executed to retrieve the employee details using the
above table:
hive> SELECT * FROM employee WHERE Id=1205;
On successful execution of query, you get to see the following response:
+-----+-----------+-----------+----------------------------------+
| ID | Name | Salary | Designation | Dept |
+-----+---------------+-------+----------------------------------+
|1205 | Kranthi | 30000 | Op Admin | Admin |
+-----+-----------+-----------+----------------------------------+
The following query is executed to retrieve the employee details whose
salary is more than or equal to Rs 40000.
hive> SELECT * FROM employee WHERE Salary>=40000;
On successful execution of query, you get to see the following response:
+-----+------------+--------+----------------------------+------+
| ID | Name | Salary | Designation | Dept |
+-----+------------+--------+----------------------------+------+
|1201 | Gopal | 45000 | Technical manager | TP |
|1202 | Manisha | 45000 | Proofreader | PR |
|1203 | Masthanvali| 40000 | Technical writer | TP |
|1204 | Krian | 40000 | Hr Admin | HR |
+-----+------------+--------+----------------------------+------+
Arithmetic Operators
These operators support various common arithmetic operations on the
operands. All of them return number types. The following table describes
the arithmetic operators available in Hive:
Example
The following query adds two numbers, 20 and 30.
hive> SELECT 20+30 ADD FROM temp;
On successful execution of the query, you get to see the following response:
+--------+
| ADD |
+--------+
| 50 |
+--------+
Logical Operators
The operators are logical expressions. All of them return either TRUE or
FALSE.
A || B boolean Same as A OR B.
Example
The following query is used to retrieve employee details whose Department
is TP and Salary is more than Rs 40000.
hive> SELECT * FROM employee WHERE Salary>40000 && Dept='TP';
On successful execution of the query, you get to see the following response:
+------+--------------+-------------+-------------------+--------+
| ID | Name | Salary | Designation | Dept |
+------+--------------+-------------+-------------------+--------+
|1201 | Gopal | 45000 | Technical manager | TP |
+------+--------------+-------------+-------------------+--------+
Complex Operators
These operators provide an expression to access the elements of Complex
Types.
A[n] (A is an Array and n is an int) : Returns the nth element in the array A; the first element has index 0.
M[key] (M is a Map and key has type K) : Returns the value corresponding to the key in the map.
S.x (S is a struct) : Returns the x field of S.
Creating a View
You can create a view at the time of executing a SELECT statement. The syntax is as follows:
CREATE VIEW [IF NOT EXISTS] view_name [(column_name [COMMENT column_comment], ...)] [COMMENT table_comment] AS SELECT ...;
Example
Let us take an example for view. Assume employee table as given below,
with the fields Id, Name, Salary, Designation, and Dept. Generate a query
to retrieve the employee details who earn a salary of more than Rs 30000.
We store the result in a view named emp_30000.
+------+--------------+-------------+-------------------+--------+
| ID | Name | Salary | Designation | Dept |
+------+--------------+-------------+-------------------+--------+
|1201 | Gopal | 45000 | Technical manager | TP |
|1202 | Manisha | 45000 | Proofreader | PR |
|1203 | Masthanvali | 40000 | Technical writer | TP |
|1204 | Krian | 40000 | Hr Admin | HR |
|1205 | Kranthi | 30000 | Op Admin | Admin |
+------+--------------+-------------+-------------------+--------+
The following query retrieves the employee details using the above scenario:
hive> CREATE VIEW emp_30000 AS
SELECT * FROM employee
WHERE salary>30000;
Dropping a View
Use the following syntax to drop a view:
DROP VIEW view_name
The following query drops a view named as emp_30000:
hive> DROP VIEW emp_30000;
Creating an Index
An Index is nothing but a pointer on a particular column of a table. Creating
an index means creating a pointer on a particular column of a table. Its
syntax is as follows:
CREATE INDEX index_name
ON TABLE base_table_name (col_name, ...)
AS 'index.handler.class.name'
[WITH DEFERRED REBUILD]
[IDXPROPERTIES (property_name=property_value, ...)]
[IN TABLE index_table_name]
[PARTITIONED BY (col_name, ...)]
[
[ ROW FORMAT ...] STORED AS ...
| STORED BY ...
]
[LOCATION hdfs_path]
[TBLPROPERTIES (...)]
Example
Let us take an example for index. Use the same employee table that we have
used earlier with the fields Id, Name, Salary, Designation, and Dept. Create
an index named index_salary on the salary column of the employee table.
The following query creates an index:
hive> CREATE INDEX index_salary ON TABLE employee(salary)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler';
It is a pointer to the salary column. If the column is modified, the changes
are stored using an index value.
Dropping an Index
The following syntax is used to drop an index:
DROP INDEX <index_name> ON <table_name>
The following query drops an index named index_salary:
hive> DROP INDEX index_salary ON employee;
4.3 REFERENCES
5
PIG
Unit Structure :
5.0 Objectives
5.1 Introduction
5.2 Summary
5.3 References
5.4 Unit End Exercises
5.0 OBJECTIVE
5.1 INTRODUCTION
Where Should Pig be Used?
Pig can be used under following scenarios:
● When data loads are time sensitive.
● When processing various data sources.
● When analytical insights are required through sampling.
Pig Latin – Basics
Pig Latin is the language used to analyze data in Hadoop using Apache Pig.
In this chapter, we are going to discuss the basics of Pig Latin such as Pig
Latin statements, data types, general and relational operators, and Pig Latin
UDF’s.
Pig Latin – Data Model
As discussed in the previous chapters, the data model of Pig is fully nested.
A Relation is the outermost structure of the Pig Latin data model. And it is
a bag where −
● A bag is a collection of tuples.
● A tuple is an ordered set of fields.
● A field is a piece of data.
Pig Latin – Statements
While processing data using Pig Latin, statements are the basic constructs.
● These statements work with relations. They
include expressions and schemas.
● Every statement ends with a semicolon (;).
● We will perform various operations using operators provided by Pig
Latin, through statements.
● Except LOAD and STORE, while performing all other operations, Pig
Latin statements take a relation as input and produce another relation
as output.
● As soon as you enter a Load statement in the Grunt shell, its semantic
checking will be carried out. To see the contents of the schema, you
need to use the Dump operator. Only after performing
the dump operation, the MapReduce job for loading the data into the
file system will be carried out.
Example
Given below is a Pig Latin statement, which loads data to Apache Pig.
grunt> Student_data = LOAD 'student_data.txt' USING PigStorage(',') as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );
Null Values
Values for all the above data types can be NULL. Apache Pig treats null
values in a similar way as SQL does.
A null can be an unknown value or a non-existent value. It is used as a
placeholder for optional values. These nulls can occur naturally or can be
the result of an operation.
Pig Latin – Arithmetic Operators
The following table describes the arithmetic operators of Pig Latin. Suppose
a = 10 and b = 20.
?: (Bincond) : Evaluates the Boolean operators. It has three operands, as shown below:
variable x = (expression) ? value1 if true : value2 if false.
Example: b = (a == 1) ? 20 : 30;
If a = 1 the value of b is 20.
If a != 1 the value of b is 30.
<= (Less than or equal to) : Checks if the value of the left operand is less than or equal to the value of the right operand. If yes, then the condition becomes true. Example: (a <= b) is true.
Operator : Description
Filtering
FOREACH, GENERATE : To generate data transformations based on columns of data.
Sorting
Diagnostic Operators
Apache Pig - Grunt Shell
After invoking the Grunt shell, you can run your Pig scripts in the shell. In
addition to that, there are certain useful shell and utility commands provided
by the Grunt shell. This chapter explains the shell and utility commands
provided by the Grunt shell.
Shell Commands
The Grunt shell of Apache Pig is mainly used to write Pig Latin scripts.
Prior to that, we can invoke any shell commands using sh and fs.
sh Command
Using sh command, we can invoke any shell commands from the Grunt
shell. Using sh command from the Grunt shell, we cannot execute the
commands that are a part of the shell environment (ex − cd).
Syntax
Given below is the syntax of sh command.
grunt> sh shell command parameters
Example
We can invoke the ls command of Linux shell from the Grunt shell using
the sh option as shown below. In this example, it lists out the files in
the /pig/bin/ directory.
grunt> sh ls
pig
pig_1444799121955.log
pig.cmd
pig.py
fs Command
Using the fs command, we can invoke any FsShell commands from the
Grunt shell.
Syntax
Given below is the syntax of fs command.
grunt> fs File System command parameters
Example
We can invoke the ls command of HDFS from the Grunt shell using fs
command. In the following example, it lists the files in the HDFS root
directory.
grunt> fs -ls
Found 3 items
drwxrwxrwx - Hadoop supergroup 0 2015-09-08 14:13 Hbase
drwxr-xr-x - Hadoop supergroup 0 2015-09-09 14:52 seqgen_data
drwxr-xr-x - Hadoop supergroup 0 2015-09-08 11:30 twitter_data
In the same way, we can invoke all the other file system shell commands from the Grunt shell using the fs command.
Utility Commands
The Grunt shell provides a set of utility commands. These include utility
commands such as clear, help, history, quit, and set; and commands such
as exec, kill, and run to control Pig from the Grunt shell. Given below is
the description of the utility commands provided by the Grunt shell.
clear Command
The clear command is used to clear the screen of the Grunt shell.
Syntax
You can clear the screen of the grunt shell using the clear command as
shown below.
grunt> clear
help Command
The help command gives you a list of Pig commands or Pig properties.
Usage
You can get a list of Pig commands using the help command as shown
below.
grunt> help
Commands: <pig latin statement>; - See the PigLatin manual for details:
http://hadoop.apache.org/pig
history Command
This command displays a list of the statements used/executed so far since the Grunt shell was invoked.
Assume we have executed three statements since opening the Grunt shell.
Then, using the history command will produce the following output.
grunt> history
customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING
PigStorage(',');
orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING
PigStorage(',');
student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING
PigStorage(',');
set Command
The set command is used to show/assign values to keys used in Pig.
Usage
Using this command, you can set values to the following keys.
default_parallel : You can set the number of reducers for a map job by passing any whole number as a value to this key.
job.name : You can set the Job name to the required job by passing a string value to this key.
job.priority : You can set the job priority to a job by passing one of the following values to this key −
● very_low
● low
● normal
● high
● very_high
stream.skippath : For streaming, you can set the path from where the data is not to be transferred, by passing the desired path in the form of a string to this key.
quit Command
You can quit from the Grunt shell using this command.
Usage
Quit from the Grunt shell as shown below.
grunt> quit
Let us now take a look at the commands using which you can control
Apache Pig from the Grunt shell.
exec Command
Using the exec command, we can execute Pig scripts from the Grunt shell.
Syntax
Given below is the syntax of the utility command exec.
grunt> exec [–param param_name = param_value] [–param_file
file_name] [script]
Example
Let us assume there is a file named student.txt in the /pig_data/ directory
of HDFS with the following content.
Student.txt
001,Rajiv,Hyderabad
002,siddarth,Kolkata
003,Rajesh,Delhi
And, assume we have a script file named sample_script.pig in
the /pig_data/ directory of HDFS with the following content.
Sample_script.pig
student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING
PigStorage(',')
as (id:int,name:chararray,city:chararray);
Dump student;
Now, let us execute the above script from the Grunt shell using
the exec command as shown below.
grunt> exec /sample_script.pig
Output
The exec command executes the script in the sample_script.pig. As
directed in the script, it loads the student.txt file into Pig and gives you the
result of the Dump operator displaying the following content.
(1,Rajiv,Hyderabad)
(2,siddarth,Kolkata)
(3,Rajesh,Delhi)
kill Command
You can kill a job from the Grunt shell using this command.
Syntax
Given below is the syntax of the kill command.
grunt> kill JobId
Example
Suppose there is a running Pig job having id Id_0055, you can kill it from
the Grunt shell using the kill command, as shown below.
grunt> kill Id_0055
run Command
You can run a Pig script from the Grunt shell using the run command
Syntax
Given below is the syntax of the run command.
grunt> run [–param param_name = param_value] [–param_file file_name]
script
Example
Let us assume there is a file named student.txt and a script file named sample_script.pig in the /pig_data/ directory of HDFS with the following content.
Student.txt
001,Rajiv,Hyderabad
002,siddarth,Kolkata
003,Rajesh,Delhi
Sample_script.pig
student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING PigStorage(',') as (id:int,name:chararray,city:chararray);
Now, let us run the above script from the Grunt shell using the run command as shown below.
grunt> run /sample_script.pig
You can see the output of the script using the Dump operator as shown
below.
grunt> Dump;
(1,Rajiv,Hyderabad)
(2,siddarth,Kolkata)
(3,Rajesh,Delhi)
Note − The difference between exec and the run command is that if we
use run, the statements from the script are available in the command
history.
5.2 SUMMARY
Internally, Apache Pig converts these scripts into a series of MapReduce jobs, and thus, it makes the programmer's job easy. The components of the Apache Pig architecture are described below.
The logical plan (DAG) is passed to the logical optimizer, which carries out
the logical optimizations such as projection and pushdown.
Compiler
The compiler compiles the optimized logical plan into a series of
MapReduce jobs.
Execution engine
Finally, the MapReduce jobs are submitted to Hadoop in a sorted order, where they are executed to produce the desired results.
Pig Latin Data Model
The data model of Pig Latin is fully nested and it allows complex non-
atomic datatypes such as map and tuple. The elements of Pig Latin's data model are described below.
Atom
Any single value in Pig Latin, irrespective of their data, type is known as
an Atom. It is stored as string and can be used as string and number. int,
long, float, double, chararray, and bytearray are the atomic values of Pig. A
piece of data or a simple atomic value is known as a field.
Example − ‘raja’ or ‘30’
Tuple
A record that is formed by an ordered set of fields is known as a tuple, the
fields can be of any type. A tuple is similar to a row in a table of RDBMS.
Example − (Raja, 30)
Bag
A bag is an unordered set of tuples. In other words, a collection of tuples
(non-unique) is known as a bag. Each tuple can have any number of fields
(flexible schema). A bag is represented by ‘{}’. It is similar to a table in
RDBMS, but unlike a table in RDBMS, it is not necessary that every tuple
contain the same number of fields or that the fields in the same position
(column) have the same type.
Map
A map (or data map) is a set of key-value pairs. The key needs to be of type
chararray and should be unique. The value might be of any type. It is
represented by ‘[]’
Relation
A relation is a bag of tuples. The relations in Pig Latin are unordered (there
is no guarantee that tuples are processed in any particular order).
Preparing HDFS
In MapReduce mode, Pig reads (loads) data from HDFS and stores the
results back in HDFS. Therefore, let us start HDFS and create the following
sample data in HDFS.
Student ID | First Name | Last Name | Phone | City
The above dataset contains personal details like id, first name, last name,
phone number and city, of six students.
Step 1: Verifying Hadoop
First of all, verify the installation using Hadoop version command, as shown
below.
$ hadoop version
If your system contains Hadoop, and if you have set the PATH variable,
then you will get the following output −
Hadoop 2.6.0
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r
e3496499ecb8d220fba99dc5ed4c99c8f9e33bb1
Compiled by jenkins on 2014-11-13T21:10Z
Compiled with protoc 2.5.0
From source with checksum 18e43357c8f927c0695f1e9522859d6a
This command was run using
/home/Hadoop/hadoop/share/hadoop/common/hadoop
common-2.6.0.jar
Step 2: Starting HDFS
Browse through the sbin directory of Hadoop and start yarn and Hadoop
dfs (distributed file system) as shown below.
cd /$Hadoop_Home/sbin/
$ start-dfs.sh
localhost: starting namenode, logging to
/home/Hadoop/hadoop/logs/hadoopHadoop-namenode-
localhost.localdomain.out
localhost: starting datanode, logging to
/home/Hadoop/hadoop/logs/hadoopHadoop-datanode-
localhost.localdomain.out
Starting secondary namenodes [0.0.0.0]
starting secondarynamenode, logging to
/home/Hadoop/hadoop/logs/hadoop-Hadoopsecondarynamenode-
localhost.localdomain.out
$ start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /home/Hadoop/hadoop/logs/yarn-
Hadoopresourcemanager-localhost.localdomain.out
localhost: starting nodemanager, logging to
/home/Hadoop/hadoop/logs/yarnHadoop-nodemanager-
localhost.localdomain.out
Step 3: Create a Directory in HDFS
In Hadoop DFS, you can create directories using the command mkdir.
Create a new directory in HDFS with the name Pig_Data in the required
path as shown below.
$cd /$Hadoop_Home/bin/
$ hdfs dfs -mkdir hdfs://localhost:9000/Pig_Data
Step 4: Placing the data in HDFS
Create a file named student_data.txt in the local file system with the following records:
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai.
Now, move the file from the local file system to HDFS using put command
as shown below. (You can use copyFromLocal command as well.)
$ cd $HADOOP_HOME/bin
$ hdfs dfs -put /home/Hadoop/Pig/Pig_Data/student_data.txt
hdfs://localhost:9000/pig_data/
$ cd $HADOOP_HOME/bin
$ hdfs dfs -cat hdfs://localhost:9000/pig_data/student_data.txt
Output
You can see the content of the file as shown below.
15/10/01 12:16:55 WARN util.NativeCodeLoader: Unable to load native-
hadoop
library for your platform... using builtin-java classes where applicable
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai
The Load Operator
You can load data into Apache Pig from the file system (HDFS/ Local)
using LOAD operator of Pig Latin.
Syntax
The load statement consists of two parts divided by the “=” operator. On the
left-hand side, we need to mention the name of the relation where we want
to store the data, and on the right-hand side, we have to define how we store
the data. Given below is the syntax of the Load operator.
Relation_name = LOAD 'Input file path' USING function as schema;
Where,
● relation_name − We have to mention the relation in which we want
to store the data.
● Input file path − We have to mention the HDFS directory where the
file is stored. (In MapReduce mode)
● function − We have to choose a function from the set of load
functions provided by Apache Pig (BinStorage, JsonLoader,
PigStorage, TextLoader).
● Schema − We have to define the schema of the data. We can define
the required schema as follows −
(column1 : data type, column2 : data type, column3 : data type);
Note − If we load the data without specifying the schema, then the columns will be addressed as $0, $1, $2, and so on.
Example
As an example, let us load the data in student_data.txt in Pig under the
schema named Student using the LOAD command.
Start the Pig Grunt Shell
First of all, open the Linux terminal. Start the Pig Grunt shell in MapReduce
mode as shown below.
$ pig -x mapreduce
It will start the Pig Grunt shell as shown below.
15/10/01 12:33:37 INFO pig.ExecTypeProvider: Trying ExecType :
LOCAL
15/10/01 12:33:37 INFO pig.ExecTypeProvider: Trying ExecType :
MAPREDUCE
15/10/01 12:33:37 INFO pig.ExecTypeProvider: Picked MAPREDUCE as
the ExecType
INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine
- Connecting to hadoop file system at: hdfs://localhost:9000
grunt>
Execute the Load Statement
Now load the data from the file student_data.txt into Pig by executing the
following Pig Latin statement in the Grunt shell.
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt'
USING PigStorage(',')
as ( id:int, firstname:chararray, lastname:chararray, phone:chararray,
city:chararray );
Following is the description of the above statement.
column : id, firstname, lastname, phone, city
datatype : int, chararray, chararray, chararray, chararray
The Store Operator
You can store the loaded data in the file system using the STORE operator. Given below is the syntax of the Store statement.
STORE Relation_name INTO ' required_directory_path ' [USING function];
Example
Assume we have a file student_data.txt in HDFS with the following
content.
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai.
And we have read it into a relation student using the LOAD operator, and stored it back into HDFS with the STORE operator, as shown below.
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',') as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );
grunt> STORE student INTO 'hdfs://localhost:9000/pig_Output/' USING PigStorage(',');
Output
After executing the store statement, you will get the following output. A
directory is created with the specified name and the data will be stored in it.
2015-10-05 13:05:05,429 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.
MapReduceLau ncher - 100% complete
2015-10-05 13:05:05,429 [main] INFO
org.apache.pig.tools.pigstats.mapreduce.SimplePigStats -
Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
2.6.0 0.15.0 Hadoop 2015-10-0 13:03:03 2015-10-05 13:05:05
UNKNOWN
Success!
The first file contains three fields: user, url & id. The second file contains two fields: url & rating. These two files are CSV files.
The Apache Pig operators can be classified as: Relational and Diagnostic.
Relational Operators:
Relational operators are the main tools Pig Latin provides to operate on the
data. It allows you to transform the data by sorting, grouping, joining,
projecting and filtering. This section covers the basic relational operators.
LOAD:
LOAD operator is used to load data from the file system or HDFS storage
into a Pig relation.
In this example, the Load operator loads data from file ‘first’ to form
relation ‘loading1’. The field names are user, url, id.
FOREACH:
This operator generates data transformations based on columns of data. It is
used to add or remove fields from a relation. Use FOREACH-GENERATE
operation to work with columns of data.
FOREACH Result:
FILTER:
This operator selects tuples from a relation based on a condition.
In this example, we are filtering the record from ‘loading1’ when the
condition ‘id’ is greater than 8.
FILTER Result:
JOIN:
JOIN operator is used to perform an inner, equijoin join of two or more
relations based on common field values. The JOIN operator always
performs an inner join. Inner joins ignore null keys, so it makes sense to
filter them out before the join.
In this example, join the two relations based on the column ‘url’ from
‘loading1’ and ‘loading2’.
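A Pig Latin statement matching this description would be (a sketch; the result relation name loading3 is the one stored later in the STORE example):
grunt> loading3 = JOIN loading1 BY url, loading2 BY url;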
JOIN Result:
ORDER BY:
Order By is used to sort a relation based on one or more fields. You can do
sorting in ascending or descending order using ASC and DESC keywords.
In below example, we are sorting data in loading2 in ascending order on
ratings field.
ORDER BY Result:
DISTINCT:
Distinct removes duplicate tuples in a relation.Lets take an input file as
below, which has amr,crap,8 and amr,myblog,10 twice in the file. When
we apply distinct on the data in this file, duplicate entries are removed.
DISTINCT Result:
STORE:
Store is used to save results to the file system.
Here we are saving loading3 data into a file named storing on HDFS.
STORE Result:
GROUP:
The GROUP operator groups together the tuples with the same group key
(key field). The key field will be a tuple if the group key has more than one
field, otherwise it will be the same type as that of the group key. The result
of a GROUP operation is a relation that includes one tuple per group.
In this example, we group the tuples of 'loading1' by the url field.
COGROUP:
COGROUP is same as GROUP operator. For readability, programmers
usually use GROUP when only one relation is involved and COGROUP
when multiple relations re involved.
In this example group the ‘loading1’ and ‘loading2’ by url field in both
relations.
COGROUP Result:
CROSS:
The CROSS operator is used to compute the cross product (Cartesian
product) of two or more relations.
Applying cross product on loading1 and loading2.
CROSS Result:
LIMIT:
LIMIT operator is used to limit the number of output tuples. If the specified
number of output tuples is equal to or exceeds the number of tuples in the
relation, the output will include all tuples in the relation.
LIMIT Result:
SPLIT:
SPLIT operator is used to partition the contents of a relation into two or
more relations based on some expression. Depending on the conditions
stated in the expression.
Split the relation loading2 into two relations x and y. Relation x contains the tuples of loading2 whose rating is greater than 8, and relation y contains the tuples whose rating is less than or equal to 8.
5.3 REFERENCES
6
SPARK
Unit Structure :
6.0 Introduction
6.1 Spark
6.2 Spark RDD – Operations
6.3 Create an RDD in Apache Spark
6.4 Spark In-Memory Computing
6.5 Lazy Evaluation In Apache Spark
6.6 RDD Persistence and Caching in Spark
6.7 Lab Sessions
6.8 Exercises
6.9 Questions
6.10 Quiz
6.11 Video Lectures
6.12 Moocs
6.13 References
6.0 INTRODUCTION
6.1 SPARK
We can store frequently used RDDs in memory and also retrieve them directly from memory without going to disk, which speeds up execution. We can perform multiple operations on the same data; this happens by storing the data explicitly in memory by calling the persist() or cache() function. Follow this guide for the detailed study of RDD persistence in Spark.
vi. Partitioning
RDD partition the records logically and distributes the data across
various nodes in the cluster. The logical divisions are only for
processing and internally it has no division. Thus, it provides
parallelism.
vii. Parallel
RDDs process the data in parallel over the cluster.
viii. Location-Stickiness
RDDs are capable of defining placement preference to compute partitions. Placement preference refers to information about the location of the RDD. The DAGScheduler places the partitions in such a way that each task is as close to the data as possible, thus speeding up computation.
ix. Coarse-grained Operation
We apply coarse-grained transformations to an RDD. Coarse-grained means the operation applies to the whole dataset, not to an individual element in the dataset.
x. Typed
We can have RDDs of various types, for example: RDD[Int], RDD[Long], RDD[String].
xi. No limitation
We can have any number of RDD. There is no limit to its number.
The limit depends on the size of disk and memory.
Limitations of Apache Spark – Ways to Overcome Spark Drawbacks
in a user-friendly manner. Apache Spark requires lots of RAM to run in-memory, thus the cost of Spark is quite high.
e. Fewer Algorithms
Spark MLlib lags behind in terms of the number of available algorithms, for example Tanimoto distance.
f. Manual Optimization
A Spark job needs to be manually optimized and tuned to specific datasets. If we want partitioning and caching in Spark to be correct, they must be controlled manually.
g. Iterative Processing
In Spark, the data iterates in batches and each iteration is scheduled
and executed separately.
h. Latency
Apache Spark has higher latency as compared to Apache Flink.
i. Window Criteria
Spark does not support record based window criteria. It only has time-
based window criteria.
j. Back Pressure Handling
Back pressure is the build-up of data at an input/output when the buffer is full and not able to receive additional incoming data. No data is transferred until the buffer is empty.
❖ TRANSFORMATION
❖ ACTION
Actions are operations that trigger the process by returning the result back to the program. Transformations and actions work differently.
1. Transformation
Transformation is the process of forming new RDDs from existing ones. A transformation is a user-specific function. It is the process of changing the current dataset into the dataset we want to have.
Some common transformations supported by Spark are map(func), filter(func), mapPartitions(func), flatMap(func), etc.
All transformed RDDs are lazy in nature. As we are already familiar with the term "lazy evaluation", this means they do not produce their results instantly; we always require an action to complete the computation.
To trigger the execution, an action is a must. Up to that action, the data inside the RDD is not transformed or available.
After each transformation, you incrementally build the lineage. That lineage is formed by all the parent RDDs of the final RDD. As soon as the execution process ends, the resultant RDDs will be completely different from their parent RDDs.
They can be smaller (e.g. filter, count, distinct, sample), bigger (e.g.
flatMap, union, Cartesian) or the same size (e.g. map).
Transformations can be categorized further as Narrow Transformations and Wide Transformations.
a. Narrow Transformations
Narrow transformations are the result of map() and filter() functions
and these compute data that live on a single partition meaning there
will not be any data movement between partitions to execute narrow
transformations.
An output RDD also has partitions with records; in the parent RDD, that output originates from a single partition. Additionally, only a limited subset of partitions is used to calculate the result.
In Apache Spark, narrow transformations are grouped into a stage. That process is known as pipelining, which is an implementation mechanism. Functions such as map(), mapPartitions(), flatMap(), filter(), and union() are some examples of narrow transformations.
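As a rough illustration (not part of the original example set), the short snippet below contrasts a narrow transformation, which needs no shuffle, with a wide transformation such as reduceByKey(), which moves data between partitions; the variable names are our own. The longer word-count program that follows then exercises both kinds of transformations together.
// Narrow vs. wide transformations – a minimal sketch for the spark-shell (sc is the SparkContext)
val nums = sc.parallelize(1 to 10, 2)
val doubled = nums.map(_ * 2)             // narrow: each partition is processed independently
val evens = doubled.filter(_ % 4 == 0)    // narrow: still no data movement between partitions
val pairs = evens.map(n => (n % 3, n))    // narrow: builds key/value pairs
val sums = pairs.reduceByKey(_ + _)       // wide: values with the same key are shuffled together
sums.collect().foreach(println)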
package com.sparkbyexamples.spark.rdd
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
object WordCountExample {
def main(args:Array[String]): Unit = {
val spark:SparkSession = SparkSession.builder()
.master("local[3]")
.appName("SparkByExamples.com")
.getOrCreate()
val sc = spark.sparkContext
val rdd:RDD[String] = sc.textFile("src/main/resources/test.txt")
println("initial partition count:"+rdd.getNumPartitions)
val reparRdd = rdd.repartition(4)
println("re-partition count:"+reparRdd.getNumPartitions)
//rdd.coalesce(3)
rdd.collect().foreach(println)
// rdd flatMap transformation
val rdd2 = rdd.flatMap(f=>f.split(" "))
rdd2.foreach(f=>println(f))
//Create a Tuple by adding 1 to each word
val rdd3:RDD[(String,Int)]= rdd2.map(m=>(m,1))
rdd3.foreach(println)
//Filter transformation
val rdd4 = rdd3.filter(a=> a._1.startsWith("a"))
rdd4.foreach(println)
//ReduceBy transformation
val rdd5 = rdd3.reduceByKey(_ + _)
rdd5.foreach(println)
//Swap word,count and sortByKey transformation
val rdd6 = rdd5.map(a=>(a._2,a._1)).sortByKey()
println("Final Result")
//Action - foreach
rdd6.foreach(println)
//Action - count
println("Count : "+rdd6.count())
//Action - first
val firstRec = rdd6.first()
println("First Record : "+firstRec._1 + ","+ firstRec._2)
//Action - max
val datMax = rdd6.max()
println("Max Record : "+datMax._1 + ","+ datMax._2)
//Action - reduce
val totalWordCount = rdd6.reduce((a,b) => (a._1+b._1,a._2))
println("dataReduce Record : "+totalWordCount._1)
//Action - take
val data3 = rdd6.take(3)
data3.foreach(f=>{
println("data3 Key:"+ f._1 +", Value:"+f._2)
})
//Action - collect
val data = rdd6.collect()
data.foreach(f=>{
println("Key:"+ f._1 +", Value:"+f._2)
})
//Action - saveAsTextFile
rdd5.saveAsTextFile("c:/tmp/wordCount")
}
}
2. Actions
An action is an operation that triggers execution of computations and RDD transformations and returns the result back to storage or to the program. Transformations return new RDDs, while actions return some other data types; actions give non-RDD values to the RDD operations.
An action forces the evaluation of the transformations needed for the RDD it is called on, since it actually needs to produce output.
An action instructs Spark to compute a result from a series of
transformations.
Actions are one of two ways to send data from executors to the driver.
Executors are agents that are responsible for executing different tasks.
While a driver coordinates execution of tasks.
Transformations create RDDs from each other, but when we want to work with the actual dataset, an action is performed. When an action is triggered, a new RDD is not formed as it is with a transformation. Thus, actions are Spark RDD operations that give non-RDD values. The values of an action are stored to the driver or to the external storage system. An action brings the laziness of RDD into motion.
An action is one of the ways of sending data from the executors to the driver. Executors are agents that are responsible for executing a task, while the driver is a JVM process that coordinates workers and execution of the task.
Some of the actions of Spark are:
count()
Action count() returns the number of elements in RDD.
For example, if an RDD has the values {1, 2, 2, 3, 4, 5, 5, 6}, then "rdd.count()" will give the result 8.
Count() example:
val data = spark.read.textFile("spark_test.txt").rdd
val mapFile = data.flatMap(lines => lines.split(" ")).filter(value => value == "spark")
println(mapFile.count())
Note – In the above code, flatMap() splits each line into words, filter() keeps only the words equal to "spark", and count() then returns how many such words there are.
collect()
The action collect() is the most common and simplest operation that returns the entire content of the RDD to the driver program. A typical application of collect() is unit testing, where the entire RDD is expected to fit in memory; this makes it easy to compare the result of the RDD with the expected result. A constraint of collect() is that all the data must fit on the machine and is copied to the driver.
Collect() example:
val data = spark.sparkContext.parallelize(Array(('A', 1), ('b', 2), ('c', 3)))
val data2 = spark.sparkContext.parallelize(Array(('A', 4), ('A', 6), ('b', 7), ('c', 3), ('c', 8)))
val result = data.join(data2)
println(result.collect().mkString(","))
Note – The join() transformation in the above code joins the two RDDs on the basis of the same key (the character). After that, the collect() action returns all the elements of the dataset as an Array.
take(n)
The action take(n) returns the first n elements of the RDD. It tries to cut down the number of partitions it accesses, so it represents a biased collection; we cannot presume the order of the elements.
For example, consider the RDD {1, 2, 2, 3, 4, 5, 5, 6}; in this RDD, "take(4)" will give the result {1, 2, 2, 3}.
Take() example:
val data = spark.sparkContext.parallelize(Array(('k', 5), ('s', 3), ('s', 4), ('p', 7), ('p', 5), ('t', 8), ('k', 6)), 3)
val group = data.groupByKey().collect()
val twoRec = group.take(2)
twoRec.foreach(println)
Note – The take(2) call returns an array with the first two elements of the grouped data set.
top()
If an ordering is present in our RDD, then we can extract the top elements from our RDD using top(). The action top() uses the default ordering of the data.
Top() example:
val data = spark.read.textFile("spark_test.txt").rdd
val mapFile = data.map(line => (line, line.length))
val res = mapFile.top(3)
res.foreach(println)
Note – map() operation will map each line with its length. And top(3) will
return 3 records from mapFile with default ordering.
countByValue()
The countByValue() action returns how many times each element occurs in the RDD.
For example, RDD has values {1, 2, 2, 3, 4, 5, 5, 6} in this RDD
“rdd.countByValue()” will give the result {(1,1), (2,2), (3,1), (4,1), (5,2),
(6,1)}
countByValue() example:
val data = spark.read.textFile("spark_test.txt").rdd
val result = data.map(line => (line, line.length)).countByValue()
result.foreach(println)
Note – The countByValue() action returns a map of (value, count) pairs with the count of each distinct element.
reduce()
The reduce() function takes two elements at a time as input from the RDD and produces an output of the same type as the input elements. The simplest form of such a function is addition: we can add the elements of an RDD or count the number of words. It accepts a commutative and associative operation as an argument.
Reduce() example:
val rdd1 = spark.sparkContext.parallelize(List(20, 32, 45, 62, 8, 5))
val sum = rdd1.reduce(_ + _)
println(sum)
Note – The reduce() action in above code will add the elements of the source
RDD.
fold()
The signature of the fold() is like reduce(). Besides, it takes “zero value” as
input, which is used for the initial call on each partition. But, the condition
with zero value is that it should be the identity element of that operation.
The key difference between fold() and reduce() is that, reduce() throws an
exception for empty collection, but fold() is defined for empty collection.
For example, zero is an identity for addition; one is identity element for
multiplication. The return type of fold() is same as that of the element of
RDD we are operating on.
For example, rdd.fold(0)((x, y) => x + y).
Fold() example:
val rdd1 = spark.sparkContext.parallelize(List(("maths", 80), ("science", 90)))
val additionalMarks = ("extra", 4)
val sum = rdd1.fold(additionalMarks) { (acc, marks) =>
  val add = acc._2 + marks._2
  ("total", add)
}
println(sum)
Note – In the above code, additionalMarks is the zero value. Its int part is used as the initial accumulator on each partition (and once more when combining the partition results), and the int value of each record in the source RDD is added to it.
aggregate()
It gives us the flexibility to return a data type different from the input type. The aggregate() function takes two functions to get the final result: with one function we combine each element of the RDD with the accumulator, and with the second we combine two accumulators. In aggregate, we also supply the initial zero value of the type we want to return.
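Since no code is shown above for aggregate(), the following minimal sketch (our own example, not from the text) builds a (sum, count) pair in one pass over an RDD of integers and then derives the average from it:
// aggregate() sketch: the zero value (0, 0) is of the return type, not the element type
val nums = spark.sparkContext.parallelize(List(1, 2, 3, 4, 5))
val (total, count) = nums.aggregate((0, 0))(
  (acc, value) => (acc._1 + value, acc._2 + 1),          // combine an element with the accumulator
  (acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2) // combine two accumulators
)
println(s"Average = ${total.toDouble / count}")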
foreach()
When we want to apply an operation to each element of an RDD without returning a value to the driver, the foreach() function is useful, for example when inserting a record into a database.
Foreach() example:
val data = spark.sparkContext.parallelize(Array(('k', 5), ('s', 3), ('s', 4), ('p', 7), ('p', 5), ('t', 8), ('k', 6)), 3)
val group = data.groupByKey().collect()
group.foreach(println)
Note – The foreach() action runs a function (println) on each element of the dataset group.
Mostly, for production systems, we create RDDs from files. Here we will see how to create an RDD by reading data from a file.
val rdd = spark.sparkContext.textFile("/path/textFile.txt")
This creates an RDD for which each record represents a line in a file.
If you want to read the entire content of a file as a single record use
wholeTextFiles() method on sparkContext.
val rdd2 = spark.sparkContext.wholeTextFiles("/path/textFile.txt")
rdd2.foreach(record=>println("FileName : "+record._1+", FileContents
:"+record._2))
In this case, each text file is a single record. In this, the name of the file is
the first column and the value of the text file is the second column.
Creating from another RDD
You can use transformations like map, flatmap, filter to create a new RDD
from an existing one.
val rdd3 = rdd.map(row=>{(row._1,row._2+100)})
The above creates a new RDD "rdd3" by adding 100 to the value of each record in the source RDD; this example produces the output below.
(Python,100100)
(Scala,3100)
(Java,20100)
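For the map example above to produce that output, rdd must already be a pair RDD; a hypothetical definition matching the sample output (the language names and numbers are chosen only to reproduce it) would look like this:
// Hypothetical pair RDD assumed by the map example above
val rdd = spark.sparkContext.parallelize(List(("Python", 100000), ("Scala", 3000), ("Java", 20000)))
val rdd3 = rdd.map(row => (row._1, row._2 + 100))   // add 100 to the value of each record
rdd3.collect().foreach(println)                     // (Python,100100), (Scala,3100), (Java,20100)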
From existing DataFrames and DataSet
To convert a Dataset or DataFrame to an RDD, just use the rdd method on either of these data types.
val myRdd2 = spark.range(20).toDF().rdd
toDF() creates a DataFrame and by calling rdd on DataFrame returns back
RDD.
The data is kept in random access memory (RAM) instead of slow disk drives and is processed in parallel. Using this, we can detect patterns and analyze large data. In-memory computing has become popular because the cost of memory has fallen, so in-memory processing is economical for applications. The two main pillars of in-memory computation are:
⮚ RAM storage
❖ MEMORY_ONLY
❖ MEMORY_AND_DISK
❖ MEMORY_ONLY_SER
❖ MEMORY_AND_DISK_SER
❖ DISK_ONLY
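These storage levels are passed to persist(); cache() is shorthand for persisting with MEMORY_ONLY. A minimal sketch of their use (the file path is only illustrative) is shown below:
import org.apache.spark.storage.StorageLevel
val lines = spark.sparkContext.textFile("/path/textFile.txt")
lines.cache()                                  // same as persist(StorageLevel.MEMORY_ONLY)
val words = lines.flatMap(_.split(" "))
words.persist(StorageLevel.MEMORY_AND_DISK)    // keep in RAM, spill to disk when it does not fit
println(words.count())                         // the first action materializes and caches the data
println(words.distinct().count())              // later actions reuse the persisted data
words.unpersist()                              // release the storage when it is no longer needed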
1. When we need data to analyze, it is already available on the go, or we can retrieve it easily.
2. It is good for real-time risk management and fraud detection.
3. The data becomes highly accessible.
4. The computation speed of the system increases.
5. It improves complex event processing.
6. A large amount of data can be cached.
7. It is economical, as the cost of RAM has fallen over a period of time.
Lazy evaluation in Spark means that execution will not start until an action is triggered. In Spark, lazy evaluation comes into the picture when Spark transformations occur.
Transformations are lazy in nature, meaning that when we call some operation on an RDD, it does not execute immediately. Spark maintains a record of which operations have been called (through a DAG).
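A minimal sketch of this behaviour (the file path is only illustrative): the transformations below are merely recorded in the lineage, and nothing is read or filtered until the final action runs.
val logLines = sc.textFile("/path/app.log")         // nothing is read yet
val errors = logLines.filter(_.contains("ERROR"))   // still nothing runs; only the lineage is recorded
println(errors.toDebugString)                       // inspect the recorded lineage (DAG)
println(errors.count())                             // this action triggers the actual read and filter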
In MapReduce, developers waste much time minimizing the number of MapReduce passes by clubbing operations together. In Spark, we do not have to build that single execution graph by hand; rather, we club many simple operations, and this creates the difference between Hadoop MapReduce and Apache Spark.
In Spark, the driver program loads the code to the cluster. If the code executed after every operation, the task would be time- and memory-consuming, since data would go to the cluster for evaluation each time.
Advantages of Lazy Evaluation in Spark Transformation
There are some benefits of Lazy evaluation in Apache Spark-
a. Increases Manageability
By lazy evaluation, users can organize their Apache Spark program
into smaller operations. It reduces the number of passes on data by
grouping operations.
b. Saves Computation and Increases Speed
Spark lazy evaluation plays a key role in saving calculation overhead, since only the necessary values get computed. It saves round trips between the driver and the cluster, thus speeding up the process.
c. Reduces Complexities
The two main complexities of any operation are time and space complexity. Using Apache Spark lazy evaluation we can overcome both: since we do not execute every operation immediately, time is saved, and it lets us work with an infinite data structure. The action is triggered only when the data is required, which reduces overhead.
d. Optimization
It provides optimization by reducing the number of queries. Learn
more about Apache Spark Optimization.
Spark performs lazy evaluation
Example 1
In the first step, we create a list of numbers from 1 to 100,000 and make an RDD with four partitions, as below. We can see the result in the output.
val data = (1 to 100000).toList
val rdd = sc.parallelize(data, 4)
println("Number of partitions is " + rdd.getNumPartitions)
Now, if you observe, MapPartitionsRDD[15] at map is dependent on ParallelCollectionRDD[14]. Let's go ahead and add one more transformation to add 20 to all the elements of the list.
// Adding 20 to each value in rdd2
val rdd3 = rdd2.map(x => x + 20)
// rdd3 object
println(rdd3)
// getting rdd lineage
rdd3.toDebugString
rdd3.collect
From the above examples, we can understand that Spark lineage is maintained using a DAG.
Example 2
Here we see how unnecessary steps or transformations are skipped due to
spark Lazy evaluation during the execution process.
import org.apache.spark.sql.functions._
import spark.implicits._   // needed for toDF() on a local collection
val df1 = (1 to 100000).toList.toDF("col1")
// scenario 1
println("scenario 1")
df1.withColumn("col2", lit(2)).explain(true)
// scenario 2
println("scenario 2")
df1.withColumn("col2", lit(2)).drop("col2").explain(true)
In this, we created a dataframe with column "col1" at the very first step. If
you observe Scenario-1, I have created a column "col2" using withColumn()
function and after that applied explain() function to analyze the physical
execution plan. The below image contains a logical plan, analyzed logical
plan, optimized logical plan, and physical plan. In the analyzed logical plan,
if you observe, there is only one projection stage; the projection indicates the columns moving forward for further execution.
⮚ Time efficient
⮚ Cost efficient
Installing Spark
To install Spark on your computer, perform the following steps:
• First ensure you have a JVM installed on your machine by typing java
-version at a command line
• Go to http://spark.apache.org and click on “Download”
• Select release 1.3.0
• Select “Prebuilt for Hadoop 2.4 and later”
• Select “Direct download”
• Click on the Download Spark link
Unpacking Spark
• In your home directory, create a subdirectory spark in which to store
Spark
• Move the downloaded file into this directory
• Unpack the file. If you are on a Unix system (Linux, Mac OS) type
the following at the command
128
128
line: tar xvf spark-1.3.0-bin-hadoop2.4.tgz. If you are on Windows, you can download 7zip from http://www.7zip.org and use it to extract the contents.
Downloading a sample dataset
As part of today’s lab, we will be processing a large file containing
approximately 4 million words.
• Download a compressed version of that file from
http://jmg3.web.rice.edu/32big.zip using either wget or by navigating
to that URL in your browser.
• Extract the 32big.txt file from that ZIP using 7zip or the unzip utility.
You should store 32big.txt under the new directory you just created
beneath your home directory, spark/spark-1.3.0-bin-hadoop2.4.
Running Spark
• cd into the new directory: cd spark/spark-1.3.0-bin-hadoop2.4
• You can now start an interactive session with Spark by calling the
Spark Shell ./bin/spark-shell.
This will start a read-eval-print loop for Scala, similar to what you might
have seen for many scripting languages, such as Python. You can try it out
by evaluating some simple Scala expressions:
scala> 1 + 2
res0: Int = 3
As you type expressions at the prompt, the resulting values are displayed,
along with new variables that they are bound to (in this case, res0), which
you can use in subsequent expressions.
When invoking spark-shell, you can alter the number of cores and bytes of
memory that Spark uses with the keyword local. For example, to invoke
spark-shell with one core and 2GB of memory, we write:
./bin/spark-shell --master local[1] --driver-memory 2G
Interacting with Spark
If you are still in a Spark session, close it using Ctrl-D and create a new one
with 1 CPU core and 2GB of memory:
./bin/spark-shell --master local[1] --driver-memory 2G
In your interactive session, you can interact with Spark via the global “spark
context” variable sc. Try this out by creating a simple RDD from the text in
the large 32big.txt file you downloaded:
scala> val textFile = sc.textFile("32big.txt")
You now have a handle on an RDD. The elements in the RDD correspond to the lines in the 32big.txt file.
We can find out the number of lines using count:
scala> textFile.count()
As shown in class, we can also use the map/reduce pattern to count the number of occurrences of each word in the RDD:
scala> val wordCounts = textFile.flatMap(line => line.split(" ")).
map(word => (word, 1)).
reduceByKey((a, b) => a + b)
We can then view the result of running our word count via the collect
operation:
scala> wordCounts.collect()
How long does this operation take when you only use one core? Note that Spark reports execution times for all operations; there should be a line at the end of collect's output that is labeled with "INFO DAGScheduler" and starts with "Job 0 finished", which reports a time. Take note of this time, as we will compare it to an execution with more than 1 core later.
If you run this collect operation repeatedly, i.e.:
scala> wordCounts.collect()
scala> wordCounts.collect()
scala> wordCounts.collect()
does the execution time change after the first execution? Can you explain
this change?
You can also test the speedup with more cores. Kill your current Spark
session by pressing Ctrl+D, and launch a new one that uses 4 cores
(assuming your laptop has 4 or more cores):
./bin/spark-shell --master local[4] --driver-memory 2G
Now, rerun the previous commands:
scala> val textFile = sc.textFile("32big.txt")
scala> val wordCounts = textFile.flatMap(line => line.split(" ")).
map(word => (word, 1)).
reduceByKey((a, b) => a + b)
scala> wordCounts.collect()
How does execution time change, compared to the time you noted for a single core?
Selective Wordcount
Your next task is to alter our map/reduce operation on textFile so that only
words of length 5 are counted,
and then display the counts of all (and only) words of length 5.
Hints:
• Scala syntax for if expressions is:
if (testExpr) thenExpr else elseExpr
An if expression can be used in any context that an expression can be
used, and returns the value returned by whichever branch of the if
expression is executed.
• The length of a string s in Scala can be found by using the method s.length.
• The elements of a pair can be retrieved using the accessors _1 and _2.
For example:
scala> (1,2)._1
res2: Int = 1
• Collections in Scala (including RDDs) have a method filter that takes
a boolean test and returns a new RDD that contains only the elements
for which the testing function passed. For example, the following
application of filter to a list of ints returns a new list containing only
the even elements:
scala> List(1,2,3,4).filter(n => n % 2 == 0)
res1: List[Int] = List(2, 4)
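One possible solution, combining the hints above, is sketched below; try it yourself before reading it. It assumes textFile is the RDD created earlier from 32big.txt.
scala> val fiveLetterCounts = textFile.flatMap(line => line.split(" ")).
filter(word => word.length == 5).
map(word => (word, 1)).
reduceByKey((a, b) => a + b)
scala> fiveLetterCounts.collect()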
Estimating π
Exit the Spark Shell using Ctrl+D and then restart it with just one core:
./bin/spark-shell --master local[1] --driver-memory 2G
We can now walk through a Spark program to estimate π in parallel from
random trials. We will alter the
number of cores that Spark makes use of and observe the impact on
performance.
At your read-eval-print loop, first define the number of random trials:
scala> val NUM_SAMPLES = 1000000000
Now we can estimate π with the following code snippet:
scala> val count = sc.parallelize(1 to NUM_SAMPLES).map{i =>
val x = Math.random()
val y = Math.random()
if (x*x + y*y < 1) 1 else 0
}.reduce(_ + _)
You can then print out an estimate of π as follows:
println("Pi is roughly " + 4.0 * count / NUM_SAMPLES)
Now exit the Spark shell and restart with 2 cores, and then 4 cores. Do you
observe a speedup?
6.8 EXERCISES
• Then implement a second version where A-B and B-A are considered as the same road.
8. Compute the hottest month (defined by column "Month") on average
over the years considering all temperatures (column "ProbTair")
reported in the dataset.
6.9 QUESTIONS
6.10 QUIZ
4. Spark SQL provides a domain-specific language to manipulate
___________ in Scala, Java, or Python.
a) Spark Streaming
b) Spark SQL
c) RDDs
d) All of the mentioned
5. Point out the wrong statement.
a) For distributed storage, Spark can interface with a wide variety,
including Hadoop Distributed File System (HDFS)
b) Spark also supports a pseudo-distributed mode, usually used
only for development or testing purposes
c) Spark has over 465 contributors in 2014
d) All of the mentioned
6. ______________ leverages Spark Core fast scheduling capability to
perform streaming analytics.
a) MLlib
b) Spark Streaming
c) GraphX
d) RDDs
7. ____________ is a distributed machine learning framework on top of
Spark.
a) MLlib
b) Spark Streaming
c) GraphX
d) RDDs
8. ________ is a distributed graph processing framework on top of
Spark.
a) MLlib
b) Spark Streaming
c) GraphX
d) All of the mentioned
9. GraphX provides an API for expressing graph computation that can
model the __________ abstraction.
a) GaAdt
b) Spark Core
c) Pregel
d) None of the mentioned
10. Spark architecture is ___________ times as fast as Hadoop disk-based Apache Mahout and even scales better than Vowpal Wabbit.
a) 10
b) 20
c) 50
d) 100
11. Users can easily run Spark on top of Amazon’s __________
a) Infosphere
b) EC2
c) EMR
d) None of the mentioned
12. Point out the correct statement.
a) Spark enables Apache Hive users to run their unmodified
queries much faster
b) Spark interoperates only with Hadoop
c) Spark is a popular data warehouse solution running on top of
Hadoop
d) None of the mentioned
13. Spark runs on top of ___________ a cluster manager system which
provides efficient resource isolation across distributed applications.
a) Mesjs
b) Mesos
c) Mesus
d) All of the mentioned
14. Which of the following can be used to launch Spark jobs inside
MapReduce?
a) SIM
b) SIMR
c) SIR
d) RIS
15. Point out the wrong statement.
a) Spark is intended to replace, the Hadoop stack
b) Spark was designed to read and write data from and to HDFS,
as well as other storage systems
c) Hadoop users who have already deployed or are planning to
deploy Hadoop Yarn can simply run Spark on YARN
d) None of the mentioned
16. Which of the following languages is not supported by Spark?
a) Java
b) Pascal
c) Scala
d) Python
17. Spark is packaged with higher level libraries, including support for
_________ queries.
a) SQL
b) C
c) C++
d) None of the mentioned
18. Spark includes a collection of over ________ operators for transforming data and familiar data frame APIs for manipulating semi-structured data.
a) 50
b) 60
c) 70
d) 80
19. Spark is engineered from the bottom-up for performance, running
___________ faster than Hadoop by exploiting in memory computing
and other optimizations.
a) 100x
b) 150x
c) 200x
d) None of the mentioned
20. Spark powers a stack of high-level tools including Spark SQL,
MLlib for _________
a) regression models
b) statistics
c) machine learning
d) reproductive research
6.11 VIDEO LECTURES
6.12 MOOCS
6.13 REFERENCES
1. https://techvidvan.com/tutorials/spark-rdd-features/
2. https://data-flair.training/blogs/apache-spark-rdd-features/
3. https://sparkbyexamples.com/apache-spark-rdd/spark-rdd-transformations/
4. https://data-flair.training/blogs/spark-rdd-operations-transformations-actions/
5. https://sparkbyexamples.com/apache-spark-rdd/different-ways-to-create-spark-rdd/
6. https://data-flair.training/blogs/spark-in-memory-computing/
7. https://www.projectpro.io/recipes/explain-spark-lazy-evaluation-detail
8. https://data-flair.training/blogs/apache-spark-rdd-persistence-caching/
9. https://www.simplilearn.com/top-apache-spark-interview-questions-and-answers-article#apache_spark_interview_questions
7
DATA VISUALIZATION
Unit Structure :
7.0 Introduction
7.1 Data Visualization
7.2 Data Visualization and Big Data
7.3 Types of Data Visualizations
7.4 Common Data Visualization Use Cases
7.5 Basic Data Visualization Principles
7.6 Connecting to Data
7.7 Tableau
7.8 Custom SQL Queries Tableau
7.9 Visualizing Treemaps Using Tableau
7.10 Creating A Story with Tableau Public
7.11 Saving and Publishing Your Tableau Public Workbook
7.12 A Dashboard with Tableau
7.13 Questions
7.14 Quiz
7.15 Video Lectures
7.16 MOOCS
7.17 References
7.0 INTRODUCTION
The increased popularity of big data and data analysis projects have made
visualization more important than ever. Companies are increasingly using
machine learning to gather massive amounts of data that can be difficult and
slow to sort through, comprehend and explain. Visualization offers a means
to speed this up and present information to business owners and
stakeholders in ways they can understand.
Big data visualization often goes beyond the typical techniques used in
normal visualization, such as pie charts, histograms and corporate graphs.
It instead uses more complex representations, such as heat maps and fever
charts. Big data visualization requires powerful computer systems to collect
raw data, process it and turn it into graphical representations that humans
can use to quickly draw insights.
While big data visualization can be beneficial, it can pose several
disadvantages to organizations. They are as follows:
Sales and marketing. Research from the media agency Magna predicts that
half of all global advertising dollars will be spent online by 2020. Data
visualization makes it easy to see traffic trends over time as a result of
marketing efforts.
Politics. A common use of data visualization in politics is a geographic map
that displays the party each state or district voted for.
Healthcare. Healthcare professionals frequently use choropleth maps to
visualize important health data. A choropleth map displays divided
geographical areas or regions that are assigned a certain color in relation to
a numeric variable.
Scientists. Scientific visualization, sometimes referred to in shorthand as
SciVis, allows scientists and researchers to gain greater insight from their
experimental data than ever before.
Finance. Finance professionals must track the performance of their
investment decisions when choosing to buy or sell an asset.
Logistics. Shipping companies can use visualization tools to determine the
best global shipping routes.
Data scientists and researchers. Visualizations built by data scientists are
typically for the scientist's own use, or for presenting the information to a
select audience.
DATA VISUALIZATION TOOLS AND VENDORS
Data visualization tools can be used in a variety of ways. The most common
use today is as a business intelligence (BI) reporting tool. Users can set up
visualization tools to generate automatic dashboards that track company
performance across key performance indicators (KPIs) and visually
interpret the results.
While Microsoft Excel continues to be a popular tool for data visualization,
others have been created that provide more sophisticated abilities:
Each of the options in the ‘Add Chart Element’ menu allows you to choose
from a set of pre-populated options, or to open a menu with more options.
The ‘Change Chart Type’ button will allow you to change the type of chart
for all the data on the chart, or a selected series.
Format Tab
The Format tab contains the standard outline and fill color options.
In the very top-left section of the Format tab is the ‘Chart Elements’ drop-
down menu. The list in this drop-down menu consists of everything in your
chart including titles, axes, error bars, and every series. If you have a lot of
objects on your chart, this drop-down menu will help you to easily find and
select what you need.
Select a specific data range to use as labels in your chart. This comes in
quite handy when, for example, you want to add custom labels to a
scatterplot. Instead of having to do the labeling manually, you can select the
data labels series in the spreadsheet. Among the new chart types are the Treemap, Histogram, Box & Whisker, and Waterfall charts.
Overlaid Gridlines
The Overlaid Gridline chart is a column chart with gridlines on top of the
columns. This type of chart allows viewers to absorb the column data as
segments rather than single columns. Use the OverlaidGridline tab in the
Advanced Data Visualizations with Excel 2016 Hands-On.xlsx spreadsheet
to create the chart.
3. We’re now going to add the four “Line” series to the chart.
4. You will now have a clustered column chart, five series for each
group.
5. If we were to simply change the lines to white, they would end in the
middle of the bars of the A and E groups. We now move each of those
four lines to the “secondary axis” so we can get them to stretch
through the bars. To do so, first select a line, right-click, and select
“Format Data Series”. Go to the “Series Options” tab and select the
“Secondary Axis” option.
6. You’ll notice that a new y-axis has appeared on the right side of the
graph. When you’re done moving all four series to the secondary axis,
this new y-axis should go from 0 to 25.
8. Change the colors of the lines to white using the “Format” tab option.
9. We fix that by changing how the data points line up with the tick
marks. In a default line graph in Excel, the data markers line up
between the tick marks; notice how the line begins in the middle of
the A bar, between the y-axis and the tick mark between the A and B
groups. By placing the data markers on the tick marks, we can extend
the lines through the bars. To do so, we’ll format the secondary x-axis
(by right-clicking and navigating to the “Axis position” options under “Axis Options” in the “Format axis” menu).
10. Add your vertical primary axis line, select the axis and add the line
under the “Format Axis” menu.
11. We want to extend the data series for each “Line” series through row 11. One way to do this is to right-click on the graph, select the “Select Data” option, and edit each of the 4 “Line” series to extend the data series.
Alternatively, you can select the line on the chart and you’ll notice
that your data are selected in the spreadsheet. You can then drag the
selection box to extend the data series.
12. We need to now change where the data markers line up with the tick
marks. Once again, format the secondary x-axis and change the
“Position Axis” back to “Between tick marks”.
13. We also want to turn off the secondary horizontal axis and set the “Line Color” to “No line”.
14. Repeat the process in Step 13 for the secondary vertical axis, remove
the gridlines and style the rest as you see fit.
Overlaid Gridlines with a Formula
In this version of the Overlaid Gridlines chart, I create a stacked column
chart. Each section of the chart is given a white outline so that it appears
like there are gridlines. Use the OverlaidGridlines_Formula tab in the
Advanced Data Visualizations with Excel 2016 Hands-On.xlsx spreadsheet
to create the chart.
Create a stacked column chart from cells C16:M20. These are the cells that
contain the formula.
To plot the rows, select the chart and the “Switch Row/Column” button in
the “Design” tab of the ribbon.
Change the fill of each shape under the “Shape Fill” dropdown in the
“Format” menu to the same color. Similarly, change the color of the “Shape
Outline” to white and increase the thickness to your desired weight. Of
course, delete the existing (default) gridlines, legend, etc.
⮚ Where you can get publicly available data and how to use it
⮚ Pivoting fields
Public data
The data sets that are publicly available or the ones that you have compiled
on your own, are ideal for Tableau Public. Public data is readily available
online. Tableau Public maintains a catalog of publicly available data. Much
of this data is produced by various governments, economic groups, and
sports fans, along with a link to, and a rating for each source. You can find
it at http://public.tableau.com/s/resources. The Google Public Data Explorer
has a large collection of public data, including economic forecasts and
global public health data. This tool is unique because it allows users to make
simple visualizations from all the original data sources without having to
investigate the source data, though most of it is available by linking
available resources.
Tables and databases
Data is stored in tables. A table is an array of items, and it can be as simple
as a single word, letter, or number, or as complicated as millions (or more)
of rows of transactions with timestamps, qualitative attributes (such as size
or color), and numeric facts, such as the quantity of the purchased goods.
Both a single text file of data and a worksheet in an Excel workbook are tables. When grouped together in a method that has been designed to enable a user to retrieve data from them, they constitute a database. Typically,
when we think of databases, we think of the Database Management Systems
(DBMS) and languages that we use to make sense of the data in tables, such
as Oracle, Teradata, or Microsoft's SQL Server. Currently, the Hadoop and
NoSQL platforms are very popular because they are comparatively low-cost
and can store very large sets of data, but Tableau Public does not enable a
connection to these platforms.
Tableau Public is designed in such a way that it allows users in a single data
connection to join tables of data, which may or may not have been
previously related to each other, as long as they are in the same format.
The most common format of publicly available data is in a text file or a
Character-Separated Values (CSV) file. CSV files are useful because they
are simple. Many public data sources do allow data to be downloaded as
Excel documents. The World Bank has a comprehensive collection, and we
will demonstrate the connective capabilities of Tableau Public using one of
its data products. Tables can be joined in Tableau Public by manually
identifying the common field among the tables. Tableau Public connects to
four different data sources, namely Access, Excel, text file (CSV or TXT),
and OData; the first two data sources are bundled with Microsoft Office (in
most cases), and the second two are freely available to everyone, regardless
of the operating system that they are using.
Connecting to the data in Tableau Public
Tableau Public has a graphical user interface (GUI) that was designed to
enable users to load data sources without having to write code. Since the
only place to save Tableau Public documents is in Tableau's Cloud, data
sources are automatically extracted and packaged with the workbook.
Connecting to data from a local file is illustrated as follows with detailed
screenshots:
1. Click on the Connect to Data Link option from the Data menu.
2. Select the data source type.
3. Select the file or website to which you want to connect.
4. For a Microsoft Access, Microsoft Excel, or a text file, determine
whether the connection is to one table or multiple tables or it requires
a custom SQL connection:
5. When all the selections have been made, click on Ok.
We will connect to the World Bank's environment indicators. You can
download this data, which is formatted either for Microsoft Excel or as a
text file, at www.worldbank.org.
To load the file into Tableau Public, you can download the Tableau Public
workbook by visiting https://public.tableau.com/profile/tableau.data.
stories#!/.
1. Open a new instance of Tableau Public.
2. From the Connect pane, click on the data file type to which you'd like
to connect. In this case, we are using an excel file.
3. Browse to the file to which you would like to connect.
4. Drag a table from the list of tables, which is a list of different
worksheets in this case, along with the workbook onto the workspace.
5. Note that the values in the data source are now populating the space
below the workspace, but at least with this data set, there is no
complete set of field headers. We will edit the data source by using
the data interpreter in the next exercise.
Connecting to web-based data sources
The steps required to connect to OData are different from the steps required
to connect to the previously mentioned sources because they involve web
servers and network security. These steps are a subset of the steps in
Desktop Professional that are used to connect to a server:
1. Enter the URL of the website.
2. Select the authentication method.
3. Establish the connection.
4. Name the data source.
In order to refresh a web-based data source, perform the following steps:
1. Right-click on the data source name in the data pane.
2. Click on Edit Connection.
3. In the previous dialog box, which will be populated with the
connection parameters, click on the Connect button in step 3 of the
preceding list.
7.7 TABLEAU
Connect data source: Connect Tableau to any data source like MS-Excel, MySQL, and Oracle. Tableau connects data in two ways: Live connect and Extract.
Analyze & Visualize: Analyze the data by filtering and sorting, and visualize the data using the relevant chart provided. Tableau automatically analyzes the data as follows:
Dimensions: Tableau treats any field containing qualitative, categorical information as a dimension. All dimensions are indicated by the “Blue” color.
Measures: Tableau treats any field containing numeric information as a measure. All measures are indicated by the “Green” color. Tableau suggests recommended charts based on the dimensions and measures.
Share: Tableau Dashboards can be shared as word documents, pdf files,
images.
Maps in Tableau
Tableau automatically assigns geographic roles to fields with common geographic names, such as Country, State/Province, City, etc. Fields with a geographic role will automatically generate longitude and latitude coordinates on a map view. Fields with assigned geographic roles will have a globe icon next to them. In Tableau, there are two types of maps to choose from when creating a visualization with geographic data:
❖ Filled Maps.
Symbol Maps: These are simple maps that use a type of mark to represent
a data point, such as a filled circle.
['us states data 3$'].[Region] AS [Region]
FROM ['us states data 3$']
UNION
SELECT ['us states data 4$'].[State] AS [State],
['us states data 4$'].[Population] AS [Population],
['us states data 4$'].[Region] AS [Region]
FROM ['us states data 4$']
Steps to visualize using SQL in Tableau:
1. The following Excel workbook consists of 4 sheets containing the US population state-wise.
3. Now your Tableau window looks like this with an additional sheet
named as New Custom SQL.
4. Click on New Custom SQL then you will be prompted to Edit Custom SQL as shown.
10. Now visualize the data derived.
Live data implementation in Tableau
6. After clicking on Map Services you will be prompted with a window as shown
8. Now click on OK and then close the WMS server window pane and a
map can be seen as shown
9. Change the map type to Filled map.
10. Change the color transparency to less than 20% so that the required
map is visible.
1. The data is about the various products sold at a fashion store in the cities of different states of the United States.
2. Start by opening Tableau and connecting the data, which is an Excel file, to Tableau.
3. Click on sheet1 which is at the bottom of the page to visualize the data
4. Now to the left of the screen you can see a small window showing
Dimensions and Measures. The dimensions of the data are state, city,
store name, product, SKU number, year, quarter, month and week.
The measures of the data are Quantity sold and sales revenue
5. You can also see rows and columns at the top of the page, now drag
sales revenue from the measures and drop it in the rows section. You
can see a bar chart with single bar showing sales revenues
6. Now drag states from the dimension to the column section, you can
see a bar chart showing the sales revenues of different states
7. Now click on Show Me button on the top right corner of the screen,
you can see various maps suggested by the Tableau tool to visualize
the data. The maps which are bright in color shows us that those maps
are suitable to visualize the given data
8. Select the Tree Map by clicking it, you will see a Tree Map
representing various states in green color with different shades
9. The rectangle which is large and bright represents the highest quantity
according to sales revenue and similarly the smaller and lighter shade
of the rectangles show less sales revenue
10. Now select the city tab from the dimension section and drop it on the
Label in the Marks section. You can observe that the rectangles are
split into even smaller rectangles according to the cities in each state
11. Now select the product from the dimension section and drop it on the
Label in the Marks section. You can observe that the rectangles are
split into even smaller rectangles according to the products.
12. Now select the Quantity sold tab from the measures section and drop it on the Label in the Marks section. You can see the actual quantity sold of each product.
DatatreeMap.xlsx
2. The data with states and their geographical data is shown in Tableau.
3. Drag and drop states onto columns and Total Area onto rows.
4. Drag and drop state onto color and total area onto tool tip.
5. Now select the filled map. Then the map is represented with the
geographical data of each state as shown below.
Visualization of Combo and Gauge Charts using QlikView
1. For this we have taken two different datasets.
• Dataset 1 (Cost_Student): contains data representing the amount spent by countries on a student per year for primary-to-secondary education and post-secondary education.
• Dataset 2 (Sales_Rep): contains data representing the sales made by the representatives of a stationery chain around the USA over particular dates.
4. On closing the pop-up, the user will be able to see a blank Main sheet. Now have a quick look at the toolbar.
5. On the toolbar, below the Selections label, click on the Edit Script icon (the icon looks like a pen on paper).
6. A new Edit Script window opens up (we use this window to load an Excel file). At the bottom of the window, under Data from Files, click the Table Files button, choose the saved Excel file, then click Open.
7. A new File Wizard Type window opens up. Now select the Tables drop-down and choose Cost_Student if it has not defaulted, and select the Labels drop-down and choose Embedded Labels. Click Next until the Next button is disabled (if it is disabled, you are on the final screen). Select the Load All check box and click Finish.
8. On Finish, you will be able to see the Edit Script screen with the script added to load the Excel file (script shown below). On the Edit Script window toolbar, below File, click the Reload icon; it asks you to save the file.
LOAD * FROM [C:\Manusha\DataViz\DataViz-Qlik\QlikView_Dataset.xls] (biff, embedded labels, table is Cost_Per_Student$);
9. Now a sheet property window opens, please choose Country (Which
works as a filter).
10. Right Click anywhere on the sheet, a window opens choose New
Sheet Object and select Table box. New window open again, add all
the available fields and click apply. You will be able to see a table box
with content same as to excel sheet.
11. Right Click anywhere on the sheet, a window opens choose New
Sheet Object and select Chart. New window open again, Select
Combo Chart from chart type and click next.
12. Choose Country as the dimension and click Next; now choose the expressions. Expression 1, for the bar chart, is Sum(PRSCIUSD); now click the Add button to add one more expression, Sum(PSCIUSD), and click Finish.
13. Now we can see the Combo Chart.
14. If you observe, both pre-secondary and post-secondary are on the same axis. Let us split them. Right-click on the chart and choose the properties, navigate to the Axes tab, select the Post-Secondary expression, choose the position Right (Top), and click Apply and OK. You will see charts similar to below.
15. Now click on layout shown in the tool bar and add new sheet. A new
sheet will open.
16. Repeat step 7 to step 12; in step 9 choose the Sales_Rep sheet, and in step 11 choose Rep for the field displayed in list boxes.
17. Right Click anywhere on the sheet, a window opens choose New
Sheet Object and select Chart. New window open again, Select Gauge
from chart type and click next.
18. Gauge charts have no dimensions, so we do not select any dimension,
will be moving directly to expression.
19. Since we are calculating the performance, we add a division between
sales and targets (Sum (sales)/sum (targets)), add label for expression
(Sales Achieved) finish you will see a chart like below.
20. Let us make it more readable: right-click, choose the properties, navigate to Presentation, and add a segment in Segment Setup (we can go with 2 segments as well; for clear user readability we choose 3). Change the color of each segment by choosing a color band (Red, Yellow, Green).
21. Now look for the Show Scale section (left, below Segment Setup); add major units of 4 and show labels on every major unit, with minor units of 1 (select the Show Scale check box if it is not selected).
22. Now switch to the Number tab; since we are representing the value as a percentage, select the radio button Fixed to, select the check box Show in Percentage, and click Apply.
23. In order to see the needle value (the speedometer's current value), right-click on the chart and select Properties, move to the Presentation tab, select the Add button next to Text in Chart (located at the bottom-right corner of the window), add the formula =num(Sum(Sales)/sum(Target),'##%'), and click Apply; the value appears in the top-left corner of the chart. Press CTRL+SHIFT to drag the text anywhere in the chart. The final Gauge chart, along with its data, looks like below.
With Tableau public, you are able to organize your data in order to tell a
meaningful story. This is beneficial when you are doing a presentation,
creating an article, or uploading to a website, as it helps your audience
understand your data. Stories are created through assembling the different
worksheets and dashboards. We can highlight important data points, add
text box and pictures to help convey our story. We will use our health
expenditure worksheets to create a tailored story and illustrate the
changes in Canada’s spending in a meaningful way. To begin, select “New
Story” at the bottom right of your screen
Use the arrows located on the side of the caption field to navigate to Sheet
2. Click on “Add a caption” and rename Sheet 2 to “Provincial Health
Expenditure from 1975-2018”.
In this story, we are going to narrow in and draw attention to the province
or territory that is spending the most amount of money on health. Drag an
additional copy of “Sheet 1” and drop it between the two existing sheets.
Select “Add a caption” and rename it to “Ontario”.
On the map, click on the province Ontario and then navigate to the caption
field and select “Update”. Your screen will show Ontario highlighted from
the rest of Canada.
We can add a textbox to label the highlighted point by dragging “Drag to
add text” on to the line graph. Write a key message in the textbox, such as
“Ontario had the highest health expenditure in Canada in 2016, spending
$87,195.70M”. Select “OK”.
You can then edit the text box by selecting “More options”, which will open a drop-down menu. Expand the text box by dragging the borders in order to show the full message.
We have now created a story with three sheets of how Ontario had the
highest health expenditure in the year 2016. If you choose to add a
dashboard, it will allow your audience to play with data. You can navigate
between the story as shown below:
7.11 SAVING AND PUBLISHING YOUR TABLEAU PUBLIC WORKBOOK
Once satisfied with your workbook, which includes sheets, dashboards, and
stories, you can publish it to the Tableau Public website. This is the only
way to save your work when using Tableau Public, so make sure to do it if
you wish to return to the workbook in the future. Once ready to publish,
select the “Save to Tableau Public As…” option under the “File” tab.
From here, you will likely be prompted to log in to the Tableau Public
website. You can create an account for free if you have not already.
Your workbook will then be uploaded to the Tableau Public server. This may take a couple of minutes.
On Tableau Public, all saved data visualizations are uploaded and available
to all other users. This means you can use other users’ work for your
website, presentations, or research as well. To search for interesting data,
there is a search function (D) at the top right corner of the webpage, along
with highlighted visualizations (A), authors (B), and blogs (C).
When you wish to access your published workbook, or any public
workbooks you’ve downloaded, in the future, simply open the Tableau
Public application and your workbooks will be there waiting for you!
Dashboards are a great way to combine your data visualizations and have
them interact with one another. A lot of businesses use dashboards to keep
up-to-date in real time about key performance indicators at a glance. In this
example, we will combine just two of our data visualizations, the map and
the line graph from the first section of the tutorial, but in reality it can be used to combine many visualizations at once.
The first step in creating your dashboard is to open up the Dashboard tab at
the bottom of the screen:
This is your Dashboard Sheet. On the left side you can see that there is a list
of the sheets you have made from your current data source. To build your
dashboard, drag the sheet you want in to the centre where it says Drop sheets
here. For our purposes, we will need to drag Sheet 1 and Sheet 2 where the
map and line graph are saved. When you drag, you will notice an area of
your screen will shade over where your graph will drop when you put it
down. Organize your dashboard to look like the following:
Now to add titles to the graphs that were chosen, double click on the
automatic titles generated based on the sheet name, and a new window
should appear, type in a title that describes the graph like so:
Now, to add an interactive layer between the graphs, we can choose a graph
that can act as a filter to the other. We will choose the line graph to act as a
filter to the map. To do this, click on the line graph and a grey sidebar should
appear. From this bar, click the filter icon to use this graph as a filter:
Now, when you click a given line, it will be highlighted on the above map:
7.13 QUESTIONS
7.14 QUIZ
12. What Are The Components Of A Dashboard?
a. Horizontal
b. Vertical
c. Image Extract
d. All of the above
13. How Many Maximum Tables Can You Join In Tableau?
a. 2
b. 8
c. 16
d. 32
14. Which of the following is not a Trend Line model?
a. Linear Trend Line
b. Binomial Trend Line
c. Exponential Trend Line
d. Logarithmic Trend Line
15. For creating variable size bins we use ______
a. Sets
b. Groups
c. Calculated fields
d. Table Calculations
16. Can Tableau be installed on MacOS?
a. True
b. False
17. Default aggregation used for tree map
a. Avg
b. Sum
c. Count
d. Countd
18. What are different Tableau products?
a. Server
b. Reader
c. Online
d. All of above
19. Which one is not the best fitment for Tableau
a. Self Service BI
b. Mobile enablement
c. Big Data connectivity
d. Enterprise Deployment
20. Which type of chart is not available in Tableau V8
a. Pie Chart
b. Speed Dial
c. Bullet Chart
d. Bubble Chart
7.16 MOOCS
7.17 REFERENCES
1. https://www.techtarget.com/searchbusinessanalytics/definition/data-visualization
2. https://www.ibm.com/cloud/learn/data-visualization
3. https://www.tableau.com/learn/articles/data-visualization
4. https://policyviz.com/wp-content/uploads/woocommerce_uploads/2017/07/A-Guide-to-Advanced-Data-Visualization-in-Excel-2016-Final.pdf