BDA Lab Manual
Course Objectives
1. The purpose of this course is to provide students with knowledge of Big Data Analytics principles and techniques.
2. This course is also designed to give exposure to the frontiers of Big Data Analytics.
Course Outcomes
1. Use Excel as an analytical and visualization tool.
2. Ability to program using Hadoop and MapReduce.
3. Ability to perform data analytics using ML in R.
4. Use Cassandra to perform social media analytics.
List of Experiments
1. Implement a simple map-reduce job that builds an inverted index on the set of input
documents (Hadoop)
2. Process big data in HBase
3. Store and retrieve data in Pig
4. Perform social media analysis using Cassandra
5. Buyer event analytics using Cassandra on suitable product sales data.
6. Using Power Pivot (Excel), perform the following on any dataset
a) Big Data Analytics
b) Big Data Charting
7. Use R-Project to carry out statistical analysis of big data
8. Use R-Project for data visualization of social media data
TEXT BOOKS:
1. Big Data Analytics, Seema Acharya, Subhashini Chellappan, Wiley 2015.
2. Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today's Business, Michael Minelli, Michele Chambers, Ambiga Dhiraj, 1st Edition, Wiley CIO Series, 2013.
3. Hadoop: The Definitive Guide, Tom White, 3rd Edition, O'Reilly Media, 2012.
4. Big Data Analytics: Disruptive Technologies for Changing the Game, Arvind Sathi, 1st Edition, IBM Corporation, 2012.
REFERENCES:
1. Big Data and Business Analytics, Jay Liebowitz, Auerbach Publications, CRC press (2013).
2. Using R to Unlock the Value of Big Data: Big Data Analytics with Oracle R Enterprise and
Oracle R Connector for Hadoop, Tom Plunkett, Mark Hornick, McGraw-Hill/Osborne Media
(2013), Oracle press.
3. Professional Hadoop Solutions, Boris Lublinsky, Kevin T. Smith, Alexey Yakubovich, Wiley, ISBN: 9788126551071, 2015.
4. Understanding Big Data, Chris Eaton, Dirk deRoos et al., McGraw Hill, 2012.
5. Intelligent Data Analysis, Michael Berthold, David J. Hand, Springer, 2007.
6. Taming the Big Data Tidal Wave: Finding Opportunities in Huge Data Streams
with Advanced Analytics, Bill Franks, 1st Edition, Wiley and SAS Business Series,
2012.
Experiment-01
Implement Hadoop step-by-step
Preparations
A. Make sure that you are using Windows 10 and are logged in as an administrator.
G. Run the Java installation file jdk-8u191-windows-x64. Install it directly into the folder C:\Java, or move the items from the folder jdk1.8.0 to the folder C:\Java. It should look like this:
A new window will open with two tables and buttons. The upper table is for User
variables and the lower for System variables.
B. Make a New User variable [1]. Name it JAVA_HOME and set it to the Java bin-folder [2].
Click OK [3].
C. Make another New User variable [1]. Name it HADOOP_HOME and set it to the
hadoop-2.8.0 bin-folder [2]. Click OK [3].
D. Now add Java and Hadoop to the System variables Path: go to Path [1] and click Edit [2]. The editor window opens. Choose New [3] and add the address C:\Java\bin [4]. Choose New again [5] and add the address C:\hadoop-2.8.0\bin [6]. Click OK [7] in the editor window and OK [8] to save the System variables.
1.1 Configuration
A. Go to the file C:\hadoop-2.8.0\etc\hadoop\core-site.xml [1]. Right-click on the file and edit it with Notepad++ [2]. Add the following property inside the <configuration> tags:
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
Edit mapred-site.xml in the same folder and add:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
D. Under C:\hadoop-2.8.0 create a folder named data [1] with two subfolders, “datanode” and “namenode” [2]. Then edit hdfs-site.xml and add:
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>C:\hadoop-2.8.0\data\namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>C:\hadoop-2.8.0\data\datanode</value>
</property>
Finally, edit yarn-site.xml and add:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
B. Delete the bin folder C:\hadoop-2.8.0\bin [1, 2] and replace it with the new bin folder from Hadoop Configuration.zip [3].
1.2 Testing
A. Search for cmd [1] and open the Command Prompt [2]. Type
hdfs namenode -format [3] and press Enter.
If this first test works, the Command Prompt will print a lot of information. That is a good sign!
B. Now change directory in the Command Prompt. Type cd C:\hadoop-2.8.0\sbin
and press Enter. In the sbin folder, type start-all.cmd and press Enter.
If the configuration is right, four windows will start running (NameNode, DataNode, ResourceManager and NodeManager) and it will look something like this:
C. Now open a browser, type localhost:8088 in the address field and press Enter. Can you see the little Hadoop elephant? Then you have done a really good job!
D. Last test - try localhost:50070 instead.
If you can see the overview page, you have implemented Hadoop on your PC.
Congratulations, you did it!
***********************
To close the running programs, run stop-all.cmd in the Command Prompt.
Experiment-02
Hadoop Shell Commands
1. DFShell
The HDFS shell is invoked by bin/hadoop dfs <args>. All the HDFS shell commands take path URIs as arguments. The URI format is scheme://authority/path. For HDFS the scheme is hdfs, and for the local filesystem the scheme is file. The scheme and authority are optional; if not specified, the default scheme given in the configuration is used. An HDFS file or directory such as /parent/child can be specified as hdfs://namenode:namenodeport/parent/child or simply as /parent/child (given that your configuration is set to point to namenode:namenodeport). Most of the commands in the HDFS shell behave like the corresponding Unix commands; differences are described with each command. Error information is sent to stderr and the output is sent to stdout.
2. cat
Usage: hadoop dfs -cat URI [URI …]
Copies source paths to stdout. Example:
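The paths below are illustrative, following the sample paths used for the other commands in this section:
• hadoop dfs -cat /user/hadoop/file1 /user/hadoop/file2
• hadoop dfs -cat hdfs://host:port/user/hadoop/file3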
Exit Code:
Returns 0 on success and -1 on error.
3. chgrp
Usage: hadoop dfs -chgrp [-R] GROUP URI [URI …]
Change group association of files. With -R, make the change recursively through the directory
structure. The user must be the owner of files, or else a super-user. Additional information is in
the Permissions User Guide.
4. chmod
Usage: hadoop dfs -chmod [-R] <MODE[,MODE]... | OCTALMODE> URI [URI …]
Change the permissions of files. With -R, make the change recursively through the directory structure.
The user must be the owner of the file, or else a super-user. Additional information
is in the Permissions User Guide.
5. chown
Usage: hadoop dfs -chown [-R] [OWNER][:[GROUP]] URI [URI …]
Change the owner of files. With -R, make the change recursively through the directory structure.
The user must be a super-user. Additional information is in the Permissions User Guide.
6. copyFromLocal
Usage: hadoop dfs -copyFromLocal <localsrc> URI
Similar to put command, except that the source is restricted to a local file reference.
7. copyToLocal
Usage: hadoop dfs -copyToLocal [-ignorecrc] [-crc] URI
<localdst>
Similar to get command, except that the destination is restricted to a local file reference.
8. cp
Usage: hadoop dfs -cp URI [URI …] <dest>
Copy files from source to destination. This command allows multiple sources as well in which
case the destination must be a directory.
Example:
• hadoop dfs -cp /user/hadoop/file1 /user/hadoop/file2
• hadoop dfs -cp /user/hadoop/file1 /user/hadoop/file2 /user/hadoop/dir
Exit Code:
Returns 0 on success and -1 on error.
9. du
Usage: hadoop dfs -du URI [URI …]
Displays the aggregate length of files contained in the directory, or the length of a file in case it's just a file.
Example:
hadoop dfs -du /user/hadoop/dir1 /user/hadoop/file1
hdfs://host:port/user/hadoop/dir1
Exit Code:
Returns 0 on success and -1 on error.
10. dus
Usage: hadoop dfs -dus <args>
Displays a summary of file lengths.
11. expunge
Usage: hadoop dfs -expunge
Empty the Trash. Refer to HDFS Design for more information on Trash feature.
12. get
Usage: hadoop dfs -get [-ignorecrc] [-crc] <src> <localdst>
Copy files to the local file system. Files that fail the CRC check may be copied with the
-ignorecrc option. Files and CRCs may be copied using the -crc option. Example:
• hadoop dfs -get /user/hadoop/file localfile
• hadoop dfs -get hdfs://host:port/user/hadoop/file localfile
Exit Code:
Returns 0 on success and -1 on error.
13. getmerge
Usage: hadoop dfs -getmerge <src> <localdst> [addnl]
Takes a source directory and a destination file as input and concatenates files in src into the destination
local file. Optionally addnl can be set to enable adding a newline character at the end of each file.
14. ls
Usage: hadoop dfs -ls <args>
For a file returns stat on the file with the following format:
filename <number of replicas> filesize modification_date
modification_time permissions userid groupid
For a directory it returns a list of its direct children, as in Unix. A directory is listed as:
dirname <dir> modification_date modification_time permissions userid groupid
Example:
hadoop dfs -ls /user/hadoop/file1 /user/hadoop/file2 hdfs://host:port/user/hadoop/dir1 /nonexistentfile
Exit Code:
Returns 0 on success and -1 on error.
15. lsr
Usage: hadoop dfs -lsr <args>
Recursive version of ls. Similar to Unix ls -R.
16. mkdir
Usage: hadoop dfs -mkdir <paths>
Takes path uri's as argument and creates directories. The behavior is much like unix mkdir -p creating
parent directories along the path.
Example:
• hadoop dfs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
• hadoop dfs -mkdir hdfs://host1:port1/user/hadoop/dir
hdfs://host2:port2/user/hadoop/dir
Exit Code:
Returns 0 on success and -1 on error.
17. moveFromLocal
Usage: dfs -moveFromLocal <src> <dst>
Displays a "not implemented" message.
18. mv
Usage: hadoop dfs -mv URI [URI …] <dest>
Moves files from source to destination. This command allows multiple sources as well in which case
the destination needs to be a directory. Moving files across filesystems is not permitted.
Example:
• hadoop dfs -mv /user/hadoop/file1 /user/hadoop/file2
• hadoop dfs -mv hdfs://host:port/file1 hdfs://host:port/file2 hdfs://host:port/file3
hdfs://host:port/dir1
Exit Code:
Returns 0 on success and -1 on error.
19. put
Usage: hadoop dfs -put <localsrc> ... <dst>
Copy single src, or multiple srcs from local file system to the destination filesystem. Also reads input
from stdin and writes to destination filesystem.
• hadoop dfs -put localfile /user/hadoop/hadoopfile
• hadoop dfs -put localfile1 localfile2 /user/hadoop/hadoopdir
• hadoop dfs -put localfile hdfs://host:port/hadoop/hadoopfile
• hadoop dfs -put - hdfs://host:port/hadoop/hadoopfile
Reads the input from stdin.
Exit Code:
Returns 0 on success and -1 on error.
20. rm
Usage: hadoop dfs -rm URI [URI …]
Delete files specified as args. Only deletes files and empty directories; refer to rmr for recursive deletes.
Example:
• hadoop dfs -rm hdfs://host:port/file /user/hadoop/emptydir
Exit Code:
Returns 0 on success and -1 on error.
21. rmr
Usage: hadoop dfs -rmr URI [URI …]
Recursive version of delete. Example:
• hadoop dfs -rmr /user/hadoop/dir
• hadoop dfs -rmr hdfs://host:port/user/hadoop/dir
Exit Code:
Returns 0 on success and -1 on error.
22. setrep
Usage: hadoop dfs -setrep [-R] [-w] <rep> <path>
Changes the replication factor of a file. -R option is for recursively increasing the replication factor of
files within a directory.
Example:
• hadoop dfs -setrep -w 3 -R /user/hadoop/dir1
Exit Code:
Returns 0 on success and -1 on error.
23. stat
Usage: hadoop dfs -stat URI [URI …]
Returns the stat information on the path. Example:
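With an illustrative path, following the other examples in this section:
• hadoop dfs -stat /user/hadoop/dir1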
24. tail
Usage: hadoop dfs -tail [-f] URI
Displays last kilobyte of the file to stdout. -f option can be used as in Unix. Example:
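With an illustrative path:
• hadoop dfs -tail /user/hadoop/file1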
25. test
Usage: hadoop dfs -test -[ezd] URI
Options:
-e check to see if the file exists. Return 0 if true.
-z check to see if the file is zero length. Return 0 if true.
-d check whether the path is a directory; returns 1 if it is, else 0.
Example:
• hadoop dfs -test -e filename
26. text
Usage: hadoop dfs -text <src>
Takes a source file and outputs the file in text format. The allowed formats are zip and
TextRecordInputStream.
27. touchz
Usage: hadoop dfs -touchz URI [URI …]
Create a file of zero length. Example:
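An illustrative invocation:
• hadoop dfs -touchz /user/hadoop/emptyfile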
Experiment-03
Implement a simple MapReduce job that builds an inverted index on the set of input documents (Hadoop)
Theory:
Hadoop is an Apache open-source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models. A Hadoop framework application works in an environment that provides distributed storage and computation across clusters of computers. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.
Hadoop Architecture
As the sequence of the name MapReduce implies, the reduce task is always
performed after the map job.
The major advantage of MapReduce is that it is easy to scale data processing over
multiple computing nodes. Under the MapReduce model, the data processing
primitives are called mappers and reducers.
Decomposing a data processing application into mappers and reducers is
sometimes nontrivial. But, once we write an application in the MapReduce form,
scaling the application to run over hundreds, thousands, or even tens of thousands
of machines in a cluster is merely a configuration change. This simple scalability
is what has attracted many programmers to use the MapReduce model.
The Algorithm
The MapReduce framework operates on <key, value> pairs, that is, the framework
views the input to the job as a set of <key, value> pairs and produces a set of <key,
value> pairs as the output of the job, conceivably of different types.
The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework. The input and output types of a MapReduce job are:
(Input) <k1, v1> → map → <k2, v2> → reduce → <k3, v3> (Output)
cd Desktop
mkdir Lab
mkdir Lab/Input
mkdir Lab/tutorial_classes
4. Add the file attached with this document “input.txt” in the directory Lab/Input.
5. Type the following command to export the hadoop classpath into bash.
export HADOOP_CLASSPATH=$(hadoop classpath)
Make sure it is now exported.
echo $HADOOP_CLASSPATH
6. It is time to create these directories on HDFS rather than locally. Type the following commands.
hadoop fs -mkdir /WordCountTutorial
hadoop fs -mkdir /WordCountTutorial/Input
hadoop fs -put Lab/Input/input.txt /WordCountTutorial/Input
7. Go to localhost:9870 in the browser, open “Utilities → Browse File System” and you should see the directories and files we placed in the file system.
8. Then, go back to the local machine, where we will compile the WordCount.java file. Assume we are currently in the Desktop directory.
cd Lab
javac -classpath $HADOOP_CLASSPATH -d tutorial_classes WordCount.java
Put the output files in one jar file (note the dot at the end):
jar -cvf WordCount.jar -C tutorial_classes .
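Note: the WordCount.java file compiled above is not reproduced in this manual. A minimal word-count job consistent with those commands might look like the sketch below (it follows the standard Hadoop example; class and variable names are assumptions):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every token of every input line
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer (also used as combiner): sums the counts for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With that class (kept in the default package), the job can typically be run as:
hadoop jar WordCount.jar WordCount /WordCountTutorial/Input /WordCountTutorial/Output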
Program:
First create the IndexMapper.java class
package mr03.inverted_index;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Text wordAtFileNameKey = new Text();
    private static final Text ONE_STRING = new Text("1");

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The input split tells us which document this line came from
        FileSplit split = (FileSplit) context.getInputSplit();
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            String fileName = split.getPath().getName().split("\\.")[0];
            // Optionally remove special characters using
            // tokenizer.nextToken().replaceAll("[^a-zA-Z]", "").toLowerCase()
            // and check for empty words
            wordAtFileNameKey.set(tokenizer.nextToken() + "@" + fileName);
            context.write(wordAtFileNameKey, ONE_STRING);
        }
    }
}
IndexReducer.java
package mr03.inverted_index;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
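The reducer body itself is not reproduced in the manual. A complete minimal sketch, consistent with the mapper and combiner in this program (the output separator is an assumption), is:

package mr03.inverted_index;

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IndexReducer extends Reducer<Text, Text, Text, Text> {

    private final Text result = new Text();

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Concatenate the "fileName:count" postings produced by the combiner
        StringBuilder postings = new StringBuilder();
        for (Text value : values) {
            if (postings.length() > 0) {
                postings.append(" | ");  // separator between documents (assumed)
            }
            postings.append(value.toString());
        }
        result.set(postings.toString());
        context.write(key, result);
    }
}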
IndexDriver.java
package mr03.inverted_index;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IndexDriver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Input and output paths are taken from the command line
        String input = args[0];
        String output = args[1];

        // Delete the output directory if it already exists
        FileSystem fs = FileSystem.get(conf);
        boolean exists = fs.exists(new Path(output));
        if (exists) {
            fs.delete(new Path(output), true);
        }

        Job job = Job.getInstance(conf);
        job.setJarByClass(IndexDriver.class);
        job.setMapperClass(IndexMapper.class);
        job.setCombinerClass(IndexCombiner.class);
        job.setReducerClass(IndexReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(input));
        FileOutputFormat.setOutputPath(job, new Path(output));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
IndexCombiner.java
package mr03.inverted_index;

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IndexCombiner extends Reducer<Text, Text, Text, Text> {

    private final Text fileAtWordFreqValue = new Text();

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Sum the "1" counts emitted by the mapper for this word@fileName key
        int sum = 0;
        for (Text value : values) {
            sum += Integer.parseInt(value.toString());
        }
        // Split "word@fileName" back apart and emit (word, "fileName:count")
        int splitIndex = key.toString().indexOf("@");
        fileAtWordFreqValue.set(key.toString().substring(splitIndex + 1) + ":" + sum);
        key.set(key.toString().substring(0, splitIndex));
        context.write(key, fileAtWordFreqValue);
    }
}
Output:
Experiment 4. Process big data in HBase
Aim: To create a table and process big data in HBase
Resources: Hadoop, Oracle VirtualBox, HBase
Theory:
HBase is an open-source, sorted-map data store built on top of Hadoop. It is column-oriented and horizontally scalable.
It is based on Google's Bigtable. It has a set of tables which keep data in key-value format. HBase is well suited for sparse data sets, which are very common in big data use cases. HBase provides APIs enabling development in practically any programming language. It is a part of the Hadoop ecosystem that provides random real-time read/write access to data in the Hadoop File System.
Why HBase rather than a traditional RDBMS:
● RDBMSs get exponentially slower as the data becomes large
● They expect data to be highly structured, i.e. to fit a well-defined schema
● Any change in schema might require downtime
● For sparse datasets, there is too much overhead in maintaining NULL values
Features of HBase
● Horizontally scalable: data is spread across region servers, and you can add nodes (and any number of columns) at any time.
● Automatic failover: automatic failover allows the system to switch data handling to a standby server in the event of a node failure.
● Integration with the Map/Reduce framework: all the commands and Java APIs internally use Map/Reduce to do their work, and HBase is built over the Hadoop Distributed File System.
● It is a sparse, distributed, persistent, multidimensional sorted map, indexed by row key, column key and timestamp.
● It is often referred to as a key-value store, a column-family-oriented database, or as storing versioned maps of maps.
● Fundamentally, it is a platform for storing and retrieving data with random access.
● It doesn't care about datatypes (you can store an integer in one row and a string in another for the same column).
● It doesn't enforce relationships within your data.
● It is designed to run on a cluster of computers, built using commodity hardware.
The Cloudera VM is recommended as it has HBase pre-installed.
Starting HBase: type hbase shell in the terminal to start the HBase shell.
hbase(main):001:0> version
version gives you the version of HBase.
Create Table Syntax
hbase(main):011:0> create 'newtbl','knowledge'
hbase(main):011:0> describe 'newtbl'
hbase(main):011:0> status
1 servers, 0 dead, 15.0000 average load
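Before working with the disable/enable commands below, you can insert and read a few rows. A short illustrative sequence on the table created above (the row key and column names are made up for this example):
hbase(main):012:0> put 'newtbl','r1','knowledge:sports','cricket'
hbase(main):013:0> put 'newtbl','r1','knowledge:science','physics'
hbase(main):014:0> scan 'newtbl'
hbase(main):015:0> get 'newtbl','r1'
hbase(main):016:0> disable 'newtbl'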
Verification
After disabling the table, you can still sense its existence through the list and exists commands, but you cannot scan it. It will give you the following error.
hbase(main):028:0> scan 'newtbl'
ROW COLUMN + CELL
ERROR: newtbl is disabled.
is_disabled
This command is used to find whether a table is disabled. Its syntax is as follows.
hbase> is_disabled 'table name'
disable_all
This command is used to disable all the tables matching the given regex. The syntax for
disable_all command is given below.
hbase> disable_all 'r.*'
Suppose there are 5 tables in HBase, namely raja, rajani, rajendra, rajesh, and raju. The following command will disable all the tables starting with raj.
hbase> disable_all 'raj.*'
raja
rajani
rajendra
rajesh
raju
Disable the above 5 tables (y/n)?
y
5 tables successfully disabled
Verification
After enabling the table (hbase> enable 'newtbl'), scan it. If you can see the schema, your table has been successfully enabled.
is_enabled
This command is used to find whether a table is enabled. Its syntax is as follows:
hbase> is_enabled 'table name'
The following code verifies whether the table named emp is enabled. If it is enabled, it
will return true and if not, it will return false.
hbase(main):031:0> is_enabled 'newtbl'
true
0 row(s) in 0.0440 seconds
describe
This command returns the description of the table. Its syntax is as follows:
hbase> describe 'table name'
Experiment-05: Store and retrieve data in Pig
Theory:
At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties:
● Optimization opportunities. The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
Pig Latin provides relational operators for, among other things, loading and storing data, filtering, sorting, and diagnostics.
Employee table:
ID   Name      Age  City
001  Angelina  22   LosAngeles
002  Jackie    23   Beijing
003  Deepika   22   Mumbai
004  Pawan     24   Hyderabad
005  Rajani    21   Chennai
006  Amitabh   22   Mumbai
Step-1: Create a directory in HDFS with the name pigdir in the required path using mkdir:
$ hdfs dfs -mkdir /bdalab/pigdir
Step-2: The input file of Pig contains each tuple/record on an individual line, with the entities separated by a delimiter (",").
In the local file system, create an input file student_data.txt containing the data shown below:
001,Jagruthi,21,Hyderabad,9.1
002,Praneeth,22,Chennai,8.6
003,Sujith,22,Mumbai,7.8
004,Sreeja,21,Bengaluru,9.2
005,Mahesh,24,Hyderabad,8.8
006,Rohit,22,Chennai,7.8
007,Sindhu,23,Mumbai,8.3
Similarly, create an input file employee_data.txt containing the data shown below:
001,Angelina,22,LosAngeles
002,Jackie,23,Beijing
003,Deepika,22,Mumbai
004,Pawan,24,Hyderabad
005,Rajani,21,Chennai
006,Amitabh,22,Mumbai
Step-3: Move the file from the local file system to HDFS using put (or copyFromLocal) and verify using the -cat command.
To get the path of the file student_data.txt, type: readlink -f student_data.txt
$ hdfs dfs -put /home/hadoop/Desktop/student_data.txt /bdalab/pigdir/
$ hdfs dfs -cat /bdalab/pigdir/student_data.txt
$ hdfs dfs -put /home/hadoop/Desktop/employee_data.txt /bdalab/pigdir/
Step-4: Apply the relational operator LOAD to load the data from the file student_data.txt into Pig by executing the following Pig Latin statements in the Grunt shell. Relational operators are NOT case sensitive.
$ pig   => will take you to the grunt> shell
grunt> student = LOAD '/bdalab/pigdir/student_data.txt' USING PigStorage(',') as ( id:int, name:chararray, age:int, city:chararray, cgpa:double );
grunt> employee = LOAD '/bdalab/pigdir/employee_data.txt' USING PigStorage(',') as ( id:int, name:chararray, age:int, city:chararray );
Step-5: Apply the relational operator STORE to store each relation in the HDFS directory "/pig_output/" as shown below (use a separate output sub-directory per relation, since STORE fails if the target directory already exists).
grunt> STORE student INTO '/bdalab/pigdir/pig_output/student' USING PigStorage(',');
grunt> STORE employee INTO '/bdalab/pigdir/pig_output/employee' USING PigStorage(',');
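To retrieve the data back (the second half of this experiment), the relations can be dumped or filtered in the Grunt shell. A short sketch using the relations defined above:
grunt> DUMP student;
grunt> mumbai_students = FILTER student BY city == 'Mumbai';
grunt> DUMP mumbai_students;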
Experiment-06: Perform social media analysis using Cassandra
Aim: To perform social media analysis using Cassandra
Resources: Cassandra
Procedure:
Cassandra is a distributed database for low-latency, high-throughput services that handle real-time workloads comprising hundreds of updates per second and tens of thousands of reads per second.
If one is coming from a relational database background with strong ACID semantics, then one
must take the time to understand the eventual consistency model.
Understand Cassandra’s architecture very well and what it does under the hood. With
Cassandra 2.0 you get lightweight transaction and triggers, but they are not the same as the
traditional database transactions one might be familiar with. For example, there are no foreign
key constraints available – it has to be handled by one’s own application. Understanding one’s
use cases and data access patterns clearly before modeling data with Cassandra and to read all
the available documentation is a must.
Capture
This command captures the output of a command and adds it to a file. For example, take a look
at the following code that captures the output to a file named Outputfile.
cqlsh> CAPTURE '/home/hadoop/CassandraProgs/Outputfile'
When we type any command in the terminal, the output will be captured by the file
given. Given below is the command used and the snapshot of the output file.
cqlsh:tutorialspoint> select * from emp;
Consistency
This command shows the current consistency level, or sets a new consistency level.
cqlsh:tutorialspoint> CONSISTENCY
Current consistency level is 1.
Copy
This command copies data to and from Cassandra to a file. Given below is an
example to copy the table named emp to the file myfile.
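A typical invocation for the emp table used in these examples (the column list is assumed from the rows shown later in this section):
cqlsh:tutorialspoint> COPY emp (emp_id, emp_name, emp_city, emp_phone, emp_sal) TO 'myfile';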
Describe
This command describes the current cluster of Cassandra and its objects. The variants of this command are explained below.
Describe cluster − This command provides information about the cluster, for example the ring's range ownership:
Range ownership:
-658380912249644557 [127.0.0.1]
-2833890865268921414 [127.0.0.1]
-6792159006375935836 [127.0.0.1]
Describe keyspaces − This command lists all the keyspaces in a cluster. Given below is the usage of this command.
Describe type − This command is used to describe a user-defined data type. Given below is the usage of this command.
Describe types − This command lists all the user-defined data types. Given below is the usage of this command. Assume there are two user-defined data types: card and card_details.
card_details card
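Typical invocations of these variants (the type name is taken from the example above):
cqlsh> DESCRIBE cluster;
cqlsh> DESCRIBE keyspaces;
cqlsh> DESCRIBE TYPE card_details;
cqlsh> DESCRIBE TYPES;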
Expand
This command is used to expand the output. Before using this command, you have to turn the
expand command on. Given below is the usage of this command.
@ Row 1
-----------+------------
emp_id | 1
emp_city | Hyderabad
emp_name | ram
emp_phone | 9848022338
emp_sal | 50000
@ Row 2
-----------+------------
emp_id | 2
emp_city | Delhi
emp_name | robin
emp_phone | 9848022339
emp_sal | 50000
@ Row 3
-----------+------------
emp_id | 4
emp_city | Pune
emp_name | rajeev
emp_phone | 9848022331
emp_sal | 30000
@ Row 4
-----------+------------
emp_id | 3
emp_city | Chennai
emp_name | rahman
emp_phone | 9848022330
emp_sal | 50000
(4 rows)
Note − You can turn the expand option off using the following command.
cqlsh:tutorialspoint> expand off;
Disabled Expanded output.
Exit
This command is used to terminate the cql shell.
Show
This command displays the details of the current cqlsh session, such as the Cassandra version, host, or data type assumptions. Given below is the usage of this command.
Source
Using this command, you can execute the commands in a file. Suppose our input file is as follows −
Then you can execute the file containing the commands as shown below.
Experiment-07: Buyer event analytics using Cassandra on suitable product sales data
Aim: To perform buyer event analytics using Cassandra on sales data
Theory:
Users can access Cassandra through its nodes using Cassandra Query Language (CQL). CQL
treats the database (Keyspace) as a container of tables. Programmers use cqlsh: a prompt to
work with CQL or separate application language drivers.
Clients approach any of the nodes for their read-write operations. That node (coordinator) plays
a proxy between the client and the nodes holding the data.
Write Operations
Every write activity of nodes is captured by the commit logs written in the nodes. Later the
data will be captured and stored in the mem-table. Whenever the mem-table is full, data will
be written into the SStable data file. All writes are automatically partitioned and replicated
throughout the cluster. Cassandra periodically consolidates the SSTables, discarding
unnecessary data.
Read Operations
During read operations, Cassandra gets values from the mem-table and checks the bloom filter
to find the appropriate SSTable that holds the required data.
Apache is an open-source software foundation, best known for its web server that delivers web-related content over the internet and has gained huge popularity over the last few years as the most used web server software. Cassandra is an open-source database management system with the capacity to handle large amounts of data across servers. It was first developed by Facebook for the inbox search feature and was released as an open-source project back in 2008.
The following year, Cassandra became a part of the Apache incubator, and combined with Apache it has reached new heights. To put it in simple terms, Apache Cassandra is a powerful open-source distributed database system that can work efficiently to handle massive amounts of data.
DATA MODELLING
The way data is modeled is a major difference between Cassandra and MySQL.
Let us consider a platform where users can post, and where you have commented on a post of another user. In these two databases, the information will be stored differently. In Cassandra, you can store the data in a single table: the comments for each user are stored in the form of a collection (such as a list or map) in the user's row.
In MySQL, you have to make two tables with a one-to-many relationship between them, because MySQL does not permit unstructured data such as a list or a map.
READ PERFORMANCE
The query to retrieve the comments made by a user (for example, user 5) in MySQL filters or joins the comments table by the user id. When you utilize indexing in MySQL, the data is kept in a binary tree, so finding the matching rows costs extra lookups. In Cassandra, all the comments of a user are stored together, so they can be fetched with one lookup.
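A sketch of the two retrieval queries (table and column names here are only illustrative, not from the manual):
-- MySQL: comments sit in their own table and are filtered by the user id
SELECT * FROM comments WHERE user_id = 5;
-- Cassandra (CQL): the comments of a user are clustered in one partition
SELECT * FROM comments_by_user WHERE user_id = 5;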
WRITE PERFORMANCE
In MySQL, every update must 1. find the existing row and 2. then update it.
Cassandra leverages an append-only model: insert and update have no fundamental difference. If you insert a row that has the same primary key as an existing row, the row is replaced; if you update a row with a non-existent primary key, Cassandra creates the row. Cassandra is very fast and stores large swathes of data on commodity hardware.
TRANSACTIONS
MySQL facilitates ACID transactions like any other Relational Database Management System:
● Atomicity
● Consistency
● Isolation
● Durability
On the other hand, Cassandra has certain limitations to provide ACID transactions. Cassandra
can achieve consistency if data duplication is not allowed. But, that will kill Cassandra’s
availability. So, the systems that require ACID transactions must avoid NoSQL databases.
Procedure:
cqlsh> INSERT INTO employee (empid, firstname, lastname, gender)
VALUES ('1', 'FN', 'LN', 'M');
cqlsh>
SELECT TTL(name) FROM learn_cassandra.todo_by_user_email WHERE
user_email='[email protected]';
ttl(name)
43
(1 rows)
cqlsh>
SELECT * FROM learn_cassandra.todo_by_user_email WHERE
user_email='[email protected]';
(0 rows)
cqlsh>
INSERT INTO learn_cassandra.todo_by_user_email (user_email, creation_date, name)
VALUES('[email protected]', '2021-03-14 16:07:19.622+0000', 'Insert query');
cqlsh>
UPDATE learn_cassandra.todo_by_user_email SET
name = 'Update query'
WHERE user_email = '[email protected]' AND creation_date = '2021-03-
14 16:10:19.622+0000';
(2 rows)
Let’s only update if an entry already exists, by using IF EXISTS:
cqlsh>
UPDATE learn_cassandra.todo_by_user_email SET
name = 'Update query with LWT'
WHERE user_email = '[email protected]' AND creation_date = '2021-03-
14 16:07:19.622+0000' IF EXISTS;
[applied]
True
Experiment-08 (a): Using Power Pivot (Excel), perform Big Data Analytics on any data set
Aim: To perform big data analytics using Power Pivot in Excel
Theory: Power Pivot is an Excel add-in you can use to perform powerful data analysis and create sophisticated
data models. With Power Pivot, you can mash up large volumes of data from various sources, perform information
analysis rapidly, and share insights easily.
In both Excel and in Power Pivot, you can create a Data Model, a collection of tables with relationships. The data
model you see in a workbook in Excel is the same data model you see in the Power Pivot window. Any data you
import into Excel is available in Power Pivot, and vice versa.
Procedure:
Open Microsoft Excel, go to the Data menu and click Get Data.
Import the Twitter data set and click the Load To button.
Now open the Power Pivot window, click Diagram View and define the relationships between the tables.
Go to the Insert menu and click PivotTable.
Select the columns; you can then perform drill-down and roll-up operations using the pivot table.
(b) Big Data Charting
Aim: To create a variety of charts using Excel for the given data
Resources: Microsoft Excel
Theory:
When your data sets are big, you can use Excel Power Pivot, which can handle hundreds of millions of rows of data. The data can be in external data sources, and Excel Power Pivot builds a Data Model that works in a memory-optimized mode. You can perform calculations, analyze the data and arrive at a report to draw conclusions and decisions. The report can be either a Power PivotTable or a Power PivotChart, or a combination of both.
You can utilize Power Pivot as an ad hoc reporting and analytics solution. Thus, a person with hands-on Excel experience can perform high-end data analysis and decision making in a matter of minutes, which is a great asset for building dashboards.
Click the OK button. New worksheet gets created in Excel window and an empty Power
PivotTable appears.
As you can observe, the layout of the Power PivotTable is similar to that of PivotTable.
The PivotTable Fields list appears on the right side of the worksheet. Here, you will find some differences from an ordinary PivotTable. The Power PivotTable Fields list has two tabs − ACTIVE and ALL − that appear below the title and above the fields list. The ALL tab is highlighted; it displays all the data tables in the Data Model, while the ACTIVE tab displays only the data tables chosen for the Power PivotTable at hand.
● Click the table names in the PivotTable Fields list under ALL. The corresponding fields with check boxes will appear.
● Each table name will have a symbol on its left side.
● If you place the cursor on this symbol, the Data Source and the Model Table Name of that data table will be displayed.
As you can observe, all the tables in the data model are displayed in the PivotChart Fields list. The Field Buttons and the Legend can be hidden if they are not needed; note that the display of Field Buttons and/or Legend depends on the context of the PivotChart, and you need to decide what is required to be displayed.
As in the case of Power PivotTable, Power PivotChart Fields list also contains two tabs − ACTIVE and
ALL. Further, there are 4 areas −
● AXIS (Categories)
● LEGEND (Series)
● ∑ VALUES
● FILTERS
As you can observe, the Legend gets populated with the ∑ Values. Further, Field Buttons get added to the PivotChart for ease of filtering the data that is being displayed. You can click the arrow on a Field Button and select/deselect values to be displayed in the Power PivotChart.
You can have the following Table and Chart Combinations in Power Pivot.
● Chart and Table (Horizontal) − you can create a Power PivotChart and a Power PivotTable, one next to the other horizontally in the same worksheet.
● Chart and Table (Vertical) − you can create a Power PivotChart and a Power PivotTable, one below the other vertically in the same worksheet.
These combinations and some more are available in the dropdown list that appears when you click PivotTable on the Ribbon in the Power Pivot window.
Click on the PivotChart and you can develop a wide variety of charts.
Output:
Experiment-09: Use R-Project to carry out statistical analysis of big data
Procedure:
Step 2: wget -c https://download1.rstudio.org/desktop/jammy/amd64/rstudio-2022.07.2-576-amd64.deb
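Step 3: install the downloaded package, typically with a command along these lines (the exact install tool is an assumption):
sudo apt install ./rstudio-2022.07.2-576-amd64.deb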
Step 4: rstudio
This launches RStudio.
Procedure:
-->install.packages("gapminder")
-->library(gapminder)
-->data(gapminder)
output:
A tibble: 1,704 × 6
-->summary(gapminder)
summary(gapminder)
output:
(Other) 1632
-->x<-mean(gapminder$gdpPercap)
-->x
output:[1] 7215.327
-->attach(gapminder)
-->median(pop)
output:[1] 7023596
-->hist(lifeExp)
-->boxplot(lifeExp)
will plot the images shown below
-->plot(lifeExp ~ gdpPercap)
-->install.packages("dplyr")
-->gapminder %>%
+ filter(year == 2007) %>%
+ group_by(continent) %>%
+ summarise(lifeExp = median(lifeExp))
output:
# A tibble: 5 × 2
continent lifeExp
<fct> <dbl>
1 Africa 52.9
2 Americas 72.9
3 Asia 72.4
4 Europe 78.6
5 Oceania 80.7
-->install.packages("ggplot2")
--> library("ggplot2")
-->ggplot(gapminder, aes(x = continent, y = lifeExp))
+ geom_boxplot(outlier.colour = "hotpink") +
geom_jitter(position = position_jitter(width = 0.1, height = 0), alpha = 1/4)
output:
-->head(country_colors, 4)
output:
         Nigeria            Egypt         Ethiopia Congo, Dem. Rep.
       "#7F3B08"        "#833D07"        "#873F07"        "#8B4107"
-->head(continent_colors)
mtcars
summary(mtcars) prints the Min., 1st Qu., Median, Mean, 3rd Qu. and Max. for every column (mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb); for example, the carb column ranges from a minimum of 1.000 to a maximum of 8.000, with a median of 2.000, a mean of 2.812 and a third quartile of 4.000.
> Data_Cars <- mtcars
>
> max(Data_Cars$hp)
[1] 335
> min(Data_Cars$hp)
[1] 52
> Data_Cars <- mtcars
>
> which.max(Data_Cars$hp)
[1] 31
> which.min(Data_Cars$hp)
[1] 19
> Data_Cars <- mtcars
> rownames(Data_Cars)[which.max(Data_Cars$hp)]
[1] "Maserati Bora"
> rownames(Data_Cars)[which.min(Data_Cars$hp)]
[1] "Honda Civic"
median(Data_Cars$wt)
[1] 3.325
Data_Cars <- mtcars
names(sort(-table(Data_Cars$wt)))[1]
# Values of weight (in kg) used in the regression example below:
63, 81, 56, 91, 47, 57, 76, 72, 62, 48
lm() Function
This function creates the relationship model between the predictor and the response vari-
able.
Syntax
The basic syntax for lm() function in linear regression is −
lm(formula,data)
Following is the description of the parameters used −
● formula is a symbol presenting the relation between x and y.
● data is the vector (or data frame) on which the formula will be applied.
Create the Relationship Model and get the Coefficients
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y~x)
print(relation)
Result:
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
-38.4551 0.6746
print(summary(relation))
Result:
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-6.3002 -1.6629 0.0412 1.8944 3.9775
Coefficients:
Estimate Std. Error t value Pr(>|t|) (Intercept)
-38.45509 8.04901 -4.778 0.00139 **
x 0.67461 0.05191 12.997 1.16e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
predict() Function
Syntax
The basic syntax for predict() in linear regression is −
predict(object, newdata)
Following is the description of the parameters used −
● object is the formula which is already created using the lm() function.
● newdata is the vector containing the new value for predictor variable.
# Apply the lm() function.
relation <- lm(y~x)

# Find weight of a person with height 170.
a <- data.frame(x = 170)
result <- predict(relation, a)
print(result)

Result:
       1
76.22869

Visualize the Regression Graphically

# Create the predictor and response variable.
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y~x)

# Give the chart file a name.
png(file = "linearregression.png")

# Plot the chart.
plot(y, x, col = "blue", main = "Height & Weight Regression",
     abline(lm(x~y)), cex = 1.3, pch = 16,
     xlab = "Weight in Kg", ylab = "Height in cm")

# Close the graphics device so the chart is written to the file.
dev.off()
Bar Plot
There are two types of bar plots- horizontal and vertical which represent data points as
horizontal or vertical bars of certain lengths proportional to the value of the data item. They
are generally used for continuous and categorical variable plotting. By setting the
horiz parameter to true and false, we can get horizontal and vertical bar plots respectively.
Histogram
A histogram is like a bar chart as it uses bars of varying height to represent data distribution.
However, in a histogram values are grouped into consecutive intervals called bins. In a
Histogram, continuous values are grouped and displayed in these bins whose size can be varied.
For a histogram, the parameter xlim can be used to specify the interval within which all
values are to be displayed.
Another parameter, freq, when set to TRUE denotes the frequency (counts) of the various values in the histogram; when set to FALSE, probability densities are represented on the y-axis, such that the total area of the histogram adds up to one.
Histograms are used in the following scenarios:
● To verify an equal and symmetric distribution of the data.
Box Plot
The statistical summary of the given data is presented graphically using a boxplot. A boxplot
depicts information like the minimum and maximum data point, the median value, first and
third quartile, and interquartile range.
Box plots are used:
● To give a comprehensive statistical description of the data through a visual cue.
● To identify the outlier points that do not lie in the inter-quartile range of data.
Scatter Plot
A scatter plot is composed of many points on a Cartesian plane. Each point denotes the value
taken by two parameters and helps us easily identify the relationship between them.
Scatter Plots are used in the following scenarios:
● To show whether an association exists between bivariate data.
● To measure the strength and direction of such a relationship.
Heat Map
Heatmap is defined as a graphical representation of data using colors to visualize the value of
the matrix. heatmap() function is used to plot heatmap.
Syntax: heatmap(data)
Parameters: data: It represent matrix data, such as values of rows and columns
Return: This function draws a heatmap.
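A brief sketch showing each of these plot types in base R (the mtcars columns used here are chosen only for illustration):
# Bar plot: counts of cars per cylinder class; horiz = TRUE would flip it
barplot(table(mtcars$cyl), horiz = FALSE)
# Histogram: mileage grouped into bins, counts on the y-axis (freq = TRUE)
hist(mtcars$mpg, freq = TRUE, xlim = c(10, 35))
# Box plot: median, quartiles and outliers of mileage
boxplot(mtcars$mpg)
# Scatter plot: relationship between weight and mileage
plot(mtcars$wt, mtcars$mpg)
# Heat map: colour-codes the (column-scaled) data matrix
heatmap(as.matrix(mtcars), scale = "column")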
Experiment-10: Use R-Project for data visualization of social media data
Procedure:
Step 1: Facebook Developer Registration
Step 2: Click on Tools and install the required packages:
install.packages("Rfacebook")
install.packages("RColorBrewer")
install.packages("RCurl")
install.packages("rjson")
install.packages("httr")
library(Rfacebook)
library(httpuv)
library(RColorBrewer)
acess_token="EAATgfMOrIRoBAOR9XUl3VGzbLMuWGb9FqGkTK3PFBuRy
UVZA WAL7ZBw0xN3AijCsPiZBylucovck4YUhU昀欀
WLMZBo640k2ZAupKgsaKog9736lec
P8E52qkl5de8M963oKG8KOCVUXqqLiRcI7yIbEONeQt0eyLI6LdoeZA65Hy
xf8so1 UMbywAdZCZAQBpNiZAPPj7G3UX5jZAvUpRLZCQ5SIG"
options(RCurlOptions = list(verbose = FALSE, capath = system.file("CurlSSL", "cacert.pem", package = "RCurl"), ssl.verifypeer = FALSE))
me <- getUsers("me", token = acess_token)
View(me)
myFriends <- getFriends(acess_token, simplify = FALSE)
table(myFriends)
pie(table(myFriends$gender))
Output: