20dce017 Bda Pracfil


FACULTY OF TECHNOLOGY AND ENGINEERING

DEVANG PATEL INSTITUTE OF ADVANCE TECHNOLOGY AND RESEARCH
DEPARTMENT OF COMPUTER ENGINEERING

A.Y. 2023-24 [ODD]

LAB MANUAL

CE449: BIG DATA ANALYTICS


Semester: 7th
Academic Year: 2023-24
Subject Code: CE449
Subject Name: Big Data Analytics
Student Id: 20DCE017
Student Name: Raj Chauhan

PRACTICAL INDEX
Columns: Sr. No., Aim, Assigned Date, Completion Date, Grade, Assessment Date, Signature

1. To install Hadoop framework, configure it and set up a single-node cluster. Use web-based
   tools to monitor your Hadoop setup.
2. To implement file management tasks in Hadoop HDFS and perform Hadoop commands.
3. To implement basic functions and commands in R Programming. To build WordCloud, a text
   mining method using R for easy to understand and better visualization than a data table.
4. To implement a word count application using the MapReduce programming model.
   To implement a program that counts the occurrences of words based on their length.
5. A. To design and implement MapReduce algorithms to take a very large file of integers and
      produce as output:
      a) The largest integer.
      b) The average of all the integers.
      c) The same set of integers, but with each integer appearing only once.
      d) The count of the number of distinct integers in the input.
   B. To design an application to find mutual friends using MapReduce.
6. To implement basic CRUD operations (create, read, update, delete) in MongoDB and
   Cassandra.
7. To develop a MapReduce application and implement a program that analyzes weather data.
8. To install and run Hive. Use Hive to create, alter, and drop databases, tables, views,
   functions, and indexes. To create HDFS tables and load them in Hive and implement joining
   of tables in Hive.
9. To install and run Pig and then write Pig Latin scripts to sort, group, join, project, and
   filter your data.
10. To install, deploy & configure Apache Spark Cluster. To select the fields from the dataset
    using Spark SQL. To explore Spark shell and read from HDFS.
11. To perform Sentiment Analysis using Twitter data, Scala and Spark.
12. To perform Graph Path and Connectivity analytics and implement basic queries after
    loading data using Neo4j.
13. To perform case study of the following platforms for solving any big data analytic problem
    of your choice.
    (1) Amazon web services,
    (2) Microsoft Azure,
    (3) Google App engine

20DCE017 CE449 : Big Data Analytics 1


PRACTICAL 1
Aim: To install Hadoop framework, configure it and setup a single node
cluster. Use web-based tools to monitor your Hadoop setup.
Practical:
THEORY:
Hadoop:

• The Apache™ Hadoop® project develops open-source software for reliable, scalable,
distributed computing.
• The Apache Hadoop software library is a framework that allows for the distributed
processing of large data sets across clusters of computers using simple programming
models.
• It is designed to scale up from single servers to thousands of machines, each offering
local computation and storage. Rather than rely on hardware to deliver high
availability, the library itself is designed to detect and handle failures at the
application layer, so delivering a highly available service on top of a cluster of
computers, each of which may be prone to failures.

Installation:

• First, install Docker on the Windows machine from the official website.
• In parallel, install WSL 2 with Ubuntu from the Microsoft Store.
• After downloading the requirements above, run the installer to set up Docker on your
Windows machine. Then download the Hadoop Docker setup from GitHub using this link:
https://github.com/big-data-europe/docker-hadoop
• Once all the downloads are complete, run this command from the downloaded repository
directory:
o docker-compose up -d
• This automatically downloads all the requirements and creates the Docker containers.
• Then click on the NameNode CLI.
• Now open a browser and go to localhost:9870.
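
Besides the browser view, the NameNode web server also serves its metrics as JSON under /jmx, so the same setup can be checked from a script. The sketch below is only an illustration under stated assumptions: the NameNode is reachable at localhost:9870 as in this practical, and the attribute names read in summarize() are the ones the FSNamesystemState bean is expected to expose, which may differ between Hadoop versions.

```python
import json
from urllib.request import urlopen

# Assumption: the NameNode web UI from this practical is reachable at this address.
JMX_URL = "http://localhost:9870/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState"


def fetch_jmx(url=JMX_URL, timeout=5):
    """Download and decode the JSON document served by the /jmx endpoint."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def summarize(jmx_payload):
    """Pick a few health indicators out of a decoded /jmx response."""
    beans = jmx_payload.get("beans", [])
    state = beans[0] if beans else {}
    # Attribute names are assumptions; missing ones simply come back as None.
    return {
        "live_datanodes": state.get("NumLiveDataNodes"),
        "dead_datanodes": state.get("NumDeadDataNodes"),
        "capacity_used_bytes": state.get("CapacityUsed"),
        "capacity_total_bytes": state.get("CapacityTotal"),
    }

# Usage against a running cluster (not executed here):
#   print(summarize(fetch_jmx()))
```

Keeping the network call in fetch_jmx() and the parsing in summarize() means the parsing can be tried on a saved response without a running cluster.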


CONCLUSION:
In this practical, we learned how to install Hadoop as a single-node cluster using Docker and
how to open its web interface at localhost:9870.


PRACTICAL 2
Aim: To implement file management tasks in Hadoop HDFS and perform
Hadoop commands.
Practical:
There are many more commands in "$HADOOP_HOME/bin/hadoop fs" than are demonstrated
here, although these basic operations will get you started. Running
$HADOOP_HOME/bin/hadoop fs with no additional arguments lists all the commands that can
be run with the FsShell system. Furthermore, if you are stuck,
$HADOOP_HOME/bin/hadoop fs -help commandName displays a short usage summary for the
operation in question.
• Start Docker and run the Hadoop containers in it.
• Open the NameNode CLI from Docker and navigate to localhost:9870 in a browser.

• The CLI shown here is used to run commands on the NameNode.

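Commands :-
1.) ls :-
Lists the files and directories present at a path in the Hadoop file system
(hadoop fs -ls <path>).

2.) mkdir :-
Creates a new directory (hadoop fs -mkdir <path>).

3.) touchz :-
Creates an empty, zero-length file (hadoop fs -touchz <path>).

4.) hadoop version :-
Prints the installed Hadoop version.

5.) find :-
Searches for files that match an expression (hadoop fs -find <path> -name <pattern>).

6.) copyToLocal :-
Copies a file from HDFS to the local file system (hadoop fs -copyToLocal <src> <localdst>).

7.) rmdir :-
Removes an empty directory (hadoop fs -rmdir <path>).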
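
The commands listed above can also be assembled and run from a script. The following is a minimal sketch, not part of the original practical: the helper names hdfs_command and run are hypothetical, and the paths are invented for illustration. It only prints the command lines; run() would execute them on a machine where the hadoop binary is on the PATH, for example inside the NameNode container.

```python
import shlex
import subprocess


def hdfs_command(operation, *args):
    """Build the argument list for `hadoop fs -<operation> <args...>`."""
    return ["hadoop", "fs", f"-{operation}", *args]


def run(argv):
    """Run a command where the hadoop binary is available and return its stdout."""
    completed = subprocess.run(argv, check=True, capture_output=True, text=True)
    return completed.stdout

# Hypothetical paths, chosen only for illustration.
session = [
    hdfs_command("mkdir", "-p", "/user/demo"),
    hdfs_command("touchz", "/user/demo/empty.txt"),
    hdfs_command("ls", "/user/demo"),
    hdfs_command("find", "/user", "-name", "*.txt"),
    hdfs_command("copyToLocal", "/user/demo/empty.txt", "/tmp/empty.txt"),
    hdfs_command("rm", "/user/demo/empty.txt"),
    hdfs_command("rmdir", "/user/demo"),
]

# Print the command lines in the order they would be executed.
for argv in session:
    print(shlex.join(argv))
```

The file is removed before the directory because rmdir only deletes empty directories.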

CONCLUSION:
In this practical, we ran various basic Hadoop commands to create, find, copy, and remove
files and directories in HDFS.


PRACTICAL 3
AIM: To implement basic functions and commands in R Programming. To
build WordCloud, a text mining method using R for easy to understand and
better visualization than a data table.
CODE:
install.packages("tm")
install.packages("SnowballC")
install.packages("wordcloud")
install.packages("RColorBrewer")
library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")

# To choose the text file
text = readLines(file.choose())

# VectorSource() creates a corpus of character vectors
docs = Corpus(VectorSource(text))

# Text transformation: replace a pattern with a space
toSpace = content_transformer(function(x, pattern) gsub(pattern, " ", x))
# Chain the transformations on docs1 so that each step keeps the previous result
docs1 = tm_map(docs, toSpace, "/")
docs1 = tm_map(docs1, toSpace, "@")
docs1 = tm_map(docs1, toSpace, "#")
strwrap(docs1)

# Cleaning the text
docs1 = tm_map(docs1, content_transformer(tolower))
docs1 = tm_map(docs1, removeNumbers)
docs1 = tm_map(docs1, stripWhitespace)

# Build a term-document matrix from the cleaned corpus
dtm = TermDocumentMatrix(docs1)
m = as.matrix(dtm)
v = sort(rowSums(m), decreasing = TRUE)
d = data.frame(word = names(v), freq = v)
head(d, 10)

# Generate the word cloud
wordcloud(words = d$word, freq = d$freq, min.freq = 1, max.words = 200,
          random.order = FALSE, rot.per = 0.35, colors = brewer.pal(8, "Dark2"))
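
The word cloud is driven by a word/frequency table (the data frame d in the R script). As a rough illustration of how that table comes about, here is a small Python sketch of the same cleaning and counting steps. It is an approximation under stated assumptions: it splits on whitespace and does not reproduce the tokenizer defaults of the tm package, and the sample lines are invented.

```python
import re
from collections import Counter


def word_frequencies(lines):
    """Clean the text roughly as the R script does and count each word."""
    counts = Counter()
    for line in lines:
        line = re.sub(r"[/@#]", " ", line)   # toSpace for "/", "@" and "#"
        line = line.lower()                   # tolower
        line = re.sub(r"\d+", "", line)       # removeNumbers
        counts.update(line.split())           # split() also collapses extra whitespace
    return counts

# Invented sample text, only for illustration.
sample = ["Big data / big clusters", "Data #hadoop 2023"]
for word, freq in word_frequencies(sample).most_common(3):
    print(word, freq)
```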

OUTPUT :-

CONCLUSION:

In this practical, we learnt about R and implemented wordCloud using R.



PRACTICAL 4
AIM: To Implement a Word Count Application using MapReduce API.
THEORY:

o MapReduce is a programming paradigm that enables massive scalability across hundreds or
thousands of servers in a Hadoop cluster. As the processing component, MapReduce is the
heart of Apache Hadoop. The term "MapReduce" refers to two separate and distinct tasks that
Hadoop programs perform. The first is the map job, which takes a set of data and converts
it into another set of data, where individual elements are broken down into tuples
(key/value pairs).

o The reduce job takes the output from a map as input and combines those data tuples
into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce
job is always performed after the map job.

o MapReduce programming offers several benefits to help you gain valuable insights
from your big data. One of them is scalability: businesses can process petabytes of data
stored in the Hadoop Distributed File System (HDFS).
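
The map, shuffle, and reduce steps described above can be tried out in memory before writing any Hadoop code. The sketch below is a single-process illustration only; the function names are hypothetical and nothing here uses the Hadoop API.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit (word, 1) for every whitespace-separated token."""
    for token in line.split():
        yield token, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one key."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce over all input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

print(word_count(["deer bear river", "car car river"]))
```

On a real cluster the mappers run in parallel on different input splits and the shuffle moves data between machines; the data flow per key is the same.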

CODE:
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every token in the input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer (also used as combiner): sums the counts of each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver: args[0] is the input path, args[1] is the output path.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
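
The practical index also asks for a program that counts words by their length. One way to get there, sketched below in Python rather than with the Hadoop API, is to change only what the mapper emits: the key becomes the length of the word instead of the word itself, while the summing step stays the same. The function names are hypothetical.

```python
from collections import Counter


def length_mapper(line):
    """Mapper variant: emit (word length, 1) instead of (word, 1)."""
    for token in line.split():
        yield len(token), 1


def count_by_length(lines):
    """Sum the emitted ones per key; stands in for the shuffle and the sum reducer."""
    totals = Counter()
    for line in lines:
        for length, one in length_mapper(line):
            totals[length] += one
    return dict(totals)

print(count_by_length(["deer bear river", "car car river"]))
```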
OUTPUT:
o Create a directory on the local system and create two directories inside it called Classes
and Input. Also create an input.txt file in the Input directory and write the words we
want to count in it.

o Now compile the Java code using OpenJDK, which we installed previously, and create
a JAR file from it.

o Now run the JAR file on the Hadoop file system using the hadoop jar command; the final
output can then be seen.

CONCLUSION:
In this practical, we learnt about the MapReduce paradigm in detail and executed a
simple word count program in MapReduce using Java.


PRACTICAL 5
AIM:
A. To design and implement MapReduce algorithms to take a very large file of integers and
produce as output:
a) The largest integer.
b) The average of all the integers.
c) The same set of integers, but with each integer appearing only once.
d) The count of the number of distinct integers in the input.
B. To design an application to find mutual friends using MapReduce.
CODE:
//Reducer.java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver class: configures and submits the job.
public class Reducer {
    public static void main(String[] args) throws IllegalStateException, IOException,
            ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Practical 5");
        job.setJarByClass(Reducer.class);
        job.setMapperClass(BDA_Mapper.class);
        job.setReducerClass(BDA_Reducer.class);
        // The totals written in cleanup() are global only with a single reducer.
        job.setNumReduceTasks(1);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        // Must match what BDA_Reducer writes: Text keys and LongWritable values.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

//BDA_Mapper.java
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class BDA_Mapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    // Occurrences of each distinct number seen by this mapper.
    private TreeMap<String, Integer> tmap;

    @Override
    public void setup(Context context) throws IOException, InterruptedException {
        tmap = new TreeMap<String, Integer>();
    }

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String number = value.toString().trim();
        if (number.isEmpty()) {
            return; // skip blank lines so the reducer never parses an empty string
        }
        tmap.merge(number, 1, Integer::sum);
    }

    @Override
    public void cleanup(Context context) throws IOException, InterruptedException {
        // Emit one (number, count) pair per distinct number.
        for (Map.Entry<String, Integer> entry : tmap.entrySet()) {
            context.write(new Text(entry.getKey()), new LongWritable(entry.getValue()));
        }
    }
}

//BDA_Reducer.java
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class BDA_Reducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    private TreeMap<String, Long> tmap2;
    private long max = Long.MIN_VALUE, unique = 0, cnt = 0;
    private long sum = 0;

    @Override
    public void setup(Context context) throws IOException, InterruptedException {
        tmap2 = new TreeMap<String, Long>();
    }

    @Override
    public void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        String number = key.toString().trim();
        long n = Long.parseLong(number);
        long count = 0;
        for (LongWritable val : values) {
            count += val.get();
        }
        sum += count * n; // long arithmetic avoids int overflow on large inputs
        tmap2.put(number, count);
        cnt += count;
        if (max < n) {
            max = n;
        }
        unique++;
    }

    @Override
    public void cleanup(Context context) throws IOException, InterruptedException {
        // Each distinct integer once, with its number of occurrences.
        for (Map.Entry<String, Long> entry : tmap2.entrySet()) {
            context.write(new Text(entry.getKey()), new LongWritable(entry.getValue()));
        }
        context.write(new Text("MAX NUMBER = "), new LongWritable(max));
        // Integer division: the average is truncated to a whole number.
        context.write(new Text("AVERAGE = "), new LongWritable(cnt == 0 ? 0 : sum / cnt));
        context.write(new Text("Total Unique Numbers = "), new LongWritable(unique));
    }
}
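
The data flow of the job above (count each integer per mapper, then merge the counts and derive the four results in a single reducer) can be checked on small inputs with a short in-memory sketch. The function names and the sample splits are hypothetical; unlike the Java reducer, this version reports the average as a floating-point value.

```python
from collections import Counter


def map_partition(lines):
    """In-mapper combining: count each integer within one input split."""
    counts = Counter()
    for line in lines:
        text = line.strip()
        if text:                      # ignore blank lines
            counts[int(text)] += 1
    return counts


def reduce_all(partials):
    """Merge per-split counts and derive the four requested outputs."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)
    total = sum(value * count for value, count in merged.items())
    n = sum(merged.values())
    return {
        "largest": max(merged),
        "average": total / n,
        "distinct": sorted(merged),
        "distinct_count": len(merged),
    }

# Two invented input splits; the sketch assumes at least one integer overall.
splits = [["3", "7", "3"], ["10", "7", ""]]
print(reduce_all(map_partition(split) for split in splits))
```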

OUTPUT:
• In Eclipse, right-click MyProject, select Build Path -> Configure Build Path, click Add
External JARs..., add the JARs from their download location, then click Apply and Close.
• Now export the project as a JAR file. Right-click MyProject, choose Export..., go to
Java -> JAR file, click Next, choose your export destination, and click Next again.
Choose the Main Class by clicking Browse, then click Finish -> OK.
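
Part B of the aim (mutual friends) has no code in this practical. A common way to express it in MapReduce terms is sketched below in plain Python as an assumption, not as the author's solution: the mapper emits each friendship pair as a sorted key together with the owner's friend list, and the reducer intersects the lists that arrive for the same pair. The sample data is hypothetical and assumes friendships are listed in both directions.

```python
from collections import defaultdict


def map_friends(person, friends):
    """Emit one record per friendship pair, keyed by the sorted pair."""
    for friend in friends:
        yield tuple(sorted((person, friend))), set(friends)


def reduce_mutual(friend_sets):
    """Intersect the friend lists contributed by both members of a pair."""
    return set.intersection(*friend_sets)


def mutual_friends(adjacency):
    """Run the map step, group by pair (the shuffle), then reduce each group."""
    grouped = defaultdict(list)
    for person, friends in adjacency.items():
        for pair, friend_set in map_friends(person, friends):
            grouped[pair].append(friend_set)
    return {pair: reduce_mutual(sets) for pair, sets in grouped.items()}

# Hypothetical friendship data, assumed to be symmetric.
graph = {"A": ["B", "C", "D"], "B": ["A", "C", "D"], "C": ["A", "B"], "D": ["A", "B"]}
result = mutual_friends(graph)
print(result[("A", "B")])
```

Sorting the pair makes (A, B) and (B, A) land on the same key, which is what brings both friend lists to one reducer call.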

CONCLUSION:
In this practical, we learnt about the MapReduce paradigm in detail and executed simple
operations on a large file of integers in MapReduce using Java.

PRACTICAL 6
AIM: To implement basic CRUD operations (create, read, update, delete) in
MongoDB and Cassandra.
CODE:
Cassandra

• sudo docker network create cassandra-network
• sudo docker network ls
• sudo docker run -p 9042:9042 --rm -it -d -e CASSANDRA_PASSWORD=temp --network cassandra-network cassandra
• sudo docker ps
• sudo docker exec -it 05f99656aef3 bash
• cqlsh -u cassandra -p temp
• CREATE KEYSPACE IF NOT EXISTS charusat_db WITH REPLICATION = {'class':'NetworkTopologyStrategy','datacenter1':3};
• describe charusat_db;
• use charusat_db;
• CREATE TABLE depstar(id int PRIMARY KEY, firstname text, lastname text, email text);
• select * from depstar;
• INSERT INTO depstar(id,firstname,lastname,email) VALUES(1,'abc','xyz','[email protected]');
• INSERT INTO depstar(id,firstname,lastname,email) VALUES(2,'def','xyz','[email protected]');
• update depstar set firstname='temp' where id=1;
• delete from depstar where id=1;

MongoDB

• sudo docker run -p 27017:27017 -d -it --network cassandra-network --rm -e MONGO_INITDB_ROOT_USERNAME=root -e MONGO_INITDB_ROOT_PASSWORD=temp mongo:4.4.6
• sudo docker ps
• sudo docker exec -it 2e36ee901ed7 bash
• mongo -u root -p temp
• use depstar;
• db;
• db.newCollection.insertOne({_id:1,firstname:"abc",lastname:"xyz",email:"19dce000@charusat.edu.in"});
• db.newCollection.insertOne({_id:2,firstname:"def",lastname:"xyz",email:"19dce999@charusat.edu.in"});
• db.newCollection.find({})
• db.newCollection.updateOne({_id:1},{$set:{"firstname":"temp"}});
• db.newCollection.deleteOne({_id:1})
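
To make the four CRUD operations concrete without a running database, the sketch below models a collection keyed by _id in memory and replays the MongoDB session above. It is a toy stand-in, not a database driver: the Collection class and its method names are hypothetical. One difference worth keeping in mind is that a Cassandra INSERT on an existing primary key overwrites the row, whereas this model, like MongoDB's insertOne, rejects a duplicate _id.

```python
class Collection:
    """Tiny in-memory stand-in for a collection keyed by _id (not a real driver)."""

    def __init__(self):
        self._docs = {}

    def insert_one(self, doc):
        """Create: store a copy of the document, rejecting duplicate _id values."""
        if doc["_id"] in self._docs:
            raise KeyError(f"duplicate _id {doc['_id']}")
        self._docs[doc["_id"]] = dict(doc)

    def find(self, query=None):
        """Read: return copies of all documents whose fields match the query."""
        query = query or {}
        return [dict(d) for d in self._docs.values()
                if all(d.get(k) == v for k, v in query.items())]

    def update_one(self, query, changes):
        """Update: apply the changes to the first matching document."""
        for doc in self._docs.values():
            if all(doc.get(k) == v for k, v in query.items()):
                doc.update(changes)   # plays the role of {$set: changes}
                return 1
        return 0

    def delete_one(self, query):
        """Delete: remove the first matching document."""
        for key, doc in list(self._docs.items()):
            if all(doc.get(k) == v for k, v in query.items()):
                del self._docs[key]
                return 1
        return 0

students = Collection()
students.insert_one({"_id": 1, "firstname": "abc", "lastname": "xyz"})
students.insert_one({"_id": 2, "firstname": "def", "lastname": "xyz"})
students.update_one({"_id": 1}, {"firstname": "temp"})
students.delete_one({"_id": 1})
print(students.find())
```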
OUTPUT:

CONCLUSION:
In this practical, we learnt about the CRUD operations in MongoDB and Cassandra.



PRACTICAL 7
AIM: To develop a MapReduce application and implement a program that
analyzes weather data.
CODE:
// importing Libraries
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class MyMaxMin {

    // Mapper
    public static class MaxTemperatureMapper extends Mapper<LongWritable, Text, Text, Text> {
        public static final int MISSING = 9999;

        @Override
        public void map(LongWritable arg0, Text Value, Context context)
                throws IOException, InterruptedException {
            String line = Value.toString();
            if (!(line.length() == 0)) {
                // Fixed-width record: date and the two temperatures sit at fixed offsets.
                String date = line.substring(6, 14);
                float temp_Max = Float.parseFloat(line.substring(39, 45).trim());
                float temp_Min = Float.parseFloat(line.substring(47, 53).trim());
                if (temp_Max > 30.0) {
                    context.write(new Text("The Day is Hot Day :" + date),
                            new Text(String.valueOf(temp_Max)));
                }
                if (temp_Min < 15) {
                    context.write(new Text("The Day is Cold Day :" + date),
                            new Text(String.valueOf(temp_Min)));
                }
            }
        }
    }

    // Reducer
    public static class MaxTemperatureReducer extends Reducer<Text, Text, Text, Text> {
        // The values parameter must be an Iterable for this method to override reduce().
        @Override
        public void reduce(Text Key, Iterable<Text> Values, Context context)
                throws IOException, InterruptedException {
            String temperature = Values.iterator().next().toString();
            context.write(Key, new Text(temperature));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "weather example");
        job.setJarByClass(MyMaxMin.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setMapperClass(MaxTemperatureMapper.class);
        job.setReducerClass(MaxTemperatureReducer.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        Path OutputPath = new Path(args[1]);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, OutputPath);
        // Delete any previous output directory so that the job can be rerun.
        OutputPath.getFileSystem(conf).delete(OutputPath, true);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
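
The mapper logic depends entirely on the column offsets of the fixed-width weather records, so it helps to try it on a synthetic line first. The sketch below mirrors the mapper's substring offsets and thresholds in Python. The helper make_line is hypothetical and only fabricates a test record with fields at those offsets; it does not describe the real dataset's full layout.

```python
def classify(line, hot_above=30.0, cold_below=15.0):
    """Return the (label, temperature) records the mapper would emit for one line."""
    if not line:
        return []
    date = line[6:14]
    temp_max = float(line[39:45].strip())
    temp_min = float(line[47:53].strip())
    records = []
    if temp_max > hot_above:
        records.append(("The Day is Hot Day :" + date, temp_max))
    if temp_min < cold_below:
        records.append(("The Day is Cold Day :" + date, temp_min))
    return records


def make_line(date, temp_max, temp_min):
    """Build a synthetic fixed-width record with fields at the offsets used above."""
    chars = [" "] * 53
    chars[6:14] = date                   # 8-character date such as 20230615
    chars[39:45] = f"{temp_max:6.1f}"    # maximum temperature, 6 characters wide
    chars[47:53] = f"{temp_min:6.1f}"    # minimum temperature, 6 characters wide
    return "".join(chars)

print(classify(make_line("20230615", 35.2, 12.0)))
```

A single day can produce both a hot and a cold record, because the two conditions are checked independently.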

OUTPUT:
• In Eclipse, right-click MyProject, select Build Path -> Configure Build Path, click Add
External JARs..., add the JARs from their download location, then click Apply and Close.
• Now export the project as a JAR file. Right-click MyProject, choose Export..., go to
Java -> JAR file, click Next, choose your export destination, and click Next again.
Choose the Main Class as MyMaxMin by clicking Browse, then click Finish -> OK.

CONCLUSION:
In this practical, we learnt about the MapReduce paradigm in detail and executed a
weather data analysis program in MapReduce using Java.


PRACTICAL 8
AIM: To Install and Run Hive. Use Hive to create, alter, and drop databases,
tables, views, functions, and indexes. To create HDFS tables and load them
in Hive and implement joining of tables in Hive.
CODE:
//hive-site.xml
<property>
<name>system:java.io.tmpdir</name>
<value>/tmp/hive/java</value>
</property>
<property>
<name>system:user.name</name>
<value>${user.name}</value>
</property>
• Hive database
1) CREATE DATABASE IF NOT EXISTS depstar;
2) ALTER DATABASE depstar SET OWNER USER hadoopuser;
3) DROP DATABASE IF EXISTS depstar;
• Hive view
1) CREATE VIEW std_id_3 AS SELECT * FROM students WHERE id=3;
2) ALTER VIEW std_id_3 AS SELECT * FROM students WHERE id>1;
3) DROP VIEW std_id_3;
• Hive table
1) CREATE TABLE IF NOT EXISTS students ( id int, firstname String, lastname
String, email String) COMMENT 'Student details' ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' STORED AS
TEXTFILE;
2) ALTER TABLE students RENAME TO std;
3) DROP TABLE IF EXISTS std;
• Join on table
1) CREATE TABLE IF NOT EXISTS students ( id int, firstname String, lastname
String, email String) COMMENT 'Student details' ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' STORED AS
TEXTFILE;
2) CREATE TABLE IF NOT EXISTS dept ( id int, dept String) COMMENT 'Student
dept details' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n' STORED AS TEXTFILE;
3) LOAD DATA INPATH '/hive_data/hive_1.txt' INTO TABLE students;
LOAD DATA INPATH '/hive_data/hive_2.txt' INTO TABLE dept;
4) SELECT s.id, s.firstname, s.lastname, d.dept, s.email FROM students s JOIN dept d ON
s.id = d.id;
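
The view and join statements can be tried without a Hive installation by running equivalent SQL against an in-memory SQLite database. This is a stand-in under stated assumptions: SQLite is not Hive, the Hive-specific clauses (ROW FORMAT, STORED AS, LOAD DATA) are replaced by plain INSERT statements, and the rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE IF NOT EXISTS students (id INTEGER, firstname TEXT, lastname TEXT, email TEXT);
    CREATE TABLE IF NOT EXISTS dept (id INTEGER, dept TEXT);
    -- Invented rows standing in for the files loaded with LOAD DATA INPATH.
    INSERT INTO students VALUES (1, 'abc', 'xyz', 'abc@example.com'),
                                (2, 'def', 'xyz', 'def@example.com'),
                                (3, 'ghi', 'xyz', 'ghi@example.com');
    INSERT INTO dept VALUES (1, 'ce'), (3, 'it');
    CREATE VIEW std_id_3 AS SELECT * FROM students WHERE id = 3;
""")

# Inner join: only students that have a matching dept row are returned.
joined = conn.execute("""
    SELECT s.id, s.firstname, s.lastname, d.dept, s.email
    FROM students s JOIN dept d ON s.id = d.id
    ORDER BY s.id
""").fetchall()
view_rows = conn.execute("SELECT firstname FROM std_id_3").fetchall()
print(joined)
print(view_rows)
```

Student 2 has no row in dept, so the inner join drops it.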

OUTPUT:

CONCLUSION:
In this practical, we learnt about Hive and used it to create, alter, and drop databases,
tables, and views, and to join tables.


PRACTICAL 9
AIM: To install and run Pig and then write Pig Latin scripts to sort, group,
join, project, and filter your data.
CODE:
• hdfs dfs -put dept.txt /hadoopuser/dept.txt
• hdfs dfs -ls /hadoopuser
• hdfs dfs -cat /hadoopuser/dept.txt
1) Sort of data
• student_details = LOAD 'hdfs://localhost:9000/hadoopuser/tp.txt' USING
PigStorage(',') as
(id:int,firstname:chararray,lastname:chararray,age:int,email:chararray);
• student_grp_details = LOAD 'hdfs://localhost:9000/hadoopuser/dept.txt' USING
PigStorage(',') as (id:int,dept:chararray);
• order_by = ORDER student_details BY age ASC;
• dump order_by;
2) Group of data
• student_details_with_grp = LOAD 'hdfs://localhost:9000/hadoopuser/temp_grp1.txt'
USING PigStorage(',') as
(id:int,firstname:chararray,lastname:chararray,age:int,dept:chararray,email:chararray);
• group_by = GROUP student_details_with_grp BY dept;
• dump group_by;
3) Join of data
• join_data = JOIN student_details BY id, student_grp_details BY id;
• dump join_data;
4) Filter data
• filter_table = FILTER student_details_with_grp BY dept=='ce';
• dump filter_table;
5) Project data
• data = FOREACH student_details GENERATE firstname,lastname;
• dump data;
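
What each of the five Pig Latin operators does to the rows can be seen on a few in-memory records. The sketch below uses invented rows in place of the input files, and for brevity it groups and filters the joined rows instead of loading a separate relation as the script does. It illustrates the operators' semantics and is not a way to run Pig.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows standing in for the student and department files.
students = [
    {"id": 1, "firstname": "abc", "lastname": "xyz", "age": 22},
    {"id": 2, "firstname": "def", "lastname": "uvw", "age": 20},
    {"id": 3, "firstname": "ghi", "lastname": "rst", "age": 21},
]
depts = [{"id": 1, "dept": "ce"}, {"id": 2, "dept": "it"}, {"id": 3, "dept": "ce"}]

# ORDER student_details BY age ASC
order_by = sorted(students, key=itemgetter("age"))

# JOIN student_details BY id, student_grp_details BY id
dept_by_id = {row["id"]: row["dept"] for row in depts}
join_data = [{**s, "dept": dept_by_id[s["id"]]} for s in students if s["id"] in dept_by_id]

# GROUP ... BY dept (groupby needs its input sorted by the grouping key)
by_dept = sorted(join_data, key=itemgetter("dept"))
group_by = {dept: [row["id"] for row in rows]
            for dept, rows in groupby(by_dept, key=itemgetter("dept"))}

# FILTER ... BY dept == 'ce'
filter_table = [row for row in join_data if row["dept"] == "ce"]

# FOREACH ... GENERATE firstname, lastname
data = [(s["firstname"], s["lastname"]) for s in students]

print(group_by)
```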
OUTPUT:



1) Sort of data

2) Group of data

3) Join of Data

4) Filter data

5) Project data

CONCLUSION:
In this practical, we learnt to write Pig Latin scripts to sort, group, join, project, and
filter data.


PRACTICAL 10
AIM: To install, deploy & configure Apache Spark Cluster. To Select the
fields from the dataset using Spark SQL. To explore Spark shell and read
from HDFS
CODE:
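// Read the CSV file, treating the first line as the header
val data = spark.read.format("csv").option("header", "true").load("/home/hadoopuser/Chicago.csv")
// Select three fields; show() only prints, so keep the DataFrame in df1
val df1 = data.select("ID", "Case Number", "Description")
df1.show()
// Read a text file from HDFS
val ds = spark.read.text("hdfs://localhost:9000/hadoopuser/temp.txt")
ds.count
ds.show()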
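
The select step is a column projection over rows that were read with a header line. The sketch below shows that idea with Python's csv module on a few invented rows; only the three column names come from the Spark example, and the helper select is hypothetical. It illustrates the operation and does not use Spark.

```python
import csv
import io

# Hypothetical rows; only the three selected column names come from the Spark example.
raw = io.StringIO(
    "ID,Case Number,Description,Arrest\n"
    "101,HX1001,SIMPLE,false\n"
    "102,HX1002,RETAIL THEFT,true\n"
)


def select(rows, *columns):
    """Keep only the named columns of each row, in the spirit of DataFrame.select."""
    return [{name: row[name] for name in columns} for row in rows]

rows = list(csv.DictReader(raw))      # the first line supplies the column names
projected = select(rows, "ID", "Case Number", "Description")
print(len(rows))                      # analogous to count
for row in projected:                 # analogous to show()
    print(row)
```

All values stay strings here, which matches Spark's csv reader when no schema inference is requested.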

OUTPUT:

CONCLUSION:
In this practical, we learnt about the Apache Spark cluster, selected fields from a dataset
using Spark SQL, and explored the Spark shell by reading a file from HDFS.


PRACTICAL 11

AIM: To perform Sentiment Analysis using Twitter data, Scala and Spark.

CODE:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
from scipy.special import softmax

tweet = 'Great content! subscribed'

# preprocess tweet: replace mentions and links with placeholders
tweet_words = []

for word in tweet.split(' '):
    if word.startswith('@') and len(word) > 1:
        word = '@user'
    elif word.startswith('http'):
        word = "http"
    tweet_words.append(word)

tweet_proc = " ".join(tweet_words)

# load model and tokenizer
roberta = "cardiffnlp/twitter-roberta-base-sentiment"

model = AutoModelForSequenceClassification.from_pretrained(roberta)
tokenizer = AutoTokenizer.from_pretrained(roberta)

labels = ['Negative', 'Neutral', 'Positive']

# sentiment analysis
encoded_tweet = tokenizer(tweet_proc, return_tensors='pt')
# output = model(encoded_tweet['input_ids'], encoded_tweet['attention_mask'])
output = model(**encoded_tweet)

# convert the raw scores (logits) into probabilities
scores = output[0][0].detach().numpy()
scores = softmax(scores)

# print each label with its probability
for i in range(len(scores)):
    l = labels[i]
    s = scores[i]
    print(l, s)
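
Two steps of the script can be examined without downloading the model: the tweet preprocessing and the softmax that turns raw scores into probabilities. The sketch below reimplements both with the standard library only. The logits are made-up numbers, not real model output, and the function names are hypothetical.

```python
import math


def preprocess(tweet):
    """Replace @mentions and links with placeholder tokens, as in the script above."""
    words = []
    for word in tweet.split(" "):
        if word.startswith("@") and len(word) > 1:
            word = "@user"
        elif word.startswith("http"):
            word = "http"
        words.append(word)
    return " ".join(words)


def softmax(logits):
    """Turn raw model scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]   # subtracting the max avoids overflow
    total = sum(exps)
    return [e / total for e in exps]

labels = ["Negative", "Neutral", "Positive"]
logits = [-1.0, 0.5, 2.0]   # made-up scores, not real model output
probs = softmax(logits)
best = labels[probs.index(max(probs))]
print(preprocess("@bob thanks https://example.com"), best)
```

The label with the highest probability is taken as the predicted sentiment.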

OUTPUT:

CONCLUSION: From this practical, I learned how to perform sentiment analysis on tweet text;
the code shown here uses a pretrained RoBERTa model in Python.


PRACTICAL 12

AIM: To perform Graph Path and Connectivity analytics and implement basic queries after loading
data using Neo4j

CODE:

1. Install Neo4j:

2. Creating Project:

3. Create Data Cyphers:

4. Querying For Nodes:

• All Nodes:

• All nodes with specific label:

• All nodes with priorities:

• Nodes where name is LeBron James:

• Nodes where name is not LeBron James:

5. Querying Relationship:

• Get all Lakers Players:

• Get Players and number of games played:

6. Graph Path Analysis:

• The Shortest path between two players:

7. Connectivity Analysis:

• Identify players with most teammates:
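
The two analytics steps listed above (shortest path between two players, and the player with the most teammates) rest on standard graph techniques. The sketch below shows them in plain Python on a tiny adjacency map. It is an assumption-laden illustration: the relationships are a small sample made up for the example, not the dataset loaded in the practical, and nothing here talks to Neo4j.

```python
from collections import deque

# Small illustrative sample of teammate links, not the practical's dataset.
teammates = {
    "LeBron James": {"Anthony Davis", "Russell Westbrook"},
    "Anthony Davis": {"LeBron James"},
    "Russell Westbrook": {"LeBron James", "Kevin Durant"},
    "Kevin Durant": {"Russell Westbrook"},
}


def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest path as a list of nodes, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbour in sorted(graph.get(path[-1], ())):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


def most_connected(graph):
    """Return a node with the largest number of direct neighbours."""
    return max(graph, key=lambda node: len(graph[node]))

print(shortest_path(teammates, "Anthony Davis", "Kevin Durant"))
print(most_connected(teammates))
```

Breadth-first search visits nodes in order of distance from the start, which is why the first path that reaches the goal is a shortest one.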


CONCLUSION: From this practical, I learned how to perform Graph Path and Connectivity
analytics and implement basic queries after loading data using Neo4j.


PRACTICAL 13

AIM: To perform case study of the following platforms for solving any big data analytic problem of
your choice.
(1) Amazon web services,
(2) Microsoft Azure,
(3) Google App engine

Problem Statement:
Imagine a global e-commerce company, "EcomCorp," wants to optimize its product recommendation
engine to increase customer engagement and sales. They have terabytes of customer data, including
purchase history, browsing behavior, and demographics. EcomCorp aims to leverage big data
analytics to provide personalized product recommendations to its customers in real-time.

To address this challenge, EcomCorp will evaluate three major cloud platforms for their big data
analytics capabilities: Amazon Web Services (AWS), Microsoft Azure, and Google App Engine. We
will analyze each platform's strengths and weaknesses in the context of this use case.

Amazon Web Services (AWS):

AWS offers a comprehensive suite of services for big data analytics. EcomCorp can use the
following AWS services to address their problem:

a. Amazon S3 (Simple Storage Service): Store large volumes of customer data in S3 buckets,
making it highly durable and scalable.

b. Amazon EMR (Elastic MapReduce): Deploy EMR clusters to process and analyze data using
popular tools like Apache Spark, Hadoop, and Presto. This allows EcomCorp to extract valuable
insights from their data.

c. Amazon Redshift: Utilize Redshift as a data warehousing solution to store and query large
datasets for analytics and reporting.

d. AWS Lambda: Trigger real-time recommendations based on customer actions, such as clicks or
purchases, using Lambda functions.

e. Amazon SageMaker: Implement machine learning models to improve recommendation accuracy,
leveraging SageMaker's built-in algorithms and model training capabilities.

Strengths:
Broad range of services specifically designed for big data analytics.
Scalability to handle massive datasets.
Integration with machine learning and AI tools.
Strong security and compliance features.

Weaknesses:
Learning curve for managing and optimizing services.
Costs can increase as data volume and processing requirements grow.


Microsoft Azure:
Azure provides a robust set of services for big data analytics, including:

a. Azure Data Lake Storage: Store and manage large datasets in Azure Data Lake Storage
Gen2, ensuring high performance and security.

b. Azure HDInsight: Deploy HDInsight clusters with Hadoop, Spark, and other big data
frameworks to process and analyze data.

c. Azure SQL Data Warehouse: Use Azure SQL Data Warehouse for data warehousing
and complex queries.

Strengths:
Integration with other Microsoft products and services.
Seamless scaling capabilities.
Azure Databricks for advanced analytics and AI workloads.
Strong support for hybrid cloud solutions

Weaknesses:
Cost management can be challenging.
Learning curve for Azure-specific tools and services.

Google App Engine:
Google App Engine primarily focuses on application deployment and scaling rather than
big data analytics. However, Google Cloud Platform offers various services that can be
leveraged for big data analytics, such as:

a. Google Cloud Storage: Store large datasets in Google Cloud Storage buckets.
b. Google BigQuery: Use BigQuery for ad-hoc SQL queries and analysis of structured
data.
c. Google Dataflow: Process streaming data and create real-time recommendations using
Dataflow.

Strengths:
Integration with other Google Cloud services.
Serverless computing for application deployment.
BigQuery's fast query performance for large datasets.

Weaknesses:
Limited big data analytics services compared to AWS and Azure.
Not as suitable for complex machine learning models without additional setup.

CONCLUSION: From this practical, I learned how different cloud platforms can be used to solve
a big data analytics problem.
