LAB MANUAL
PRACTICAL INDEX
Sr. No. | AIM | Assigned Date | Completion Date | Grade | Assessment Date | Signature
1 | To install the Hadoop framework, configure it, and set up a single-node cluster. Use web-based tools to monitor your Hadoop setup. | | | | |
2 | To implement file management tasks in Hadoop HDFS and perform Hadoop commands. | | | | |
3 | To implement basic functions and commands in R programming. To build a WordCloud, a text-mining method in R that provides an easier-to-understand and better visualization than a data table. | | | | |
4 | To implement a word count application using the MapReduce programming model. | | | | |
Commands:
1.) ls:
This command is used to list all the files and directories present in the Hadoop file system (HDFS).
2.) mkdir:
This command is used to create a new directory in HDFS.
7.) rmdir:
This command is used to remove an empty directory from HDFS.
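These file-management operations can also be issued programmatically. The following is a minimal sketch (not part of the original practical) using the Hadoop FileSystem Java API; the directory /user/hduser/demo is a placeholder chosen for illustration.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileOps {
    public static void main(String[] args) throws IOException {
        // Load the cluster configuration (core-site.xml / hdfs-site.xml on the classpath)
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Equivalent of "hadoop fs -mkdir": create a new directory (placeholder path)
        Path dir = new Path("/user/hduser/demo");
        fs.mkdirs(dir);

        // Equivalent of "hadoop fs -ls /": list files and directories under the root
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }

        // Equivalent of "hadoop fs -rmdir": delete the directory (non-recursive, so it must be empty)
        fs.delete(dir, false);

        fs.close();
    }
}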
CONCLUSION:
In this practical, we performed various basic Hadoop commands to create, remove, and copy files in the Hadoop file system.
# Inspect the corpus text
strwrap(docs1)
# Clean the text: convert everything to lower case
docs1 <- tm_map(docs1, content_transformer(tolower))
OUTPUT:
CONCLUSION:
o The reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the name MapReduce implies, the reduce job is always performed after the map job.
o MapReduce programming offers several benefits for gaining valuable insight from big data, chief among them scalability: businesses can process petabytes of data stored in the Hadoop Distributed File System (HDFS).
CODE:
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
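  // The listing above breaks off after the class declaration. The body below is a sketch
  // that completes it along the lines of the standard Hadoop WordCount example
  // (mapper, reducer, driver); the names TokenizerMapper and IntSumReducer follow that example.

  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Emit (word, 1) for every token in the input line.
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    // Sum the counts emitted for each word.
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: configure and submit the job; args[0] is the input path, args[1] the output path.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}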
o Now we run the JAR file on the Hadoop cluster using the hadoop jar command, after which we can see the final output.
//BDA_Mapper.java
import java.io.*;
import java.util.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Mapper;

// Mapper: counts the occurrences of each number in its input split in memory
// and emits the (number, count) pairs once the split is fully processed.
public class BDA_Mapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private TreeMap<String, Integer> tmap;

    @Override
    public void setup(Context context) throws IOException, InterruptedException {
        tmap = new TreeMap<String, Integer>();
    }
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Increment the in-memory count for this number.
        String number = value.toString().trim();
        if (tmap.containsKey(number)) {
            int count = tmap.get(number);
            tmap.put(number, count + 1);
        } else {
            tmap.put(number, 1);
        }
    }

    @Override
    public void cleanup(Context context) throws IOException, InterruptedException {
        // Emit the accumulated (number, count) pairs.
        for (Map.Entry<String, Integer> entry : tmap.entrySet()) {
            String number = entry.getKey();
            int count = entry.getValue();
            context.write(new Text(number), new LongWritable(count));
        }
    }
}
//BDA_Reducer.java
import java.io.*;
import java.util.*;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Reducer;

// Reducer: aggregates the per-split counts, then emits every number with its total
// count along with the maximum number and the (integer) average of all values.
public class BDA_Reducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    private TreeMap<String, Long> tmap2;
    private int max = Integer.MIN_VALUE, unique = 0, cnt = 0;
    private long sum = 0;

    @Override
    public void setup(Context context) throws IOException, InterruptedException {
        tmap2 = new TreeMap<String, Long>();
    }

    @Override
    public void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        String number = key.toString();
        long count = 0;
        for (LongWritable val : values) {
            count += val.get();
            sum += ((int) val.get()) * Integer.parseInt(number.trim());
        }
        tmap2.put(number, count);
        cnt += count;
        if (max < Integer.parseInt(number.trim())) {
            max = Integer.parseInt(number.trim());
        }
        unique++;   // number of distinct keys seen
    }

    @Override
    public void cleanup(Context context) throws IOException, InterruptedException {
        for (Map.Entry<String, Long> entry : tmap2.entrySet()) {
            Long count = entry.getValue();
            String name = entry.getKey();
            context.write(new Text(name), new LongWritable(count));
        }
        context.write(new Text("MAX NUMBER = "), new LongWritable(max));
        // Integer average of all values
        context.write(new Text("AVERAGE = "), new LongWritable(sum / cnt));
    }
}
OUTPUT:
In Eclipse, right-click MyProject, select Build Path -> Configure Build Path, click Add External JARs..., add the Hadoop JARs from their download location, and click Apply and Close.
Now export the project as a JAR file: right-click MyProject, choose Export..., go to Java -> JAR file, click Next, choose your export destination, click Next, choose the Main Class by clicking Browse, and then click Finish -> OK.
CONCLUSION:
In this practical, we learnt about the MapReduce paradigm in detail and also executed a simple operation on a large file of integers using MapReduce in Java.
PRACTICAL 6
AIM: To implement basic CRUD operations (create, read, update, delete) in
MongoDB and Cassandra.
CODE:
# Create a Docker network and start a Cassandra container attached to it
sudo docker network create cassandra-network
sudo docker network ls
sudo docker run -p 9042:9042 --rm -it -d -e CASSANDRA_PASSWORD=temp --network cassandra-network cassandra
sudo docker ps
sudo docker exec -it 05f99656aef3 bash
cqlsh -u cassandra -p temp

-- Create a keyspace and a table, then perform the CRUD operations
CREATE KEYSPACE IF NOT EXISTS charusat_db WITH REPLICATION = {'class':'NetworkTopologyStrategy','datacenter1':3};
DESCRIBE charusat_db;
USE charusat_db;
CREATE TABLE depstar(id int PRIMARY KEY, firstname text, lastname text, email text);
SELECT * FROM depstar;
INSERT INTO depstar(id, firstname, lastname, email) VALUES(1, 'abc', 'xyz', '[email protected]');
INSERT INTO depstar(id, firstname, lastname, email) VALUES(2, 'def', 'xyz', '[email protected]');
UPDATE depstar SET firstname='temp' WHERE id=1;
DELETE FROM depstar WHERE id=1;
MongoDB
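The MongoDB part of this practical was carried out interactively. As an illustrative sketch only (not the commands used in the lab), the equivalent create/read/update/delete steps with the MongoDB Java driver might look like the following, assuming the driver is on the classpath and MongoDB is reachable at localhost:27017; the database, collection, and field names simply mirror the Cassandra example above.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;
import static com.mongodb.client.model.Filters.eq;
import static com.mongodb.client.model.Updates.set;

public class MongoCrudDemo {
    public static void main(String[] args) {
        // Connect to a local MongoDB instance (assumed connection string)
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase db = client.getDatabase("charusat_db");
            MongoCollection<Document> depstar = db.getCollection("depstar");

            // Create
            depstar.insertOne(new Document("id", 1)
                    .append("firstname", "abc")
                    .append("lastname", "xyz"));
            depstar.insertOne(new Document("id", 2)
                    .append("firstname", "def")
                    .append("lastname", "xyz"));

            // Read
            for (Document doc : depstar.find()) {
                System.out.println(doc.toJson());
            }

            // Update
            depstar.updateOne(eq("id", 1), set("firstname", "temp"));

            // Delete
            depstar.deleteOne(eq("id", 1));
        }
    }
}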
CONCLUSION:
In this practical, we learnt about the CRUD operations in MongoDB and Cassandra.
// Reducer
public static class MaxTemperatureReducer extends Reducer<Text, Text, Text, Text> {
    // Emit the temperature value produced by the mapper for each key.
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String temperature = values.iterator().next().toString();
        context.write(key, new Text(temperature));
    }
}

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "weather example");
    job.setJarByClass(MyMaxMin.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    job.setMapperClass(MaxTemperatureMapper.class);
    job.setReducerClass(MaxTemperatureReducer.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    Path OutputPath = new Path(args[1]);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // Delete any existing output directory so the job can be re-run.
    OutputPath.getFileSystem(conf).delete(OutputPath, true);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
OUTPUT:
In Eclipse, right-click MyProject, select Build Path -> Configure Build Path, click Add External JARs..., add the Hadoop JARs from their download location, and click Apply and Close.
Now export the project as a JAR file: right-click MyProject, choose Export..., go to Java -> JAR file, click Next, choose your export destination, click Next, choose MyMaxMin as the Main Class by clicking Browse, and then click Finish -> OK.
CONCLUSION:
In this practical, we learnt about the MapReduce paradigm in detail and also executed a simple program to find the maximum and minimum temperatures from weather data using MapReduce in Java.
OUTPUT:
2) Group of data
3) Join of Data
4) Filter data
CONCLUSION:
In this practical, we learnt how to write Pig Latin scripts to sort, group, join, project, and filter data.
AIM: To perform Sentiment Analysis using Twitter data, Scala and Spark.
CODE:
# Preprocess the tweet: replace user mentions and links with placeholder tokens
tweet_words = []
for word in tweet.split(' '):
    if word.startswith('@') and len(word) > 1:
        word = '@user'
    elif word.startswith('http'):
        word = "http"
    tweet_words.append(word)
tweet_proc = " ".join(tweet_words)

# Load the pretrained model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained(roberta)
tokenizer = AutoTokenizer.from_pretrained(roberta)

# Sentiment analysis
encoded_tweet = tokenizer(tweet_proc, return_tensors='pt')
# output = model(encoded_tweet['input_ids'], encoded_tweet['attention_mask'])
output = model(**encoded_tweet)
scores = output[0][0].detach().numpy()
scores = softmax(scores)

for i in range(len(scores)):
    l = labels[i]
    s = scores[i]
    print(l, s)
OUTPUT:
CONCLUSION: From this practical, I learned how to perform sentiment analysis on Twitter data using Spark.
AIM: To perform Graph Path and Connectivity analytics and implement basic queries after loading
data using Neo4j
CODE:
1. Install Neo4j:
2. Creating Project:
All Nodes:
5. Querying Relationship:
7. Connectivity Analysis:
AIM: To perform a case study of the following platforms for solving a big data analytics problem of your choice.
(1) Amazon Web Services,
(2) Microsoft Azure,
(3) Google App Engine
Problem Statement:
Imagine a global e-commerce company, "EcomCorp," wants to optimize its product recommendation
engine to increase customer engagement and sales. They have terabytes of customer data, including
purchase history, browsing behavior, and demographics. EcomCorp aims to leverage big data
analytics to provide personalized product recommendations to its customers in real-time.
To address this challenge, EcomCorp will evaluate three major cloud platforms for their big data
analytics capabilities: Amazon Web Services (AWS), Microsoft Azure, and Google App Engine. We
will analyze each platform's strengths and weaknesses in the context of this use case.
1. Amazon Web Services (AWS)
AWS offers a comprehensive suite of services for big data analytics. EcomCorp can use the following AWS services to address its problem:
a. Amazon S3 (Simple Storage Service): Store large volumes of customer data in S3 buckets,
making it highly durable and scalable.
b. Amazon EMR (Elastic MapReduce): Deploy EMR clusters to process and analyze data using
popular tools like Apache Spark, Hadoop, and Presto. This allows EcomCorp to extract valuable
insights from their data.
c. Amazon Redshift: Utilize Redshift as a data warehousing solution to store and query large
datasets for analytics and reporting.
d. AWS Lambda: Trigger real-time recommendations based on customer actions, such as clicks or
purchases, using Lambda functions.
Strengths:
Broad range of services specifically designed for big data analytics.
Scalability to handle massive datasets.
Integration with machine learning and AI tools.
Strong security and compliance features.
Weaknesses:
Learning curve for managing and optimizing services.
Costs can increase as data volume and processing requirements grow.
2. Microsoft Azure
a. Azure Data Lake Storage: Store and manage large datasets in Azure Data Lake Storage
Gen2, ensuring high performance and security.
b. Azure HDInsight: Deploy HDInsight clusters with Hadoop, Spark, and other big data
frameworks to process and analyze data.
c. Azure SQL Data Warehouse: Use Azure SQL Data Warehouse for data warehousing
and complex queries.
Strengths:
Integration with other Microsoft products and services.
Seamless scaling capabilities.
Azure Databricks for advanced analytics and AI workloads.
Strong support for hybrid cloud solutions
Weaknesses:
Cost management can be challenging.
Learning curve for Azure-specific tools and services.
3. Google App Engine
a. Google Cloud Storage: Store large datasets in Google Cloud Storage buckets.
b. Google BigQuery: Use BigQuery for ad-hoc SQL queries and analysis of structured
data.
c. Google Dataflow: Process streaming data and create real-time recommendations using
Dataflow.
Strengths:
Integration with other Google Cloud services.
Serverless computing for application deployment.
BigQuery's fast query performance for large datasets.
Weaknesses:
Limited big data analytics services compared to AWS and Azure.
Not as suitable for complex machine learning models without additional setup.
CONCLUSION: From this practical, I learned how different cloud platforms can be used to solve a big data analytics problem.