BDF Programs
Exercise 1:
Build and set up the Hadoop framework to run in single-node and multi-node setups.
1. Download Java: Download the Java Development Kit (JDK) from the Oracle JDK website or
OpenJDK.
2. Install Java: Follow the installation prompts and install it to a directory like
C:\Java\jdk1.8.0_281.
3. Set JAVA_HOME Environment Variable:
o Go to Control Panel > System and Security > System > Advanced system settings >
Environment Variables.
o Click New under System variables and set:
Variable name: JAVA_HOME
Variable value: C:\Java\jdk1.8.0_281
o Add %JAVA_HOME%\bin to the Path variable.
1. Download Hadoop: Visit the Apache Hadoop releases page and download the latest stable
release (e.g., hadoop-3.3.6.tar.gz).
2. Extract Hadoop:
o Extract the downloaded file to a directory like C:\hadoop-3.3.6.
3. Configure Hadoop (the configuration files are in C:\hadoop-3.3.6\etc\hadoop):
o Edit (in Notepad) hadoop-env.cmd and set:
set JAVA_HOME=C:\Java\jdk1.8.0_281
o Edit (in Notepad) core-site.xml: Add the following configuration inside the
<configuration> tags:
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>file:/C:/hadoop/tmp</value>
</property>
o Edit (in Notepad) hdfs-site.xml: Add the following configuration inside the <configuration> tags:
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/C:/hadoop/data/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/C:/hadoop/data/datanode</value>
</property>
o Edit (in Notepad) yarn-site.xml: Add the following configuration inside the <configuration> tags:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
4. Format the NameNode (first run only), then start the Hadoop daemons:
C:\Users\Dell-PC>hdfs namenode -format
C:\Users\Dell-PC>start-dfs.cmd
C:\Users\Dell-PC>start-yarn.cmd
1. Check HDFS:
o Open a web browser and go to http://localhost:9870.
2. Check YARN:
o Go to http://localhost:8088.
Multi-node setup:
1. Ensure that all nodes (master and slaves) can communicate via SSH. Install an SSH client like OpenSSH or PuTTY on each node.
o Edit core-site.xml:
<property>
<name>fs.defaultFS</name>
<value>hdfs://master:9000</value> <!-- Replace 'master' with the IP or hostname of your master node -->
</property>
o Edit hdfs-site.xml:
<property>
<name>dfs.replication</name>
<value>3</value> <!-- Assuming 3 nodes; adjust accordingly -->
</property>
o Edit yarn-site.xml:
<property>
<name>yarn.resourcemanager.hostname</name>
<value>master</value> <!-- Replace with the IP or hostname of the master node -->
</property>
1. Start HDFS:
o On the master node, run:
start-dfs.cmd
2. Start YARN:
o On the master node, run:
start-yarn.cmd
1. Check HDFS:
o Go to http://master:9870 (replace master with the master node's IP or hostname).
2. Check YARN:
o Go to http://master:8088.
Exercise 2:
Write and implement a text file processing program using MapReduce model.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable>
{
// Reducer for WordCount (class name assumed; the mapper and driver classes are not shown in this listing)
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException
{
int sum = 0;
for (IntWritable val : values)
{
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
1. Open Command Prompt: Navigate to the directory where you saved WordCount.java.
2. Compile the Program: Compile the Java file with javac, adding the Hadoop libraries to the classpath.
3. Create a JAR File: Package the compiled classes into a JAR file with the jar tool.
4. Run the Job: Run the JAR with the hadoop jar command on an input text file; the output displays the word counts from the input text file.
(OR)
Write and implement a text file processing program using MapReduce model.
To stop the Hadoop daemons when you are done, run:
stop-all.cmd
Exercise 3:
You need to install and configure the necessary software on a Windows/Unix/Linux machine.
a. Install Java
f. Install findspark
Launch Jupyter Notebook:
jupyter notebook
In a new notebook, initialize findspark:
import findspark
findspark.init()
Next, load the dataset, process the data, and apply MapReduce logic to identify potential customers.
First, we need a dataset to work with. For example, create a simple CSV file customer_data.csv
containing customer details:
CustomerID,Age,Income,PurchaseHistory,Location
1,34,50000,"Electronics, Clothes","NY"
2,45,75000,"Groceries, Furniture","LA"
3,29,35000,"Electronics, Groceries","SF"
4,39,65000,"Clothes, Groceries","NY"
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("PotentialCustomers").getOrCreate()  # application name is illustrative
data_file = "C:/path/to/your/data/customer_data.csv"
customer_df = spark.read.option("header", "true").csv(data_file)
customer_df.show()
Convert columns like Age and Income to integers and remove rows with missing values:
from pyspark.sql.functions import col

# Data preprocessing: cast Age and Income to integers and drop rows with missing values
customer_df = customer_df.withColumn("Age", col("Age").cast("int")) \
    .withColumn("Income", col("Income").cast("int")) \
    .dropna()
customer_df.show()
Now filter the dataset based on criteria. For instance, identify potential customers who have an
income greater than $60,000 and are aged between 30 and 50:
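A minimal sketch of this filtering step, assuming the preprocessed customer_df above (the name potential_df is illustrative):

# Filter potential customers: income greater than 60000 and age between 30 and 50
potential_df = customer_df.filter((col("Income") > 60000) & (col("Age") >= 30) & (col("Age") <= 50))
potential_df.show()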
Convert the filtered DataFrame into an RDD and apply a MapReduce approach to count the number
of potential customers by location:
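Continuing the sketch above, one possible approach is a map step followed by reduceByKey (the column name Location is taken from the sample CSV):

# Map each potential customer to a (Location, 1) pair and reduce by key to count per location
location_counts = potential_df.rdd \
    .map(lambda row: (row["Location"], 1)) \
    .reduceByKey(lambda a, b: a + b)
for location, count in location_counts.collect():
    print(location, count)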
After running the analysis, stop the Spark session to release resources:
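For example, assuming the session is named spark as above:

spark.stop()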
Run the Jupyter notebook cell by cell. The output showing the count of potential customers by
location is generated.
OUTPUT:
Exercise 4:
Consider the dataset (attached Ex.4.txt) and write a mapper and reducer program for finding the cost of the item that is most expensive, for each location.
Ex.4.txt (dataset)
Mapper Code
This class will emit the location as the key and the item cost as the value.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class MaxCostMapper extends Mapper<LongWritable, Text, Text, Text>
{
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
{
// Split the input line into columns
String[] columns = value.toString().split("\t");
// Extract the location and cost (assuming location is at index 2 and cost is at index 4)
if (columns.length >= 5)
{
String location = columns[2];
String cost = columns[4];
try
{
// Validate that the cost is numeric, then emit the location as the key and cost as the value
Double.parseDouble(cost);
context.write(new Text(location), new Text(cost));
}
catch (NumberFormatException e)
{
// Skip records whose cost cannot be parsed (for example, a header line)
}
}
}
}
Reducer Code
This class will receive the location and list of costs, and it will output the maximum cost for each
location.
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class MaxCostReducer extends Reducer<Text, Text, Text, Text>
{
@Override
protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException
{
double maxCost = 0.0;
// Track the highest cost seen for this location
for (Text val : values)
{
maxCost = Math.max(maxCost, Double.parseDouble(val.toString()));
}
// Emit the location and the cost of its most expensive item
context.write(key, new Text(String.valueOf(maxCost)));
}
}
Driver Code
This class configures and submits the MapReduce job.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class MaxCostDriver
{
public static void main(String[] args) throws Exception
{
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "Max Cost Per Location"); // job name is illustrative
job.setJarByClass(MaxCostDriver.class);
job.setMapperClass(MaxCostMapper.class);
job.setReducerClass(MaxCostReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
// Input and output paths are taken from the command line
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
// Wait for the job to complete and exit based on the result
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
You can compile the Java files using javac with the Hadoop libraries on the classpath.
Once compiled, package the classes into a JAR and run the job using the hadoop jar command-line tool.
Exercise 5:
Install Apache Spark and configure it to run on a single machine. Create workers on different machines and configure a multi-node setup.
To install Apache Spark and configure it for both single-node and multi-node (cluster) setups on a
Windows machine, follow the detailed steps below.
Spark requires Java to run. Download and install the JDK from the Oracle website or the OpenJDK
website.
java -version
If you want to use PySpark (Spark with Python), install Python from the official website.
python --version
If you're using Hadoop’s HDFS, you will need winutils.exe. Download it from the WinUtils
repository and place it in a directory like C:\hadoop\bin.
Go to the Apache Spark download page and download the latest pre-built version for
Hadoop.
Choose:
o Package Type: Pre-built for Apache Hadoop.
o Version: Choose a stable version.
2.2. Extract Spark Files
Extract the downloaded archive to a directory like C:\spark.
Spark operates in a standalone mode with one master and worker running on a single machine by
default. Follow these steps to set it up:
Edit the spark-env.cmd file to set the necessary environment variables. Add the following lines to
specify the master and worker behavior:
set SPARK_MASTER_HOST=localhost
set SPARK_LOCAL_IP=localhost
You can also configure memory allocation, CPU cores, etc., as needed.
Open Command Prompt, navigate to the C:\spark\sbin directory, and start the master:
start-master.cmd
Then start a worker on the same machine, pointing it to the master:
start-worker.cmd spark://localhost:7077
In a multi-node setup, you will have a master node and multiple worker nodes on different machines.
Follow these steps for a cluster setup.
4.1. Prerequisites for Each Machine
1. Edit spark-env.cmd on the Master: Set the following in spark-env.cmd on the master machine:
set SPARK_MASTER_HOST=<Master_IP_or_Hostname>
set SPARK_LOCAL_IP=<Master_IP_or_Hostname>
o Make sure that the SPARK_LOCAL_IP and SPARK_MASTER_HOST are set to the
master machine's IP address or hostname.
2. Start the Spark Master:
o Navigate to the sbin directory in Spark and run:
start-master.cmd
1. Edit spark-env.cmd on Each Worker: Set the following in spark-env.cmd on each worker
machine:
set SPARK_LOCAL_IP=<Worker_IP_or_Hostname>
2. Add Worker IP Addresses to the Master’s slaves File: On the master machine, edit the
slaves file located at C:\spark\conf. Add the IP addresses or hostnames of all the worker
machines. For example:
worker1_ip
worker2_ip
worker3_ip
3. Start Workers:
o On each worker node, start the worker by pointing it to the master:
start-worker.cmd spark://<Master_IP_or_Hostname>:7077
Once the cluster is set up, you can submit jobs from any node (master or workers) using the spark-
submit command.
This will run the job across all available worker nodes.
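For example, a minimal PySpark application that could be submitted to the cluster (the file name pi_sample.py and the numbers are illustrative):

# pi_sample.py - estimate Pi; the work is distributed across the available workers
from pyspark.sql import SparkSession
from random import random

spark = SparkSession.builder.appName("PiSample").getOrCreate()
n = 1000000

def inside(_):
    # Check whether a random point falls inside the unit quarter circle
    x, y = random(), random()
    return 1 if x * x + y * y <= 1 else 0

count = spark.sparkContext.parallelize(range(n)).map(inside).sum()
print("Pi is roughly", 4.0 * count / n)
spark.stop()

It could be submitted from any node with, for example: spark-submit --master spark://<Master_IP_or_Hostname>:7077 pi_sample.py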
To stop the cluster, run the following on the master and on each worker, respectively:
stop-master.cmd
stop-worker.cmd
Exercise 6:
Install Mesos and configure Spark to run with Mesos and perform dynamic resource allocation.
Mesos requires several dependencies. Run the following commands to install them:
Mesos must be built from the source. Here are the steps:
cd /usr/local
sudo git clone https://github.com/apache/mesos.git
cd mesos
sudo git checkout 1.11.0 # Replace with the latest version if needed
sudo ./bootstrap
sudo mkdir build
cd build
sudo ../configure
sudo make -j4
sudo make install
wget https://archive.apache.org/dist/spark/spark-3.4.0/spark-3.4.0-bin-hadoop3.tgz
tar -xzf spark-3.4.0-bin-hadoop3.tgz
cd spark-3.4.0-bin-hadoop3
Set the Mesos master URL and the path to the Mesos native library (for example, in conf/spark-env.sh):
export SPARK_MASTER=mesos://<Mesos-Master-IP>:5050
export MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.so
To enable dynamic resource allocation, add the following to conf/spark-defaults.conf:
spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled true
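As a quick check, a minimal PySpark sketch that starts a session on Mesos with dynamic allocation enabled (the application name is illustrative, and <Mesos-Master-IP> is a placeholder as above):

from pyspark.sql import SparkSession

# Start a session on the Mesos master with dynamic allocation enabled
spark = (SparkSession.builder
    .appName("MesosDynamicAllocationCheck")
    .master("mesos://<Mesos-Master-IP>:5050")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.shuffle.service.enabled", "true")
    .getOrCreate())

# Run a small job; executors are requested and released on demand
print(spark.sparkContext.parallelize(range(100000)).sum())
spark.stop()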
Using pySpark in an interactive mode, perform the following tasks: i) load data from a CSV file (CRAN package download logs, http://cran-logs.rstudio.com/) ii) display the first n rows of the data iii) transform each row of data into an array iv) use the countByKey method to find the number of downloads for a package.
Code:
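A minimal sketch, assuming a local extract of the CRAN logs saved as cran_logs.csv with header columns named date, package, version, country as in the sample below (the file path and the value of n are illustrative):

import findspark
findspark.init()
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CranLogs").getOrCreate()

# i) Load the CSV file into a DataFrame
logs_df = spark.read.option("header", "true").csv("C:/path/to/cran_logs.csv")

# ii) Display the first n rows
n = 5
logs_df.show(n)

# iii) Transform each row into an array of its column values
rows_rdd = logs_df.rdd.map(lambda row: [row["date"], row["package"], row["version"], row["country"]])
print(rows_rdd.take(3))

# iv) Use countByKey to find the number of downloads per package
downloads = rows_rdd.map(lambda fields: (fields[1], 1)).countByKey()
print(dict(downloads))

spark.stop()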
Output:
CSV file
date, package, version, country
2024-11-01, ggplot2, 3.3.5, US
2024-11-01, dplyr, 1.0.2, US
2024-11-01, ggplot2, 3.3.5, CA
2024-11-02, dplyr, 1.0.2, US
2024-11-02, tidyverse, 1.3.0, IN
2024-11-02, ggplot2, 3.3.5, US
2024-11-02, dplyr, 1.0.2, IN
Code:
from pyspark.sql.functions import concat, lit
# Transform column: append " - USA" to each city value
# (df_from_list and the other DataFrames shown below are created in earlier cells of the notebook)
df_transformed = df_from_list.withColumn("city", concat(df_from_list["city"], lit(" - USA")))
# Show results
df_from_list.show()
df_from_rdd.show()
df_renamed.show()
df_casted.show()
df_filtered.show()
df_aggregated.show()
df_transformed.show()
result_df.show()
Output:
df_from_list.show()
----------------------------------
Name age city
-----------------------------------
Alice 29 New York
Bob 31 San Francisco
Cathy 25 Los Angeles
------------------------------------
df_from_rdd.show()
-----------------------------------
Name age city
-----------------------------------
David 22 Chicago
Eva 35 Seattle
-----------------------------------
df_renamed.show()
-----------------------------------
Name years city
-----------------------------------
Alice 29 New York
Bob 31 San Francisco
Cathy 25 Los Angeles
-----------------------------------
df_casted.show()
-----------------------------------
Name age city
-----------------------------------
Alice 29.0 New York
Bob 31.0 San Francisco
Cathy 25.0 Los Angeles
-----------------------------------
df_casted.printSchema()
root
|-- name: string (nullable = true)
|-- age: float (nullable = true)
|-- city: string (nullable = true)
df_filtered.show()
-----------------------------------
Name age city
-----------------------------------
Alice 29 New York
Bob 31 San Francisco
-----------------------------------
df_aggregated.show()
-----------------------------------
average_age
-----------------------------------
28.333333333
-----------------------------------
df_transformed.show()
-----------------------------------
name age city
-----------------------------------
Alice 29 New York - USA
Bob 31 San Francisco - USA
Cathy 25 Los Angeles - USA
-----------------------------------
result_df.show()
-----------------------------------
Name years city
-----------------------------------
Write a pySpark program to count the number of occurrences of words in a text and use
explicit caching.
Code:
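A minimal sketch, assuming a local text file sample.txt (the path is illustrative); the pair RDD is cached explicitly so that repeated actions reuse it:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCountCached").getOrCreate()
sc = spark.sparkContext

# Read the text file and split each line into (word, 1) pairs
lines = sc.textFile("C:/path/to/sample.txt")
word_pairs = lines.flatMap(lambda line: line.split()).map(lambda word: (word, 1))

# Explicit caching: keep the pair RDD in memory for reuse
word_pairs.cache()

# Count occurrences of each word
word_counts = word_pairs.reduceByKey(lambda a, b: a + b)
for word, count in word_counts.collect():
    print(word, count)

spark.stop()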
Output:
Analyze the impact of number of worker cores on a parallelized operation and use Caching to
reduce computation time.
Code (PySpark):
Perform a map and filter transformation on data_rdd, then count the elements and calculate
the sum. First, we do this without caching, and then with caching to observe the performance
improvement.
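The timing code below assumes that a SparkSession and data_rdd already exist; a minimal setup sketch (the dataset size and the local[2] core count are illustrative) is:

from time import time
from pyspark.sql import SparkSession

# Create a session with a fixed number of local cores (2 here; vary this to compare runs)
spark = SparkSession.builder.master("local[2]").appName("CoreAndCacheAnalysis").getOrCreate()
sc = spark.sparkContext

# Parallelize a range of numbers into an RDD
data_rdd = sc.parallelize(range(1, 5000001))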
# Without caching
start_time = time()
filtered_rdd = data_rdd.map(lambda x: x * 2).filter(lambda x: x % 3 == 0)
count_result = filtered_rdd.count()
sum_result = filtered_rdd.sum()
print("Without Caching -> Count:", count_result, ", Sum:", sum_result)
print("Time taken without caching:", time() - start_time, "seconds")
# With caching
start_time = time()
filtered_rdd = data_rdd.map(lambda x: x * 2).filter(lambda x: x % 3 == 0).cache()
count_result = filtered_rdd.count()
sum_result = filtered_rdd.sum()
print("With Caching -> Count:", count_result, ", Sum:", sum_result)
print("Time taken with caching:", time() - start_time, "seconds")
To analyze the effect of the number of cores, change the spark.executor.cores setting (e.g., from 2 to 4, 8, etc.) and observe the performance impact. The appropriate number of cores depends on the dataset size, the transformations, and the hardware resources.
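For example (a sketch; the core count and settings are illustrative), the session can be rebuilt with more cores and the timings above re-run:

# Stop the current session and start one with more local cores
spark.stop()
spark = (SparkSession.builder
    .master("local[4]")
    .config("spark.executor.cores", "4")
    .appName("CoreAndCacheAnalysis")
    .getOrCreate())
sc = spark.sparkContext
data_rdd = sc.parallelize(range(1, 5000001))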
Expected Observations
Increased Cores: As the number of cores increases, the time taken to perform the
transformations and actions should decrease due to higher parallelism, up to a certain point.
Caching Benefits: With caching, repeated actions (like count and sum in this example) on the
same data take less time because they access cached data rather than re-running
transformations.
Output:
Without Caching
With Caching