BDF Programs

Exercise 1:

Build and set up Hadoop framework to run in single-node and multi-node setup.

Single-Node Hadoop Setup on Windows

Step 1: Install Java

1. Download Java: Download the Java Development Kit (JDK) from the Oracle JDK website or
OpenJDK.
2. Install Java: Follow the installation prompts and install it to a directory like
C:\Java\jdk1.8.0_281.
3. Set JAVA_HOME Environment Variable:
o Go to Control Panel > System and Security > System > Advanced system settings >
Environment Variables.
o Click New under System variables and set:
 Variable name: JAVA_HOME
 Variable value: C:\Java\jdk1.8.0_281
o Add %JAVA_HOME%\bin to the Path variable.

Step 2: Download Hadoop

1. Download Hadoop: Visit the Apache Hadoop releases page and download the latest stable
release (e.g., hadoop-3.3.6.tar.gz).
2. Extract Hadoop:
o Extract the downloaded file to a directory like C:\hadoop-3.3.6.

Step 3: Configure Hadoop for Single-Node Setup

1. Set Hadoop Environment Variables:


o Go to Control Panel > System and Security > System > Advanced system settings >
Environment Variables.
o Add a new system variable:
 Variable name: HADOOP_HOME
 Variable value: C:\hadoop-3.3.6
o Add %HADOOP_HOME%\bin to the Path variable.
2. Edit Configuration Files:
o Navigate to C:\hadoop-3.3.6\etc\hadoop.
o Edit (in Notepad) hadoop-env.cmd: Set JAVA_HOME by adding the following
line:

set JAVA_HOME=C:\Java\jdk1.8.0_281

o Edit (in Notepad) core-site.xml: Add the following configuration inside the
<configuration> tags:

<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>file:/C:/hadoop-3.3.6/tmp</value>
</property>

o Edit (in Notepad) hdfs-site.xml: Before adding the configuration, create a data folder inside C:\hadoop-3.3.6 and create two subfolders, namenode and datanode, inside it.
o Now add the following configuration to hdfs-site.xml:

<property>
<name>dfs.replication</name>
<value>1</value>
</property>

<property>
<name>dfs.namenode.name.dir</name>
<value>file:/C:/hadoop-3.3.6/data/namenode</value>
</property>

<property>
<name>dfs.datanode.data.dir</name>
<value>file:/C:/hadoop-3.3.6/data/datanode</value>
</property>

o Edit (in Notepad) mapred-site.xml: If only mapred-site.xml.template exists, rename it to mapred-site.xml, then add:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>

o Edit (in Notepad) yarn-site.xml: Add:

<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>

Step 4: Format the Hadoop Filesystem

1. Open Command Prompt and run:

C:\Users\Dell-PC>hdfs namenode -format

Step 5: Start Hadoop Daemons

1. Start HDFS (NameNode and DataNode):


o In Command Prompt, run:

C:\Users\Dell-PC>start-dfs.cmd

2. Start YARN (ResourceManager and NodeManager):


o Run:

C:\Users\Dell-PC>start-yarn.cmd

Step 6: Verify the Installation

1. Check HDFS:
o Open a web browser and go to http://localhost:9870.
2. Check YARN:
o Go to http://localhost:8088.

Multi-Node Hadoop Setup on Windows (Fully Distributed Mode)

Step 1: Set Up Network Communication

1. Ensure that all nodes (Master and Slaves) can communicate via SSH. Install an SSH client
like OpenSSH or PuTTY on each node.

Step 2: Install Java and Hadoop on All Nodes

1. Follow Steps 1 to 3 from the Single-Node setup on each machine.

Step 3: Configure Master Node

1. Edit Configuration Files on Master Node:


o Edit core-site.xml:

<property>
<name>fs.defaultFS</name>
<value>hdfs://master:9000</value> <!-- Replace 'master' with the IP or hostname of your master node -->
</property>

o Edit hdfs-site.xml:

<property>
<name>dfs.replication</name>
<value>3</value> <!-- Assuming 3 nodes; adjust accordingly -->
</property>

o Edit yarn-site.xml:

<property>
<name>yarn.resourcemanager.hostname</name>
<value>master</value> <!-- Replace with the IP or hostname of the master node -->
</property>

2. Edit masters and slaves Files:


o masters: Add the hostname or IP of the master node (e.g., master).
o slaves: List the hostnames or IPs of the slave nodes (e.g., slave1, slave2, etc.); in Hadoop 3.x this file is named workers.

Step 4: Distribute Configuration Files

1. Copy the configured core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml, masters, and slaves files from the master node to all slave nodes.

Step 5: Start the Hadoop Cluster

1. Start HDFS:
o On the master node, run:

start-dfs.cmd

2. Start YARN:
o On the master node, run:

start-yarn.cmd

Step 6: Verify the Multi-Node Setup

1. Check HDFS:
o Go to http://master:9870 (replace master with the master node's IP or hostname).
2. Check YARN:
o Go to http://master:8088.
Exercise 2:

Write and implement a text file processing program using MapReduce model.

Write the Word Count Program

Create a simple Java program to implement the Word Count logic.

1. Create a Java File: Save the following code as WordCount.java.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount
{
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>
{

private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context) throws IOException,
InterruptedException
{
String[] tokens = value.toString().split("\\s+");
for (String token : tokens)
{
word.set(token);
context.write(word, one);
}
}
}

public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable>
{

private IntWritable result = new IntWritable();

public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException,
InterruptedException
{
int sum = 0;
for (IntWritable val : values)
{
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}

public static void main(String[] args) throws Exception
{
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}

Step 3: Compile the Program

1. Open Command Prompt: Navigate to the directory where you saved WordCount.java.
2. Compile the Program: Use the following command to compile the Java file:

javac -classpath %HADOOP_HOME%\share\hadoop\common\hadoop-common-3.3.6.jar;%HADOOP_HOME%\share\hadoop\mapreduce\hadoop-mapreduce-client-core-3.3.6.jar -d . WordCount.java

3. Create a JAR File: Package the compiled classes into a JAR file:

jar -cvf wordcount.jar -C . .

Step 4: Prepare Input Data

1. Create an Input Directory in HDFS:

hadoop fs -mkdir /input

2. Upload a Text File to HDFS:

hadoop fs -put C:\path\to\your\inputfile.txt /input/

Replace C:\path\to\your\inputfile.txt with the actual path to your text file.

Step 5: Run the Word Count Program

1. Execute the Hadoop Job:

hadoop jar wordcount.jar WordCount /input /output


This command runs the Word Count job with the input data located at /input in HDFS and stores the
output at /output.

Step 6: View the Output

1. Check the Output:

hadoop fs -cat /output/part-r-00000

This command displays the word counts from the input text file.

When you are finished, stop all Hadoop daemons:

stop-all.cmd
Exercise 3:

Implement a MapReduce program using pySpark for identifying potential customers.

To implement a PySpark MapReduce program in Jupyter Notebook for identifying potential customers, follow these steps:

1. Set Up the Environment

You need to install and configure the necessary software on a Windows/Unix/Linux machine.

a. Install Java

 Download and install Java JDK.


 During installation, ensure that the path to java.exe is added to the environment variables. If
not:
1. Go to Control Panel → System → Advanced system settings.
2. Click Environment Variables and add a new variable:
 Variable name: JAVA_HOME
 Variable value: The path to your Java installation (e.g., C:\Program
Files\Java\jdk-11).
3. Edit the Path variable under system variables, and add %JAVA_HOME%\bin.

b. Install Apache Spark

 Download Apache Spark and extract it.


 Set the following environment variables:
o Variable name: SPARK_HOME
o Variable value: The Spark installation path (e.g., C:\spark).
 Edit the Path system variable, adding %SPARK_HOME%\bin.

c. Install Hadoop (Winutils)

 Download winutils.exe for Hadoop from the WinUtils GitHub repository.


 Place it in C:\hadoop\bin and set HADOOP_HOME:
o Variable name: HADOOP_HOME
o Variable value: C:\hadoop.

d. Install Python and PySpark


 Install Python from python.org.
 Install PySpark using pip:

pip install pyspark

e. Install Jupyter Notebook

Install Jupyter Notebook with pip:

pip install notebook

f. Install findspark

findspark helps to configure PySpark within Jupyter notebooks:

pip install findspark

2. Set Up PySpark in Jupyter Notebook

Now, you can configure Jupyter to run PySpark code.

1. Open a terminal or Command Prompt and run:

jupyter notebook

This will open the Jupyter Notebook interface in your web browser.

2. Create a new notebook.


3. In the first cell of the notebook, configure PySpark with findspark:

import findspark
findspark.init()

from pyspark.sql import SparkSession

3. Implement the MapReduce Program

Next, load the dataset, process the data, and apply MapReduce logic to identify potential customers.

a. Create the Dataset

First, you need a dataset to work with. For example, create a simple CSV file customer_data.csv containing customer details:

CustomerID,Age,Income,PurchaseHistory,Location
1,34,50000,"Electronics, Clothes","NY"
2,45,75000,"Groceries, Furniture","LA"
3,29,35000,"Electronics, Groceries","SF"
4,39,65000,"Clothes, Groceries","NY"

Place this file in a folder, say C:\path\to\your\data\customer_data.csv.
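
If you prefer to create this file from the notebook itself, a small helper along these lines works (the target path is illustrative; adjust it to your machine):

csv_content = """CustomerID,Age,Income,PurchaseHistory,Location
1,34,50000,"Electronics, Clothes","NY"
2,45,75000,"Groceries, Furniture","LA"
3,29,35000,"Electronics, Groceries","SF"
4,39,65000,"Clothes, Groceries","NY"
"""

# Write the sample dataset to disk (path is illustrative)
with open("C:/path/to/your/data/customer_data.csv", "w") as f:
    f.write(csv_content)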


b. Write the Code in Jupyter Notebook

In the Jupyter Notebook, implement the MapReduce process step by step.

Step 1: Initialize the Spark Session

In a new cell, write the following to initialize Spark:

# Initialize Spark Session


spark = SparkSession.builder \
.appName("Potential Customers Identification") \
.master("local[*]") \
.getOrCreate()

This creates a Spark session that allows us to work with PySpark.

Step 2: Load the Dataset

Now, load the dataset into a Spark DataFrame:

# Load the customer dataset

data_file = "C:/path/to/your/data/customer_data.csv"
customer_df = spark.read.option("header", "true").csv(data_file)

# Show the first few rows of the data

customer_df.show()

Ensure that the path to the dataset is correct.

Step 3: Data Preprocessing

Convert columns like Age and Income to integers and remove rows with missing values:

from pyspark.sql.functions import col

# Data preprocessing
customer_df = customer_df.withColumn("Age", col("Age").cast("int")) \
    .withColumn("Income", col("Income").cast("int")) \
    .dropna()

customer_df.show()

Step 4: Apply Filter Criteria for Potential Customers

Now filter the dataset based on criteria. For instance, identify potential customers who have an
income greater than $60,000 and are aged between 30 and 50:

# Filter potential customers


potential_customers = customer_df.filter((col("Income") > 60000) & (col("Age") > 30) &
(col("Age") < 50))

# Show potential customers


potential_customers.show()

Step 5: Perform MapReduce to Count Potential Customers by Location

Convert the filtered DataFrame into an RDD and apply a MapReduce approach to count the number
of potential customers by location:

# Convert to RDD and apply MapReduce


customer_rdd = potential_customers.rdd
map_rdd = customer_rdd.map(lambda row: (row['Location'], 1))
reduce_rdd = map_rdd.reduceByKey(lambda x, y: x + y)

# Collect and display the results


result = reduce_rdd.collect()
for location, count in result:
print(f"Location: {location}, Potential Customers: {count}")

Step 6: Stop the Spark Session

After running the analysis, stop the Spark session to release resources:

# Stop the Spark session


spark.stop()

4. Run and Interpret Results

Run the Jupyter notebook cell by cell. The output shows the count of potential customers by location.

OUTPUT:

Location: NY, Potential Customers: 1
Location: LA, Potential Customers: 1
Exercise 4:

Consider the dataset (attached Ex.4.txt) and write a mapper and reducer program for finding the cost
of the item that is most expensive, for each location.

Ex.4.txt (dataset)

Date Time Location Item Cost How they Paid


2012-07-16 15:43 Las Vegas Men's Clothing 208.97 Visa
2012-06-11 16:17 Miami Crafts 84.11 Amex
2012-10-17 15:30 Tucson Crafts 489.93 Cash
2012-10-25 15:01 San Francisco Men's Clothing 388.3 Visa
2012-07-13 09:01 Dallas Consumer Electronics 145.63 Cash
2012-11-06 13:02 Tampa Garden 353.23 MasterCard
2012-09-07 12:58 Washington Women's Clothing 481.31 MasterCard
2012-08-05 16:34 San Jose DVDs 492.8 Discover
2012-04-22 13:12 Newark Consumer Electronics 410.37 Visa
2012-10-19 11:35 Memphis Garden 354.44 Discover
2012-10-10 13:17 Jersey City Books 369.07 Amex
2012-04-27 11:54 Plano Women's Clothing 4.65 Cash
2012-08-28 14:56 Buffalo Video Games 337.35 Discover
2012-09-17 13:09 Louisville Music 213.64 Discover
2012-02-24 12:05 Miami Women's Clothing 154.64 Cash
2012-01-02 10:04 LosAngeles Pet Supplies 164.5 Discover
2012-11-15 15:46 Birmingham Men's Clothing 1.64 Cash
2012-03-16 11:18 Mesa Toys 13.79 Visa
2012-06-25 10:05 Wichita Consumer Electronics 158.25 Amex
2012-04-05 17:03 Indianapolis Pet Supplies 152.77 Amex
2012-11-08 15:19 San Bernardino Video Games 332.43 Discover
2012-08-08 10:09 Indianapolis Health and Beauty 464.36 Amex
2012-03-02 09:25 Stockton Men's Clothing 180.61 Discover
2012-02-27 16:12 Austin Health and Beauty 48.09 Visa
2012-12-29 16:56 Buffalo Garden 386.56 Amex
2012-03-20 09:02 Santa Ana Books 2.75 Amex
2012-10-30 11:52 Gilbert DVDs 11.31 Amex
2012-02-03 11:02 New York DVDs 221.35 Visa
2012-07-26 16:16 Corpus Christi Health and Beauty 157.91 Amex
2012-07-20 11:46 Riverside Video Games 349.41 Visa
2012-10-04 12:25 Chicago Children's Clothing 364.53 MasterCard
2012-02-04 11:53 Fremont Video Games 404.17 Cash
2012-05-31 14:43 Rochester Video Games 460.39 Amex
2012-05-25 16:11 Raleigh Computers 61.22 MasterCard
2012-05-11 12:39 Chicago Pet Supplies 431.73 Cash
2012-04-07 11:39 Cincinnati Computers 288.32 Discover
2012-04-18 16:57 Rochester Consumer Electronics 342.62 Amex
2012-12-19 10:12 Pittsburgh Books 498.29 Cash
2012-01-21 14:50 Rochester Cameras 485.71 MasterCard
2012-11-15 09:23 Glendale Video Games 14.09 Amex
2012-01-07 14:20 Cincinnati Crafts 1.41 Amex
2012-10-20 14:53 Irvine Video Games 15.19 Discover
2012-03-04 12:11 Boston Video Games 397.21 Visa
2012-01-11 09:04 Scottsdale Garden 214.3 Discover
2012-08-11 10:57 Atlanta Garden 189.22 Visa
2012-05-22 13:08 Cincinnati Men's Clothing 443.78 Visa
2012-01-11 17:20 Lubbock Garden 27.68 Cash
2012-01-16 13:31 Cincinnati Cameras 129.6 Cash
2012-02-10 10:39 Santa Ana Computers 282.13 MasterCard
2012-03-22 09:57 Aurora DVDs 82.38 Discover

Step 1: Java Mapper and Reducer

Create two classes: a Mapper and a Reducer.

Mapper Code

This class will emit the location as the key and the item cost as the value.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxCostMapper extends Mapper<LongWritable, Text, Text, Text>
{

@Override
protected void map(LongWritable key, Text value, Context context) throws IOException,
InterruptedException
{
// Split the input line into columns
String[] columns = value.toString().split("\t");

// Extract the location and cost (location is at index 2, cost is at index 4)
if (columns.length >= 5)
{
String location = columns[2];
String cost = columns[4];

try
{
// Validate that the cost is numeric, then emit the location as the key and cost as the value
Double.parseDouble(cost);
context.write(new Text(location), new Text(cost));
}
catch (NumberFormatException e)
{
// If there's an issue with parsing the cost, skip this record
e.printStackTrace();
}
}
}
}

Reducer Code

This class will receive the location and list of costs, and it will output the maximum cost for each
location.
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxCostReducer extends Reducer<Text, Text, Text, Text>
{

@Override
protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException,
InterruptedException
{
double maxCost = 0.0;

// Iterate through all costs for a given location


for (Text value : values)
{
try
{
double cost = Double.parseDouble(value.toString());
if (cost > maxCost)
{
maxCost = cost;
}
}
catch (NumberFormatException e)
{
e.printStackTrace();
}
}

// Output the location (key) and the maximum cost (value)


context.write(key, new Text(String.valueOf(maxCost)));
}
}

Step 2: Driver Code

The driver class sets up and starts the MapReduce job.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxCostDriver
{
public static void main(String[] args) throws Exception
{
if (args.length != 2)
{
System.err.println("Usage: MaxCostDriver <input path> <output path>");
System.exit(-1);
}

// Configure the MapReduce job


Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "Find Max Cost Per Location");

job.setJarByClass(MaxCostDriver.class);
job.setMapperClass(MaxCostMapper.class);
job.setReducerClass(MaxCostReducer.class);

// Set output key and value types


job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);

// Set input and output file paths


FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));

// Wait for the job to complete and exit based on the result
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}

Step 3: Compiling and Running the Job

Compiling the Program

You can compile the Java files using javac and the Hadoop libraries.

javac -classpath `hadoop classpath` -d . MaxCostMapper.java
javac -classpath `hadoop classpath` -d . MaxCostReducer.java
javac -classpath `hadoop classpath` -d . MaxCostDriver.java
jar -cvf MaxCostJob.jar *.class

Running the Job

Once compiled, you can run the job using the Hadoop command-line tool.

1. Place your input file in the HDFS:

hadoop fs -mkdir -p /user/username/input
hadoop fs -put input.txt /user/username/input/

2. Run the MapReduce job:

hadoop jar MaxCostJob.jar MaxCostDriver /user/username/input/input.txt /user/username/output

3. View the results:

hadoop fs -cat /user/username/output/part-r-00000
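
As an optional cross-check, the same per-location maximum can be computed in a few lines of PySpark (a sketch, assuming the same tab-separated input file at the HDFS path used above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MaxCostPerLocation").getOrCreate()

# Each line is tab-separated: date, time, location, item, cost, payment method
lines = spark.sparkContext.textFile("/user/username/input/input.txt")
records = lines.map(lambda line: line.split("\t")).filter(lambda cols: len(cols) >= 5)

# Build (location, cost) pairs and keep the maximum cost per location
max_costs = records.map(lambda cols: (cols[2], float(cols[4]))).reduceByKey(max)

for location, cost in max_costs.collect():
    print(location, cost)

spark.stop()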


Exercise 5:

Install Apache Spark and configure it to run on a single machine. Create workers on different
machines and configure a multi-node setup

To install Apache Spark and configure it for both single-node and multi-node (cluster) setups on a
Windows machine, follow the detailed steps below.

1.1. Install Java (JDK)

Spark requires Java to run. Download and install the JDK from the Oracle website or the OpenJDK
website.

1. After installing, set the JAVA_HOME environment variable:


o Right-click This PC > Properties > Advanced system settings > Environment
Variables.
o Click New under System Variables and set:
 Variable Name: JAVA_HOME
 Variable Value: C:\path\to\jdk (the path to the JDK directory, e.g.,
C:\Program Files\Java\jdk-1.x.x).
2. Add the bin directory of JDK to the Path variable:
o Under System Variables, select Path and click Edit.
o Add: C:\path\to\jdk\bin.
3. Verify Java installation by running this in Command Prompt:

java -version

1.2. Install Python (for PySpark, optional)

If you want to use PySpark (Spark with Python), install Python from the official website.

 Add Python to your system’s Path during installation.


 Verify by running:

python --version

1.3. Install WinUtils (Optional for HDFS Usage)

If you're using Hadoop’s HDFS, you will need winutils.exe. Download it from the WinUtils
repository and place it in a directory like C:\hadoop\bin.

2. Download and Install Apache Spark

2.1. Download Spark

 Go to the Apache Spark download page and download the latest pre-built version for
Hadoop.
 Choose:
o Package Type: Pre-built for Apache Hadoop.
o Version: Choose a stable version.
2.2. Extract Spark Files

1. Extract the Spark archive to a directory (e.g., C:\spark).


2. Set environment variables for Spark:
o Go to Advanced system settings > Environment Variables.
o Click New and set:
 Variable Name: SPARK_HOME
 Variable Value: C:\spark (path to the Spark directory).
3. Add Spark's bin directory to the Path variable:
o In the Path variable, click Edit and add: C:\spark\bin.

3. Configure Apache Spark for a Single-Node Setup (Standalone Mode)

Spark operates in a standalone mode with one master and worker running on a single machine by
default. Follow these steps to set it up:

3.1. Create a Configuration Directory

Go to C:\spark\conf and copy the following template files:

 spark-env.sh.template (Rename it to spark-env.cmd)


 slaves.template (Rename it to slaves; in Spark 3.x this template is named workers.template and is renamed to workers)

3.2. Configure spark-env.cmd

Edit the spark-env.cmd file to set the necessary environment variables. Add the following lines to
specify the master and worker behavior:

set SPARK_MASTER_HOST=localhost
set SPARK_LOCAL_IP=localhost

You can also configure memory allocation, CPU cores, etc., as needed.

3.3. Start Spark in Standalone Mode

 Open Command Prompt, navigate to the C:\spark\sbin directory, and start the master:

start-master.cmd

This starts the Spark Master; it listens for worker connections on spark://localhost:7077, and its web UI is available at http://localhost:8080.

 Start the worker (slave) node by running:

start-worker.cmd spark://localhost:7077

You can access the Spark UI at http://localhost:8080.

4. Configure a Multi-Node Cluster Setup

In a multi-node setup, you will have a master node and multiple worker nodes on different machines.
Follow these steps for a cluster setup.
4.1. Prerequisites for Each Machine

 Install Java on all machines.


 Download and extract the same Spark version on each machine.
 Ensure that the machines can communicate with each other (via IP or hostname).

4.2. Set Up the Master Node (on One Machine)

The master node will coordinate the workers.

1. Edit spark-env.cmd on the Master Node:


o Set the master host:

set SPARK_MASTER_HOST=<Master_IP_or_Hostname>
set SPARK_LOCAL_IP=<Master_IP_or_Hostname>

o Make sure that the SPARK_LOCAL_IP and SPARK_MASTER_HOST are set to the
master machine's IP address or hostname.
2. Start the Spark Master:
o Navigate to the sbin directory in Spark and run:

start-master.cmd

4.3. Set Up Worker Nodes (on Each Worker Machine)

Each worker node will run tasks assigned by the master.

1. Edit spark-env.cmd on Each Worker: Set the following in spark-env.cmd on each worker
machine:

set SPARK_LOCAL_IP=<Worker_IP_or_Hostname>

2. Add Worker IP Addresses to the Master's slaves File: On the master machine, edit the slaves file (named workers in Spark 3.x) located at C:\spark\conf. Add the IP addresses or hostnames of all the worker machines. For example:

worker1_ip
worker2_ip
worker3_ip

3. Start Workers:
o On each worker node, start the worker by pointing it to the master:

start-worker.cmd spark://<Master_IP_or_Hostname>:7077

o Each worker should now connect to the master.

4.4. Verify Multi-Node Setup

To verify that the cluster is set up properly:


1. Go to http://<Master_IP_or_Hostname>:8080 in a browser. The Spark UI should show all the
connected workers under the "Workers" tab.
2. You can now run Spark jobs across multiple nodes.

5. Running Spark Jobs

Once the cluster is set up, you can submit jobs from any node (master or workers) using the spark-submit command.

Example of submitting a Python (PySpark) job:

spark-submit --master spark://<Master_IP_or_Hostname>:7077 <path_to_your_script.py>

This will run the job across all available worker nodes.
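
For reference, here is a minimal PySpark script that could be submitted this way (an illustrative sketch; the file name sample_job.py is hypothetical):

from pyspark.sql import SparkSession

# No master is hard-coded here; spark-submit supplies it via --master
spark = SparkSession.builder.appName("SampleClusterJob").getOrCreate()

# A trivial parallel computation: sum of squares of 1..1,000,000
rdd = spark.sparkContext.parallelize(range(1, 1000001))
result = rdd.map(lambda x: x * x).sum()
print("Sum of squares:", result)

spark.stop()

Submit it with: spark-submit --master spark://<Master_IP_or_Hostname>:7077 sample_job.py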

6. Stopping the Cluster

 Stop the master:

stop-master.cmd

 Stop the workers:

stop-worker.cmd
Exercise 6:

Install Mesos and configure Spark to run with Mesos and perform dynamic resource allocation.

1. Install Mesos and Configure Spark on Linux

Step 1: Install Dependencies

Mesos requires several dependencies. Run the following commands to install them:

sudo apt-get update


sudo apt-get install -y tar wget git build-essential python-dev python-six libcurl4-nss-dev
libsasl2-dev maven

Step 2: Install Java

Spark requires Java to run, so ensure it's installed:

sudo apt-get install openjdk-11-jdk


java -version

Step 3: Download and Build Mesos

Mesos must be built from the source. Here are the steps:

cd /usr/local
sudo git clone https://github.com/apache/mesos.git
cd mesos
sudo git checkout 1.11.0 # Replace with the latest version if needed
sudo ./bootstrap
sudo mkdir build
cd build
sudo ../configure
sudo make -j4
sudo make install

Step 4: Set Up Mesos Master and Slave

Create a directory for Mesos configuration:

sudo mkdir -p /var/lib/mesos


echo '1' | sudo tee /var/lib/mesos/mesos-master-id

Start the Mesos master:

sudo mesos-master --work_dir=/var/lib/mesos --log_dir=/var/log/mesos

To set up the Mesos slave:

sudo mesos-agent --master=<Master IP>:5050 --work_dir=/var/lib/mesos --log_dir=/var/log/mesos

Step 5: Download and Configure Apache Spark

Download the Spark binaries:

wget https://archive.apache.org/dist/spark/spark-3.4.0/spark-3.4.0-bin-hadoop3.tgz
tar -xzf spark-3.4.0-bin-hadoop3.tgz
cd spark-3.4.0-bin-hadoop3

Step 6: Configure Spark to Use Mesos

Edit the spark-env.sh file to configure Spark with Mesos:

export SPARK_MASTER=mesos://<Mesos-Master-IP>:5050
export MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.so

Step 7: Enable Dynamic Resource Allocation

To enable dynamic resource allocation, edit the conf/spark-defaults.conf:

spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled true
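
The same settings can also be supplied programmatically when the Spark session is created, which is convenient for quick tests (a sketch; the Mesos master address is a placeholder):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("DynamicAllocationOnMesos")
    .master("mesos://<Mesos-Master-IP>:5050")  # placeholder master URL
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.shuffle.service.enabled", "true")
    .getOrCreate()
)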

Step 8: Run a Spark Job

Run Spark jobs on the Mesos cluster:

./bin/spark-submit --class org.apache.spark.examples.SparkPi --master mesos://<Mesos-Master-IP>:5050 examples/jars/spark-examples_2.12-3.4.0.jar 100


Exercise 7:

Using pySpark in an interactive mode perform the following tasks: i) load data from a CSV file (CRAN package download logs, http://cran-logs.rstudio.com/) ii) display the first n rows of the data iii) transform each row of data into an array iv) use the countByKey method to find the number of downloads for each package.

Code:

from pyspark.sql import SparkSession


# Initialize Spark session
spark = SparkSession.builder.appName("CRAN Package Download Logs").getOrCreate()

# Load the CSV file


df = spark.read.csv("your_path_to_file.csv", header=True, inferSchema=True)

# Display the first n rows

n = 5
df.show(n)

# Convert DataFrame to RDD and transform rows to array


rdd = df.rdd.map(lambda row: list(row.asDict().values()))

# Map to (package, 1) and count occurrences


package_counts = df.rdd.map(lambda row: (row['package'], 1)).countByKey()

# Display the package counts


for package, count in package_counts.items():
print(f"Package: {package}, Downloads: {count}")

Sample input CSV file:

date,package,version,country
2024-11-01,ggplot2,3.3.5,US
2024-11-01,dplyr,1.0.2,US
2024-11-01,ggplot2,3.3.5,CA
2024-11-02,dplyr,1.0.2,US
2024-11-02,tidyverse,1.3.0,IN
2024-11-02,ggplot2,3.3.5,US
2024-11-02,dplyr,1.0.2,IN

Output:

Package: ggplot2, Downloads: 3


Package: dplyr, Downloads: 3
Package: tidyverse, Downloads: 1
Exercise 8:

Implement the following Spark Dataframe and SQL operations;

i) Create a spark dataframe from python list and RDD


ii) Change the data frame properties
iii) filter and aggregate the data
iv) transform a dataframe column
v) build a view with the Spark DataFrame

Code:

from pyspark.sql import SparkSession


from pyspark.sql.functions import avg, concat, lit
from pyspark.sql.types import FloatType

# Initialize Spark session


spark = SparkSession.builder.appName("Spark DataFrame Operations").getOrCreate()

# Create DataFrame from a Python list


data_list = [{"name": "Alice", "age": 29, "city": "New York"},
{"name": "Bob", "age": 31, "city": "San Francisco"},
{"name": "Cathy", "age": 25, "city": "Los Angeles"}]
df_from_list = spark.createDataFrame(data_list)

# Create DataFrame from RDD


data_rdd = spark.sparkContext.parallelize([("David", 22, "Chicago"), ("Eva", 35, "Seattle")])
columns = ["name", "age", "city"]
df_from_rdd = data_rdd.toDF(columns)

# Change DataFrame properties (rename and cast)


df_renamed = df_from_list.withColumnRenamed("age", "years")
df_casted = df_from_list.withColumn("age", df_from_list["age"].cast(FloatType()))

# Filter and aggregate


df_filtered = df_from_list.filter(df_from_list.age > 25)
df_aggregated = df_from_list.agg(avg("age").alias("average_age"))

# Transform column
df_transformed = df_from_list.withColumn("city", concat(df_from_list["city"], lit(" - USA")))

# Build a view (with age renamed to years) and query it

df_view = df_transformed.withColumnRenamed("age", "years")
df_view.createOrReplaceTempView("people_view")
result_df = spark.sql("SELECT name, years, city FROM people_view WHERE years > 30")

# Show results
df_from_list.show()
df_from_rdd.show()
df_renamed.show()
df_casted.show()
df_filtered.show()
df_aggregated.show()
df_transformed.show()
result_df.show()
Output:
df_from_list.show()
----------------------------------
Name age city
-----------------------------------
Alice 29 New York
Bob 31 San Francisco
Cathy 25 Los Angeles
------------------------------------

df_from_rdd.show()
-----------------------------------
Name age city
-----------------------------------
David 22 Chicago
Eva 35 Seattle
-----------------------------------
df_renamed.show()
-----------------------------------
Name years city
-----------------------------------
Alice 29 New York
Bob 31 San Francisco
Cathy 25 Los Angeles
-----------------------------------
df_casted.show()
-----------------------------------
Name age city
-----------------------------------
Alice 29.0 New York
Bob 31.0 San Francisco
Cathy 25.0 Los Angeles
-----------------------------------
df_casted.printSchema()
root
|-- name: string (nullable = true)
|-- age: float (nullable = true)
|-- city: string (nullable = true)
df_filtered.show()
-----------------------------------
Name age city
-----------------------------------
Alice 29 New York
Bob 31 San Francisco
-----------------------------------
df_aggregated.show()
-----------------------------------
average_age
-----------------------------------
28.333333333
-----------------------------------
df_transformed.show()
-----------------------------------
name age city
-----------------------------------
Alice 29 New York - USA
Bob 31 San Francisco - USA
Cathy 25 Los Angeles - USA
-----------------------------------

result_df.show()
-----------------------------------
Name years city
-----------------------------------

Bob 31.0 San Francisco - USA


-----------------------------------
Exercise 9:

Write a pySpark program to count the number of occurrences of words in a text and use
explicit caching.

Code:

from pyspark.sql import SparkSession

# Initialize Spark session


spark = SparkSession.builder.appName("WordCountWithCaching").getOrCreate()

# Load text data


text_rdd = spark.sparkContext.textFile("sample_text.txt")

# Split lines into words


words_rdd = text_rdd.flatMap(lambda line: line.split(" "))

# Map each word to (word, 1)


word_pairs_rdd = words_rdd.map(lambda word: (word, 1))

# Cache the RDD to improve performance for repeated operations


word_pairs_rdd = word_pairs_rdd.cache()

# Count the occurrences of each word


word_counts_rdd = word_pairs_rdd.reduceByKey(lambda a, b: a + b)

# Collect and print the word counts


word_counts = word_counts_rdd.collect()
for word, count in word_counts:
print(f"Word: {word}, Count: {count}")

Output:

Word: hello, Count: 4


Word: world, Count: 2
Word: Spark, Count: 1
Word: PySpark, Count: 1
Exercise 10:

Analyze the impact of number of worker cores on a parallelized operation and use Caching to
reduce computation time.

Code:

PySpark

1. Initialize Spark and Load Data

Start a Spark session with a specified number of cores.

from pyspark.sql import SparkSession


from pyspark import SparkConf

# Configure Spark session with a specified number of cores


conf = SparkConf().setAppName("WorkerCoreAnalysis").set("spark.executor.cores", "4")
spark = SparkSession.builder.config(conf=conf).getOrCreate()

2. Create a Large RDD for Testing

We create an RDD with a large number of integers to simulate a heavy computation.

# Create an RDD with a large range of numbers


data_rdd = spark.sparkContext.parallelize(range(1, 10000000))

3. Apply Transformations and Actions with and without Caching

Perform a map and filter transformation on data_rdd, then count the elements and calculate
the sum. First, we do this without caching, and then with caching to observe the performance
improvement.

from time import time

# Without caching
start_time = time()
filtered_rdd = data_rdd.map(lambda x: x * 2).filter(lambda x: x % 3 == 0)
count_result = filtered_rdd.count()
sum_result = filtered_rdd.sum()
print("Without Caching -> Count:", count_result, ", Sum:", sum_result)
print("Time taken without caching:", time() - start_time, "seconds")

# With caching
start_time = time()
filtered_rdd = data_rdd.map(lambda x: x * 2).filter(lambda x: x % 3 == 0).cache()
count_result = filtered_rdd.count()
sum_result = filtered_rdd.sum()
print("With Caching -> Count:", count_result, ", Sum:", sum_result)
print("Time taken with caching:", time() - start_time, "seconds")

Run the code to observe the time taken in each case:


o Without Caching: Each action (count and sum) re-runs the transformations, leading
to longer computation times.
o With Caching: The transformations (map and filter) are executed only once and
stored in memory, so subsequent actions (count and sum) access the cached data,
reducing computation time significantly.

Different Core Settings

To analyze the effect of the number of cores, change the spark.executor.cores setting (e.g., from 2 to 4, 8, etc.) and observe the performance impact; a local experiment is sketched below. The optimal number of cores depends on the dataset size, the transformations, and the available hardware resources.
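
For a quick experiment on a single machine, one way to vary parallelism is to change the number of local cores via local[N] (an illustrative sketch; on a real cluster you would vary spark.executor.cores and the number of workers instead):

from time import time
from pyspark.sql import SparkSession

for cores in [1, 2, 4]:
    # Build a session pinned to N local cores
    spark = SparkSession.builder \
        .appName("CoreScalingTest") \
        .master(f"local[{cores}]") \
        .getOrCreate()

    rdd = spark.sparkContext.parallelize(range(1, 10000000), numSlices=cores * 4)

    start = time()
    total = rdd.map(lambda x: x * 2).filter(lambda x: x % 3 == 0).sum()
    print(f"cores={cores}, sum={total}, time={time() - start:.2f}s")

    spark.stop()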

Expected Observations

 Increased Cores: As the number of cores increases, the time taken to perform the
transformations and actions should decrease due to higher parallelism, up to a certain point.
 Caching Benefits: With caching, repeated actions (like count and sum in this example) on the
same data take less time because they access cached data rather than re-running
transformations.

Output:

Dataset: An RDD with 10 million integers.


Cluster Configuration: 4-worker Spark cluster, each with 4 cores (total 16 cores).
Operation: Transformations (e.g., map and filter) followed by count and sum actions.

Without Caching

 Transformation (map + filter with count action): ~6 seconds


 Transformation (map + filter with sum action): ~6 seconds
 Total Time Without Caching: 6 + 6 = ~12 seconds

With Caching

 Transformation (map + filter) and Caching: ~6 seconds


 Count Action on Cached Data: ~0.5 seconds
 Sum Action on Cached Data: ~0.5 seconds
 Total Time With Caching: 6 + 0.5 + 0.5 = ~7 seconds
