BDF Programs
Exercise 1:
Build and set up the Hadoop framework to run in single-node and multi-node setups.
1. Download Java: Download the Java Development Kit (JDK) from the Oracle JDK website or
OpenJDK.
2. Install Java: Follow the installation prompts and install it to a directory like
C:\Java\jdk1.8.0_281.
3. Set JAVA_HOME Environment Variable:
o Go to Control Panel > System and Security > System > Advanced system settings >
Environment Variables.
o Click New under System variables and set:
Variable name: JAVA_HOME
Variable value: C:\Java\jdk1.8.0_281
o Add %JAVA_HOME%\bin to the Path variable.
1. Download Hadoop: Visit the Apache Hadoop releases page and download the latest stable
release (e.g., hadoop-3.3.6.tar.gz).
2. Extract Hadoop:
o Extract the downloaded file to a directory like C:\hadoop-3.3.6.
3. Configure Hadoop (the configuration files are in C:\hadoop-3.3.6\etc\hadoop):
o Edit (in Notepad) hadoop-env.cmd and set:
set JAVA_HOME=C:\Java\jdk1.8.0_281
o Edit (in Notepad) core-site.xml: Add the following configuration inside the
<configuration> tags:
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>file:/C:/hadoop/tmp</value>
</property>
o Edit (in Notepad) hdfs-site.xml: Add the following configuration inside the <configuration> tags:
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/C:/hadoop/data/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/C:/hadoop/data/datanode</value>
</property>
o Edit (in Notepad) yarn-site.xml: Add the following configuration inside the <configuration> tags:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
4. Format the NameNode (first run only), then start the Hadoop daemons:
C:\Users\Dell-PC>hdfs namenode -format
C:\Users\Dell-PC>start-dfs.cmd
C:\Users\Dell-PC>start-yarn.cmd
1. Check HDFS:
o Open a web browser and go to http://localhost:9870.
2. Check YARN:
o Go to http://localhost:8088.
Multi-node setup:
1. Ensure that all nodes (master and slaves) can communicate via SSH. Install an SSH client like OpenSSH or PuTTY on each node.
o Edit core-site.xml:
<property>
<name>fs.defaultFS</name>
<value>hdfs://master:9000</value> <!-- Replace 'master' with the IP or hostname of your master node -->
</property>
o Edit hdfs-site.xml:
<property>
<name>dfs.replication</name>
<value>3</value> <!-- Assuming 3 nodes; adjust accordingly -->
</property>
o Edit yarn-site.xml:
<property>
<name>yarn.resourcemanager.hostname</name>
<value>master</value> <!-- Replace with the IP or hostname of the master node -->
</property>
1. Start HDFS:
o On the master node, run:
start-dfs.cmd
2. Start YARN:
o On the master node, run:
start-yarn.cmd
1. Check HDFS:
o Go to http://master:9870 (replace master with the master node's IP or hostname).
2. Check YARN:
o Go to http://master:8088.
Exercise 2:
Write and implement a text file processing program using MapReduce model.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable>
{
// Reducer for WordCount (class name assumed; the mapper and driver classes are not shown in this listing)
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException
{
int sum = 0;
for (IntWritable val : values)
{
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
1. Open Command Prompt: Navigate to the directory where you saved WordCount.java.
2. Compile the Program: Compile the Java file with javac, adding the Hadoop libraries to the classpath.
3. Create a JAR File: Package the compiled classes into a JAR file with the jar tool.
4. Run the Job: Run the JAR with the hadoop jar command on an input text file; the output displays the word counts from the input text file.
(OR)
Write and implement a text file processing program using MapReduce model.
To stop the Hadoop daemons when you are done, run:
stop-all.cmd
Exercise 3:
You need to install and configure the necessary software on a Windows/Unix/Linux machine.
a. Install Java
f. Install findspark
Launch Jupyter Notebook:
jupyter notebook
In a new notebook, initialize findspark:
import findspark
findspark.init()
Next, load the dataset, process the data, and apply MapReduce logic to identify potential customers.
First, we need a dataset to work with. For example, create a simple CSV file customer_data.csv
containing customer details:
CustomerID,Age,Income,PurchaseHistory,Location
1,34,50000,"Electronics, Clothes","NY"
2,45,75000,"Groceries, Furniture","LA"
3,29,35000,"Electronics, Groceries","SF"
4,39,65000,"Clothes, Groceries","NY"
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("PotentialCustomers").getOrCreate()  # application name is illustrative
data_file = "C:/path/to/your/data/customer_data.csv"
customer_df = spark.read.option("header", "true").csv(data_file)
customer_df.show()
Convert columns like Age and Income to integers and remove rows with missing values:
from pyspark.sql.functions import col

# Data preprocessing: cast Age and Income to integers and drop rows with missing values
customer_df = customer_df.withColumn("Age", col("Age").cast("int")) \
    .withColumn("Income", col("Income").cast("int")) \
    .dropna()
customer_df.show()
Now filter the dataset based on criteria. For instance, identify potential customers who have an
income greater than $60,000 and are aged between 30 and 50:
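A minimal sketch of this filtering step, assuming the preprocessed customer_df above (the name potential_df is illustrative):

# Filter potential customers: income greater than 60000 and age between 30 and 50
potential_df = customer_df.filter((col("Income") > 60000) & (col("Age") >= 30) & (col("Age") <= 50))
potential_df.show()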
Convert the filtered DataFrame into an RDD and apply a MapReduce approach to count the number
of potential customers by location:
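Continuing the sketch above, one possible approach is a map step followed by reduceByKey (the column name Location is taken from the sample CSV):

# Map each potential customer to a (Location, 1) pair and reduce by key to count per location
location_counts = potential_df.rdd \
    .map(lambda row: (row["Location"], 1)) \
    .reduceByKey(lambda a, b: a + b)
for location, count in location_counts.collect():
    print(location, count)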
After running the analysis, stop the Spark session to release resources:
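For example, assuming the session is named spark as above:

spark.stop()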
Run the Jupyter notebook cell by cell. The output showing the count of potential customers by
location is generated.
OUTPUT:
Exercise 4:
Consider the dataset (attached Ex.4.txt) and write a mapper and reducer program for finding the cost of the item that is most expensive, for each location.
Ex.4.txt (dataset)
Mapper Code
This class will emit the location as the key and the item cost as the value.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class MaxCostMapper extends Mapper<LongWritable, Text, Text, Text>
{
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
{
// Split the input line into columns
String[] columns = value.toString().split("\t");
// Extract the location and cost (assuming location is at index 2 and cost is at index 4)
if (columns.length >= 5)
{
String location = columns[2];
String cost = columns[4];
try
{
// Validate that the cost is numeric, then emit the location as the key and cost as the value
Double.parseDouble(cost);
context.write(new Text(location), new Text(cost));
}
catch (NumberFormatException e)
{
// Skip records whose cost cannot be parsed (for example, a header line)
}
}
}
}
Reducer Code
This class will receive the location and list of costs, and it will output the maximum cost for each
location.
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class MaxCostReducer extends Reducer<Text, Text, Text, Text>
{
@Override
protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException
{
double maxCost = 0.0;
// Track the highest cost seen for this location
for (Text val : values)
{
maxCost = Math.max(maxCost, Double.parseDouble(val.toString()));
}
// Emit the location and the cost of its most expensive item
context.write(key, new Text(String.valueOf(maxCost)));
}
}
Driver Code
This class configures and submits the MapReduce job.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class MaxCostDriver
{
public static void main(String[] args) throws Exception
{
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "Max Cost Per Location"); // job name is illustrative
job.setJarByClass(MaxCostDriver.class);
job.setMapperClass(MaxCostMapper.class);
job.setReducerClass(MaxCostReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
// Input and output paths are taken from the command line
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
// Wait for the job to complete and exit based on the result
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
You can compile the Java files using javac with the Hadoop libraries on the classpath.
Once compiled, package the classes into a JAR and run the job using the hadoop jar command-line tool.
Exercise 5:
Install Apache Spark and configure it to run on a single machine. Create workers on different machines and configure a multi-node setup.
To install Apache Spark and configure it for both single-node and multi-node (cluster) setups on a
Windows machine, follow the detailed steps below.
Spark requires Java to run. Download and install the JDK from the Oracle website or the OpenJDK
website.
java -version
If you want to use PySpark (Spark with Python), install Python from the official website.
python --version
If you're using Hadoop’s HDFS, you will need winutils.exe. Download it from the WinUtils
repository and place it in a directory like C:\hadoop\bin.
Go to the Apache Spark download page and download the latest pre-built version for
Hadoop.
Choose:
o Package Type: Pre-built for Apache Hadoop.
o Version: Choose a stable version.
2.2. Extract Spark Files
Extract the downloaded archive to a directory like C:\spark.
Spark operates in a standalone mode with one master and worker running on a single machine by
default. Follow these steps to set it up:
Edit the spark-env.cmd file to set the necessary environment variables. Add the following lines to
specify the master and worker behavior:
set SPARK_MASTER_HOST=localhost
set SPARK_LOCAL_IP=localhost
You can also configure memory allocation, CPU cores, etc., as needed.
Open Command Prompt, navigate to the C:\spark\sbin directory, and start the master:
start-master.cmd
Then start a worker on the same machine, pointing it to the master:
start-worker.cmd spark://localhost:7077
In a multi-node setup, you will have a master node and multiple worker nodes on different machines.
Follow these steps for a cluster setup.
4.1. Prerequisites for Each Machine
1. Edit spark-env.cmd on the Master: Set the following in spark-env.cmd on the master machine:
set SPARK_MASTER_HOST=<Master_IP_or_Hostname>
set SPARK_LOCAL_IP=<Master_IP_or_Hostname>
o Make sure that the SPARK_LOCAL_IP and SPARK_MASTER_HOST are set to the
master machine's IP address or hostname.
2. Start the Spark Master:
o Navigate to the sbin directory in Spark and run:
start-master.cmd
1. Edit spark-env.cmd on Each Worker: Set the following in spark-env.cmd on each worker
machine:
set SPARK_LOCAL_IP=<Worker_IP_or_Hostname>
2. Add Worker IP Addresses to the Master’s slaves File: On the master machine, edit the
slaves file located at C:\spark\conf. Add the IP addresses or hostnames of all the worker
machines. For example:
worker1_ip
worker2_ip
worker3_ip
3. Start Workers:
o On each worker node, start the worker by pointing it to the master:
start-worker.cmd spark://<Master_IP_or_Hostname>:7077
Once the cluster is set up, you can submit jobs from any node (master or workers) using the spark-
submit command.
This will run the job across all available worker nodes.
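For example, a minimal PySpark application that could be submitted to the cluster (the file name pi_sample.py and the numbers are illustrative):

# pi_sample.py - estimate Pi; the work is distributed across the available workers
from pyspark.sql import SparkSession
from random import random

spark = SparkSession.builder.appName("PiSample").getOrCreate()
n = 1000000

def inside(_):
    # Check whether a random point falls inside the unit quarter circle
    x, y = random(), random()
    return 1 if x * x + y * y <= 1 else 0

count = spark.sparkContext.parallelize(range(n)).map(inside).sum()
print("Pi is roughly", 4.0 * count / n)
spark.stop()

It could be submitted from any node with, for example: spark-submit --master spark://<Master_IP_or_Hostname>:7077 pi_sample.py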
To stop the cluster, run the following on the master and on each worker, respectively:
stop-master.cmd
stop-worker.cmd
Exercise 6:
Install Mesos and configure Spark to run with Mesos and perform dynamic resource allocation.
Mesos requires several dependencies. Run the following commands to install them:
Mesos must be built from the source. Here are the steps:
cd /usr/local
sudo git clone https://github.com/apache/mesos.git
cd mesos
sudo git checkout 1.11.0 # Replace with the latest version if needed
sudo ./bootstrap
sudo mkdir build
cd build
sudo ../configure
sudo make -j4
sudo make install
wget https://archive.apache.org/dist/spark/spark-3.4.0/spark-3.4.0-bin-hadoop3.tgz
tar -xzf spark-3.4.0-bin-hadoop3.tgz
cd spark-3.4.0-bin-hadoop3
Set the Mesos master URL and the path to the Mesos native library (for example, in conf/spark-env.sh):
export SPARK_MASTER=mesos://<Mesos-Master-IP>:5050
export MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.so
To enable dynamic resource allocation, add the following to conf/spark-defaults.conf:
spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled true
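As a quick check, a minimal PySpark sketch that starts a session on Mesos with dynamic allocation enabled (the application name is illustrative, and <Mesos-Master-IP> is a placeholder as above):

from pyspark.sql import SparkSession

# Start a session on the Mesos master with dynamic allocation enabled
spark = (SparkSession.builder
    .appName("MesosDynamicAllocationCheck")
    .master("mesos://<Mesos-Master-IP>:5050")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.shuffle.service.enabled", "true")
    .getOrCreate())

# Run a small job; executors are requested and released on demand
print(spark.sparkContext.parallelize(range(100000)).sum())
spark.stop()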
Using pySpark in an interactive mode, perform the following tasks: i) load data from a CSV file (CRAN package download logs, http://cran-logs.rstudio.com/) ii) display the first n rows of the data iii) transform each row of data into an array iv) use the countByKey method to find the number of downloads for a package.
Code:
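A minimal sketch, assuming a local extract of the CRAN logs saved as cran_logs.csv with header columns named date, package, version, country as in the sample below (the file path and the value of n are illustrative):

import findspark
findspark.init()
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CranLogs").getOrCreate()

# i) Load the CSV file into a DataFrame
logs_df = spark.read.option("header", "true").csv("C:/path/to/cran_logs.csv")

# ii) Display the first n rows
n = 5
logs_df.show(n)

# iii) Transform each row into an array of its column values
rows_rdd = logs_df.rdd.map(lambda row: [row["date"], row["package"], row["version"], row["country"]])
print(rows_rdd.take(3))

# iv) Use countByKey to find the number of downloads per package
downloads = rows_rdd.map(lambda fields: (fields[1], 1)).countByKey()
print(dict(downloads))

spark.stop()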
Output:
CSV file
date, package, version, country
2024-11-01, ggplot2, 3.3.5, US
2024-11-01, dplyr, 1.0.2, US
2024-11-01, ggplot2, 3.3.5, CA
2024-11-02, dplyr, 1.0.2, US
2024-11-02, tidyverse, 1.3.0, IN
2024-11-02, ggplot2, 3.3.5, US
2024-11-02, dplyr, 1.0.2, IN
Code:
from pyspark.sql.functions import concat, lit
# Transform column: append " - USA" to each city value
# (df_from_list and the other DataFrames shown below are created in earlier cells of the notebook)
df_transformed = df_from_list.withColumn("city", concat(df_from_list["city"], lit(" - USA")))
# Show results
df_from_list.show()
df_from_rdd.show()
df_renamed.show()
df_casted.show()
df_filtered.show()
df_aggregated.show()
df_transformed.show()
result_df.show()
Output:
df_from_list.show()
----------------------------------
Name age city
-----------------------------------
Alice 29 New York
Bob 31 San Francisco
Cathy 25 Los Angeles
------------------------------------
df_from_rdd.show()
-----------------------------------
Name age city
-----------------------------------
David 22 Chicago
Eva 35 Seattle
-----------------------------------
df_renamed.show()
-----------------------------------
Name years city
-----------------------------------
Alice 29 New York
Bob 31 San Francisco
Cathy 25 Los Angeles
-----------------------------------
df_casted.show()
-----------------------------------
Name age city
-----------------------------------
Alice 29.0 New York
Bob 31.0 San Francisco
Cathy 25.0 Los Angeles
-----------------------------------
df_casted.printSchema()
root
|-- name: string (nullable = true)
|-- age: float (nullable = true)
|-- city: string (nullable = true)
df_filtered.show()
-----------------------------------
Name age city
-----------------------------------
Alice 29 New York
Bob 31 San Francisco
-----------------------------------
df_aggregated.show()
-----------------------------------
average_age
-----------------------------------
28.333333333
-----------------------------------
df_transformed.show()
-----------------------------------
name age city
-----------------------------------
Alice 29 New York - USA
Bob 31 San Francisco - USA
Cathy 25 Los Angeles - USA
-----------------------------------
result_df.show()
-----------------------------------
Name years city
-----------------------------------
Write a pySpark program to count the number of occurrences of words in a text and use
explicit caching.
Code:
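A minimal sketch, assuming a local text file sample.txt (the path is illustrative); the pair RDD is cached explicitly so that repeated actions reuse it:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCountCached").getOrCreate()
sc = spark.sparkContext

# Read the text file and split each line into (word, 1) pairs
lines = sc.textFile("C:/path/to/sample.txt")
word_pairs = lines.flatMap(lambda line: line.split()).map(lambda word: (word, 1))

# Explicit caching: keep the pair RDD in memory for reuse
word_pairs.cache()

# Count occurrences of each word
word_counts = word_pairs.reduceByKey(lambda a, b: a + b)
for word, count in word_counts.collect():
    print(word, count)

spark.stop()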
Output:
Analyze the impact of number of worker cores on a parallelized operation and use Caching to
reduce computation time.
Code (PySpark):
Perform a map and filter transformation on data_rdd, then count the elements and calculate
the sum. First, we do this without caching, and then with caching to observe the performance
improvement.
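The timing code below assumes that a SparkSession and data_rdd already exist; a minimal setup sketch (the dataset size and the local[2] core count are illustrative) is:

from time import time
from pyspark.sql import SparkSession

# Create a session with a fixed number of local cores (2 here; vary this to compare runs)
spark = SparkSession.builder.master("local[2]").appName("CoreAndCacheAnalysis").getOrCreate()
sc = spark.sparkContext

# Parallelize a range of numbers into an RDD
data_rdd = sc.parallelize(range(1, 5000001))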
# Without caching
start_time = time()
filtered_rdd = data_rdd.map(lambda x: x * 2).filter(lambda x: x % 3 == 0)
count_result = filtered_rdd.count()
sum_result = filtered_rdd.sum()
print("Without Caching -> Count:", count_result, ", Sum:", sum_result)
print("Time taken without caching:", time() - start_time, "seconds")
# With caching
start_time = time()
filtered_rdd = data_rdd.map(lambda x: x * 2).filter(lambda x: x % 3 == 0).cache()
count_result = filtered_rdd.count()
sum_result = filtered_rdd.sum()
print("With Caching -> Count:", count_result, ", Sum:", sum_result)
print("Time taken with caching:", time() - start_time, "seconds")
To analyze the effect of the number of cores, change the spark.executor.cores setting (e.g., from 2 to 4, 8, etc.) and observe the performance impact. The appropriate number of cores depends on the dataset size, the transformations, and the hardware resources.
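For example (a sketch; the core count and settings are illustrative), the session can be rebuilt with more cores and the timings above re-run:

# Stop the current session and start one with more local cores
spark.stop()
spark = (SparkSession.builder
    .master("local[4]")
    .config("spark.executor.cores", "4")
    .appName("CoreAndCacheAnalysis")
    .getOrCreate())
sc = spark.sparkContext
data_rdd = sc.parallelize(range(1, 5000001))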
Expected Observations
Increased Cores: As the number of cores increases, the time taken to perform the
transformations and actions should decrease due to higher parallelism, up to a certain point.
Caching Benefits: With caching, repeated actions (like count and sum in this example) on the
same data take less time because they access cached data rather than re-running
transformations.
Output:
Without Caching
With Caching