How to Configure Windows to Build a Project Containing Apache Spark Code Without Installing It?
Apache Spark is a unified analytics engine used to process large-scale data. It provides APIs to work with other programming languages such as Java, Python, and R, and it can easily be configured with an IDE to perform tasks as your requirements demand. It also supports higher-level tools such as Spark SQL for SQL, GraphX for graph processing, and MLlib for machine learning.
In this article, you will see how to configure Scala IDE to execute Apache Spark code, and how to set up a Spark project without explicitly installing Hadoop and Spark on your system. We will discuss each step in detail, cover the dependencies that need to be configured, and list the prerequisites for configuring Scala IDE. All steps are shown in Scala IDE, but they can be followed in any other IDE as well; if you want to configure this on Eclipse, the same steps apply there too.
Introduction
- Spark is an open-source distributed Big Data processing framework developed at the AMPLab of the University of California, Berkeley, in 2009.
- Spark was later donated to the Apache Software Foundation, which maintains it today.
- Spark is a general-purpose Big Data processing engine written in the Scala programming language.
- Initially, Hadoop offered only the MapReduce processing model, based on the Java programming language.
- Spark Framework supports Java, Scala, Python, and R Programming languages.
Depending on the type of data and the functionality required, Spark provides different APIs to process it:
- The basic building block of Spark is Spark Core.
- Spark provides Spark SQL for structured and semi-structured data analysis, based on DataFrames and Datasets.
- For streaming data, Spark has the Spark Streaming APIs.
- For implementing machine learning algorithms, Spark provides MLlib, a distributed machine learning framework.
- Graph data can be processed effectively with GraphX, a distributed graph processing framework.
Spark SQL, Spark Streaming, MLlib, and GraphX are all built on top of Spark Core and on its central concept, the RDD (Resilient Distributed Dataset). An RDD is an immutable, partitioned collection of data distributed across the nodes of the cluster (the Data Nodes, in a Hadoop deployment).
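To make the RDD concept concrete, here is a minimal, self-contained sketch (the class and application names are illustrative, not from the original article) that turns a local collection into an RDD and applies a transformation that leaves the source RDD unchanged:
Java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RDDBasics {
    public static void main(String[] args)
    {
        // A local master means no cluster or Hadoop installation is needed.
        SparkConf conf = new SparkConf()
                             .setAppName("RDD basics")
                             .setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // parallelize() turns a local collection into a partitioned RDD.
        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

        // filter() produces a new RDD; the original RDD is never
        // modified, illustrating immutability.
        long evens = numbers.filter(n -> n % 2 == 0).count();
        System.out.println("Even numbers: " + evens);

        sc.close();
    }
}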
Prerequisites:
- Java: make sure you have Java installed on your system.
- Scala IDE/Eclipse/IntelliJ IDEA: you can use any of these, whichever you are familiar with. Scala IDE is used here for reference.
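If you want to confirm the Java installation from code, here is a tiny sketch (the class name is just an example) that prints the JVM version:
Java
public class CheckJava {
    public static void main(String[] args)
    {
        // Spark 2.4.x requires Java 8; print the installed version to verify.
        System.out.println("Java version: " + System.getProperty("java.version"));
    }
}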
Step 1: Create a Maven Project
Creating a Maven project is very simple. Please follow the steps below to create the project.
- Click on File menu tab -> New -> Other
- Click on Maven Project. In this step, tick the "Create a simple project (skip archetype selection)" check box, then click on "Next >"
- Add Group Id and Artifact Id, then click on "Finish"
With this, you have successfully created a Java project with Maven. The next step is to add the required Spark dependencies.
Step 2: Adding required Spark dependency into pom.xml
Open the pom.xml file in your newly created Maven project and add the Spark dependencies below (spark-core and spark-sql). You can change the dependency versions to match your project's needs.
Note: The snippet adds version 2.4.0 of spark-core and spark-sql. Newer releases such as Spark 3.0.1 are also available; add the dependency matching the Spark version on your cluster and your project requirements. The _2.12 suffix in the artifact IDs is the Scala version the artifacts were built for, so keep it consistent across all Spark dependencies.
XML
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>java.spark</groupId>
    <artifactId>Spark-Learning</artifactId>
    <version>0.0.1-SNAPSHOT</version>

    <dependencies>
        <!-- paranamer 2.8 works around a known runtime error
             in Spark 2.4 on newer JDKs. -->
        <dependency>
            <groupId>com.thoughtworks.paranamer</groupId>
            <artifactId>paranamer</artifactId>
            <version>2.8</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.12</artifactId>
            <version>2.4.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.12</artifactId>
            <version>2.4.0</version>
        </dependency>
    </dependencies>
</project>
Step 3: Writing Sample Spark code
Now you are almost done. Create a package named spark.java inside your project, and inside that package create a Java class named SparkReadCSV. Hadoop does not need to be installed on the system: it is enough to download the winutils.exe file, place it in a bin folder under a directory of your choice (for example D:\Hadoop\bin), and point the Hadoop home directory property at that directory from your code:
System.setProperty("hadoop.home.dir", "D:\\Hadoop\\");
Create a D:\data\employee.txt file and add the dummy records below. The fields are | separated and the first line is the header:
id|name|address|salary
1|Bernard Norris|Amberloup|10172
2|Sebastian Russell|Delicias|18178
3|Uriel Webster|Faisalabad|16419
4|Clarke Huffman|Merritt|16850
5|Orson Travis|Oberursel|17435
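If you prefer to create the sample file programmatically instead of by hand, here is a small sketch using java.nio (the class name is illustrative; the D:\data path matches the one used in the Spark code below):
Java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class CreateSampleData {
    public static void main(String[] args) throws IOException
    {
        // Pipe-delimited records matching the schema the Spark job expects.
        List<String> lines = Arrays.asList(
            "id|name|address|salary",
            "1|Bernard Norris|Amberloup|10172",
            "2|Sebastian Russell|Delicias|18178",
            "3|Uriel Webster|Faisalabad|16419",
            "4|Clarke Huffman|Merritt|16850",
            "5|Orson Travis|Oberursel|17435");

        // Create the target folder if it is missing, then write the file.
        Files.createDirectories(Paths.get("D:\\data"));
        Files.write(Paths.get("D:\\data\\employee.txt"), lines);
    }
}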
Add the code below to the SparkReadCSV.java file. The comments are deliberately descriptive to aid understanding.
Java
package spark.java;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkReadCSV {
    public static void main(String[] args)
    {
        // Point hadoop.home.dir at the folder whose bin
        // directory contains winutils.exe.
        System.setProperty("hadoop.home.dir", "D:\\Hadoop\\");

        // Create a SparkSession to process the data.
        // builder() creates a SparkSession builder.
        // appName() sets a name for the application, which
        // is shown in the YARN/Spark web UI.
        // master() sets the Spark master URL, such as
        // "local" to run locally, "local[3]" to run locally
        // with 3 cores, or "yarn" to run on a YARN Hadoop
        // cluster.
        // getOrCreate() returns the session used to execute
        // the application.
        SparkSession spark = SparkSession
                                 .builder()
                                 .appName("***** Reading CSV file.*****")
                                 .master("local[3]")
                                 .getOrCreate();

        // Read the sample CSV file as a DataFrame.
        // option("header", true) indicates that the first
        // line of the input is a header row.
        // option("delimiter", "|") indicates that the
        // records are | separated.
        // csv() accepts an input path from either the local
        // file system or the Hadoop Distributed File System;
        // here we read from the local file system.
        Dataset<Row> employeeDS = spark
                                      .read()
                                      .option("header", true)
                                      .option("delimiter", "|")
                                      .csv("D:\\data\\employee.txt");

        // Display the records.
        employeeDS.show();

        // Release the session's resources.
        spark.stop();
    }
}
We have set up a Spark development environment in just a few easy steps. From this starting point, we can explore Spark further by solving different use cases.
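As one possible next step, the DataFrame read above can be queried directly. Here is a short sketch, assuming it is appended inside SparkReadCSV's main() right after employeeDS.show() and before spark.stop() (the static imports go at the top of the file; the column names come from the header row of employee.txt):
Java
// At the top of SparkReadCSV.java:
import static org.apache.spark.sql.functions.avg;
import static org.apache.spark.sql.functions.col;

// Inside main(), after employeeDS.show():
// salary is read as a string, so cast it before aggregating.
employeeDS
    .select(avg(col("salary").cast("int")).alias("avg_salary"))
    .show();

// Filter rows: employees earning more than 17000.
employeeDS
    .filter(col("salary").cast("int").gt(17000))
    .show();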