Apache Spark Installation and Programming Guide
This is a step-by-step guide to installing Apache Spark. Spark can run with several cluster managers, such as YARN, or on its own in local mode and standalone mode.
Standalone Deploy Mode
In this practical, you will configure Spark to run in standalone mode, where both the driver and the worker run on the same machine.
Since we use Java to write and run Spark programs, ensure that Java 8 is installed on every machine on which you will run Spark jobs.
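You can confirm the Java installation and version with the standard check below; it should report a 1.8.x release.
java -version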
To install Spark on the machine, download a prebuilt binary of Spark from the
http://spark.apache.org/downloads.html page.
Select the Spark release and package type you want from the options shown on that page.
You can also download Spark 1.6.1 directly by using the following command:
wget http://mirror.fibergrid.in/apache/spark/spark-1.6.1/spark-1.6.1-bin-hadoop2.4.tgz
Extract the Spark archive into the directory where you want to keep Spark:
tar xvf spark-1.6.1-bin-hadoop2.4.tgz -C /DeZyre
Make a soft link to the actual Spark directory (this will be helpful for any future version upgrade):
ln -s spark-1.6.1-bin-hadoop2.4 spark
Make an entry for Spark in the .bashrc file, replacing /mydirectory with the directory that contains the spark soft link (/DeZyre in this example):
export SPARK_HOME=/mydirectory/spark
export PATH=$SPARK_HOME/bin:$PATH
Source the updated .bashrc file with the following command:
source ~/.bashrc
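To confirm that the new environment variables are in effect, you can, for example, print SPARK_HOME:
echo $SPARK_HOME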
We have now configured Spark in standalone mode. To verify the setup, launch the Spark shell with the following command:
spark-shell
Inside the shell, check the Spark version with the following command:
sc.version
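If the installation succeeded, the version check in the shell should look roughly like the sketch below (the exact string depends on the distribution you downloaded):
scala> sc.version
res0: String = 1.6.1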
Writing a Program
Next, we will write a basic Java application to count the words in a file. Below is the source code for the
word count program in Apache Spark. The snippet imports the Spark classes it needs; you also need to
replace the placeholder paths with the input and output locations you want to use.
import java.util.Arrays;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

// Read the input file; sc is the JavaSparkContext created by the driver
// (replace the path with your input location).
JavaRDD<String> textFile = sc.textFile("hdfs://...");

// Split each line into individual words.
JavaRDD<String> words = textFile.flatMap(new FlatMapFunction<String, String>() {
    public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); }
});

// Map each word to a (word, 1) pair.
JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
    public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }
});

// Add up the counts for each word.
JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
    public Integer call(Integer a, Integer b) { return a + b; }
});

// Save the result (replace the path with your output location).
counts.saveAsTextFile("hdfs://...");
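For reference, here is a minimal sketch of how the snippet above could be packaged as a complete, runnable driver program. The class name JavaWordCount, the use of command-line arguments for the input and output paths, and the final sc.stop() call are illustrative choices for this sketch, not part of the original guide.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;

import scala.Tuple2;

// Hypothetical driver class wrapping the word count logic shown above.
public class JavaWordCount {
    public static void main(String[] args) {
        // Input and output paths are taken from the command line (an assumption made for this sketch).
        String inputPath = args[0];
        String outputPath = args[1];

        // Create the Spark context; the master URL is supplied by spark-submit.
        SparkConf conf = new SparkConf().setAppName("JavaWordCount");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Read the input file.
        JavaRDD<String> textFile = sc.textFile(inputPath);

        // Split each line into individual words.
        JavaRDD<String> words = textFile.flatMap(new FlatMapFunction<String, String>() {
            public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); }
        });

        // Map each word to a (word, 1) pair.
        JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
            public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }
        });

        // Add up the counts for each word.
        JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
            public Integer call(Integer a, Integer b) { return a + b; }
        });

        // Save the result and shut down the context.
        counts.saveAsTextFile(outputPath);
        sc.stop();
    }
}

Once the class is compiled into a JAR (for example with Maven or sbt), it can be submitted to Spark with a command along the lines of spark-submit --class JavaWordCount --master local[2] wordcount.jar <input path> <output path>, where the JAR name and master URL are placeholders for your own setup.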