Practical 11

Practical 11: Install Apache Spark and try basic commands.

Step 1: Verifying Java Installation


$ java -version

Step 2: Verifying Scala installation


You need the Scala language to implement Spark.
$ scala -version

If Scala is already installed on your system, you will see a response like the following:
Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL

In case you don’t have Scala installed on your system, proceed to the next step for Scala installation.

Step 3: Downloading Scala

Download the latest version of Scala by visiting the following link: https://www.scala-lang.org/download/.
After downloading, you will find the Scala tar file in the download folder.

Step 4: Installing Scala

Extract the Scala tar file


$ tar xvf scala-2.11.6.tgz

Move Scala software files


Use the following commands to move the Scala software files to the respective directory (/usr/local/scala).

$ su -
Password:
# cd /home/Hadoop/Downloads/
# mv scala-2.11.6 /usr/local/scala
# exit

Set PATH for Scala


$ export PATH=$PATH:/usr/local/scala/bin

Verifying Scala Installation

After installation, verify it with the following command:


$ scala -version
Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL

Step 5: Downloading Apache Spark


Download the latest version of Spark by visiting the following link: https://spark.apache.org/downloads.html.
After downloading it, you will find the Spark tar file in the download folder.

Step 6: Installing Spark

Extracting Spark tar


$ tar xvf spark-1.3.1-bin-hadoop2.6.tgz

Moving Spark software files

Use the following commands to move the Spark software files to the respective directory (/usr/local/spark).
$ su -
Password:

# cd /home/Hadoop/Downloads/
# mv spark-1.3.1-bin-hadoop2.6 /usr/local/spark
# exit

Setting up the environment for Spark


Add the following line to the ~/.bashrc file. This adds the location of the Spark software files to the PATH variable.

export PATH=$PATH:/usr/local/spark/bin

Use the following command to source the ~/.bashrc file.

$ source ~/.bashrc

Step 7: Verifying the Spark Installation

$ spark-shell

If Spark is installed successfully, you will see output similar to the following.

Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/09/23 10:50:14 INFO SecurityManager: Changing view acls to: hadoop
16/09/23 10:50:14 INFO SecurityManager: Changing modify acls to: hadoop
16/09/23 10:50:14 INFO SecurityManager: SecurityManager: authentication disabled;
ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions:
Set(hadoop)
16/09/23 10:50:15 INFO HttpServer: Starting HTTP Server
16/09/23 10:50:17 INFO Utils: Successfully started service 'HTTP class server' on port 43292.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.4.0
/_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
Spark context available as sc
scala>

Apache Spark Commands:

Apache Spark is a powerful open-source framework for big data processing and analytics. It provides an
interface for programming entire clusters with implicit data parallelism and fault tolerance.

Starting Spark Shell:


spark-shell

This launches the Spark shell in Scala, where you can execute Spark code interactively.
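For instance, any Scala expression typed at the prompt is evaluated immediately; a minimal illustration (the res0 name is assigned automatically by the REPL):

scala> 2 + 3
res0: Int = 5
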
Creating RDD (Resilient Distributed Dataset):
RDD is the fundamental data structure in Spark. You can create an RDD from various sources, such as text
files, HDFS, or by transforming an existing RDD.

Create a new RDD

a) Read File from local filesystem and create an RDD.

scala> val data = sc.textFile("data.txt")

Note: sc is the SparkContext object.

Note: You need to create a file named data.txt in the Spark home directory.

b) Create an RDD through Parallelized Collection

scala> val no = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

scala> val noData = sc.parallelize(no)

c) From Existing RDDs

scala> val newRDD = noData.map(data => (data * 2))

These are the three methods to create an RDD. The first method is used when the data is already available in an external system such as a local filesystem, HDFS, HBase, Cassandra, or S3; you create the RDD by calling the textFile method of SparkContext with a path or URL as the argument. The second approach works with existing in-memory collections, and the third creates a new RDD from an existing one.
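
As a combined sketch, here are the three methods in a single spark-shell session; this assumes a file named data.txt exists in the directory where the shell was started:

scala> val data = sc.textFile("data.txt")         // method 1: from an external file
scala> val no = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scala> val noData = sc.parallelize(no)            // method 2: from a local collection
scala> val newRDD = noData.map(x => x * 2)        // method 3: from an existing RDD
scala> newRDD.collect()                           // Array(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)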

Number of Items in the RDD


To count the number of items available in the RDD, we need to call an action:

scala> data.count()

Filter Operation
Filter the RDD and create a new RDD of the items that contain the word “DataFlair”. To filter, we need to call the filter transformation, which returns a new RDD with a subset of the items.

scala> val DFData = data.filter(line => line.contains("DataFlair"))

Transformation and Action together


For complex requirements, we can chain multiple operations together, such as the filter transformation and the count action:

scala> data.filter(line => line.contains("DataFlair")).count()
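
Chaining generalizes to longer pipelines; for example, a sketch that lower-cases each line before filtering (assuming the same data RDD as above):

scala> data.map(line => line.toLowerCase).filter(line => line.contains("dataflair")).count()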

Read the first item from the RDD


To read the first item from the file, you can use the following command:

scala> data.first()

Read the first 5 items from the RDD


To read the first 5 items from the file, you can use the following command:

scala> data.take(5)

RDD Partitions

An RDD is made up of multiple partitions. To count the number of partitions:

scala> data.partitions.length

Note: By default, the minimum number of partitions in an RDD is 2. When we create an RDD from an HDFS file, the number of partitions equals the number of HDFS blocks.
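
You can also control the partition count yourself: textFile accepts an optional minimum-partitions argument, and repartition reshuffles an existing RDD. A brief sketch, again assuming data.txt exists:

scala> val data4 = sc.textFile("data.txt", 4)   // request at least 4 partitions
scala> data4.partitions.length
scala> val data8 = data4.repartition(8)         // reshuffle into 8 partitions
scala> data8.partitions.length                  // 8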

Cache the file


Caching is an optimization technique. Once we cache the RDD in memory, all future computations will work on the in-memory data, which saves disk seeks and improves performance.

scala> data.cache()

The RDD will not be cached as soon as you run the above operation; if you visit the web UI at http://localhost:4040/storage, it will be blank. RDDs are not cached explicitly when we call cache(); rather, an RDD is cached once we run an action that actually needs to read the data from disk.

Let’s run some actions


scala> data.count()
scala> data.collect()
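
To confirm that the cache took effect, you can inspect the RDD’s storage level; after data.cache() has been called it should report MEMORY_ONLY (the exact printed form varies by Spark version):

scala> data.getStorageLevel     // MEMORY_ONLY once cache() has been called
scala> data.count()             // this run is served from memory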

Read Data from HDFS file


To read data from an HDFS file, we can specify the complete HDFS URL, such as hdfs://IP:PORT/PATH.

scala> val hFile = sc.textFile("hdfs://localhost:9000/inp")
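
Once the HDFS-backed RDD exists, the same transformations and actions apply. A short sketch (the path hdfs://localhost:9000/inp comes from the example above and must already exist in your HDFS):

scala> hFile.count()                                       // number of lines in the HDFS file
scala> val words = hFile.flatMap(line => line.split(" "))  // split each line into words
scala> words.map(w => (w, 1)).reduceByKey(_ + _).take(5)   // word count: first 5 (word, count) pairs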
