Practical 11
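To verify whether Scala is already installed, try the following command:
$ scala -version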
If Scala is already installed on your system, you will see a response like the following:
Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL
In case you don't have Scala installed on your system, proceed to the next step for Scala installation.
Download the latest version of Scala by visiting the following link: https://www.scala-lang.org/download/.
After downloading, you will find the Scala tar file in the download folder.
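Extract the tar file with the following command (the archive name here is an assumption, based on the version 2.11.6 used below):
$ tar xvf scala-2.11.6.tgz
Then use the following commands to move the Scala software files to the respective directory (/usr/local/scala).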
$ su -
Password:
# cd /home/Hadoop/Downloads/
# mv scala-2.11.6 /usr/local/scala
# exit
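Next, download Spark (this practical uses spark-1.3.1-bin-hadoop2.6) from https://spark.apache.org/downloads.html and extract it; the exact archive name is an assumption based on the directory name used below:
$ tar xvf spark-1.3.1-bin-hadoop2.6.tgz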
Use the following commands to move the Spark software files to the respective directory (/usr/local/spark).
$ su -
Password:
# cd /home/Hadoop/Downloads/
# mv spark-1.3.1-bin-hadoop2.6 /usr/local/spark
# exit
Add the following line to the ~/.bashrc file so that the Spark binaries are on the PATH:
export PATH=$PATH:/usr/local/spark/bin
Then source the updated ~/.bashrc file:
$ source ~/.bashrc
Open the Spark shell with the following command:
$ spark-shell
If Spark is installed successfully, you will see the following output.
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/09/23 10:50:14 INFO SecurityManager: Changing view acls to: hadoop
16/09/23 10:50:14 INFO SecurityManager: Changing modify acls to: hadoop
16/09/23 10:50:14 INFO SecurityManager: SecurityManager: authentication disabled;
ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions:
Set(hadoop)
16/09/23 10:50:15 INFO HttpServer: Starting HTTP Server
16/09/23 10:50:17 INFO Utils: Successfully started service 'HTTP class server' on port 43292.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.4.0
      /_/
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
Spark context available as sc
scala>
Apache Spark is a powerful open-source framework for big data processing and analytics. It provides an
interface for programming entire clusters with implicit data parallelism and fault tolerance.
This launches the Spark shell in Scala, where you can execute Spark code interactively.
Creating an RDD (Resilient Distributed Dataset):
RDD is the fundamental data structure in Spark. You can create an RDD from various sources, such as text
files, HDFS, or by transforming an existing RDD.
There are three methods to create an RDD. The first method is used when the data is already available in an external system such as a local filesystem, HDFS, HBase, Cassandra, or S3: an RDD is created by calling the textFile method of the Spark context with a path/URL as the argument. The second approach (parallelize) works with existing collections, and the third creates a new RDD by transforming an existing one.
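As a sketch of the first two methods, assuming a text file named input.txt exists in the current directory (the file name is an illustration; the variable data is used by the commands that follow):
scala> val data = sc.textFile("input.txt")
scala> val nums = sc.parallelize(List(1, 2, 3, 4, 5))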
Count the number of items in the RDD with the count() action:
scala> data.count()
Filter Operation
Filter the RDD and create a new RDD of the items that contain the word "DataFlair". To filter, we call the filter transformation, which returns a new RDD containing the subset of items that satisfy the given predicate, as shown in the sketch below.
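A minimal sketch of this, continuing with the data RDD created above (the variable name DFData is an illustration):
scala> val DFData = data.filter(line => line.contains("DataFlair"))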
Read the first item from the RDD:
scala> data.first()
Read the first five items from the RDD:
scala> data.take(5)
RDD Partitions
An RDD is made up of multiple partitions. To count the number of partitions:
scala> data.partitions.length
Note: By default, the minimum number of partitions in an RDD is 2. When an RDD is created from a file in HDFS, the number of partitions equals the number of HDFS blocks.
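As a sketch, a larger minimum number of partitions can be requested by passing a second argument to textFile (continuing with the assumed input.txt):
scala> val data4 = sc.textFile("input.txt", 4)
scala> data4.partitions.length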
Caching is an optimization technique. To persist the RDD in memory, call the cache() method:
scala> data.cache()
The RDD will not be cached as soon as you run the above operation; if you visit the web UI at http://localhost:4040/storage, it will be blank. RDDs are not cached eagerly when cache() is called; rather, the RDD is cached the first time an action runs that actually needs to read the data from disk.
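For example, running an action such as count() after cache() populates the cache; once it completes, the RDD appears on the storage page of the web UI:
scala> data.count()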