
NAME: WABLE SNEHAL MAHESH

SUBJECT:- SCALA & SPARK

DIV :- MBA II

ROLL NO :- 57

GUIDANCE NAME :- PROF. ARCHANA SURYAWANSHI – KADAM


Assignment 1
Scala in other frameworks; Introduction to the Scala REPL.
Answer :-
Scala REPL
The Scala REPL is an interactive command-line interpreter shell, where REPL stands for
Read-Evaluate-Print-Loop. It works exactly as the name suggests: it first Reads an expression
typed at the Scala prompt, then Evaluates that expression, Prints the expression’s outcome on
the screen, and is then ready to Read again, so the cycle continues in a Loop. Previous results
are automatically imported into the scope of the current expression as required. In interactive
mode, the REPL reads expressions at the prompt, wraps them into an executable template,
and then compiles and executes the result.

Implementation Of REPL

 User code can be wrapped in either an object or a class; the switch used is -Yrepl-class-based.
 Each line of input is compiled separately.
 Dependencies on previous lines are included by automatically generated imports.
 The implicit import of scala.Predef can be controlled by giving an explicit import.

We can start the Scala REPL by typing the scala command in a console/terminal.

$ scala

Let’s understand how we can add two variables using the Scala REPL.
In the first two lines we initialize two variables, and the REPL prints each of them back; we
can see that internally it creates two variables of type Int with their values. We then evaluate
a sum expression using the two defined variables, and the REPL prints the result of the
expression on the screen. Since the result is not assigned to any variable, the REPL shows it
under a temporary variable with the prefix res. We can use these res variables just like
variables we created ourselves.
We can get more information about these temporary variables by calling the getClass
function on them, as in the sketch below.
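
A minimal sketch of such a session (the variable names a and b, their values, and the exact
res numbering are illustrative; the output shown is typical of a Scala 2.x REPL and may
differ slightly between versions):

scala> val a = 5
a: Int = 5

scala> val b = 7
b: Int = 7

scala> a + b
res0: Int = 12

scala> res0.getClass
res1: Class[Int] = int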
We can run many quick experiments like this in the Scala REPL at run time, which would be
time consuming if we were using an IDE. The REPL can also list all the member suggestions
that can be applied to a variable when we press the TAB key.

Some More Important Features of REPL

 The REPL’s IMain is bound to $intp.

 The tab key is used for completion.

 lastException binds the REPL’s last exception.

 :load is used to load a REPL input file.

 :javap is used to inspect class artifacts.

 -Yrepl-outdir is used to inspect class artifacts with external tools.

 :power enters power mode and imports compiler components.

 :help lists the commands available to the user.
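
For example, a few of these commands can be tried directly at the prompt; in this hedged
sketch the script path and the class name are only placeholders:

scala> :help
scala> :load /path/to/Script.scala
scala> :javap -p SomeClass

Here :help prints the full command list, :load compiles and evaluates the given file as REPL
input, and :javap shows class-file details for the named class (SomeClass must already be
defined in the session).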


Assignment 2
Spark Ecosystem, Modes of Spark, Spark installation demo.

Answer :-
Apache Spark is a general-purpose cluster computing system. It provides high-level APIs in
Java, Scala, Python, and R. Spark provides an optimized engine that supports general
execution graphs. It also has rich high-level tools for structured data processing, machine
learning, graph processing and streaming. Spark can either run alone or on an existing
cluster manager.
Spark ecosystem
The Apache Spark ecosystem is made up of 6 components which empower Apache Spark:
Spark Core, Spark SQL, Spark Streaming, Spark MLlib, Spark GraphX, and SparkR.

Modes of Spark

Cluster Mode
In cluster mode, the Spark driver (the application master) is started on one of the worker
machines. The client that submits the application can therefore go away after initiating the
application, or continue with some other work; cluster mode works on the concept of fire
and forget.

The question is: when should cluster mode be used? If we submit an application from a
machine that is far from the worker machines, for instance submitting from our laptop, it is
common to use cluster mode to minimize network latency between the driver and the
executors. Likewise, if the job is going to run for a long period of time and we don’t want to
wait for the result, we can submit the job in cluster mode; once the job is submitted, the
client does not need to stay online.

How to submit a Spark application in cluster mode

First, go to your Spark installation directory and start a master and any number of workers
on the cluster using the following commands:

./sbin/start-master.sh

./sbin/start-slave.sh spark://<<hostname/ipaddress>>:portnumber   # worker 1

./sbin/start-slave.sh spark://<<hostname/ipaddress>>:portnumber   # worker 2

Then, run the command:

./bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://<<hostname/ipaddress>>:portnumber --deploy-mode cluster ./examples/jars/spark-examples_2.11-2.3.1.jar 5

Here 5 is the number of partitions passed to the example.

NOTE: Your class name, jar file and partition number could be different.

Client Mode
In client mode, the client that submits the Spark application starts the driver, and the driver
maintains the Spark context. Until the job execution is over, the tasks are managed by the
driver, so the client has to stay in touch with the cluster and remain online until that
particular job gets completed.

In this mode, the client keeps receiving information about the status of the job and the
changes happening to it, so if we want to keep monitoring a particular job we can submit it
in client mode. In this mode, the entire application depends on the local machine, since the
driver resides there: if anything goes wrong on the local machine, the driver goes down and
the entire application goes down with it. Hence this mode is not suitable for production use
cases. However, it is good for debugging or testing, since we can see the outputs on the
driver terminal, which is the local machine.

How to submit a Spark application in client mode?

First, go to your Spark installation directory and start a master and any number of workers
on the cluster; the commands are mentioned above in cluster mode. Then run the following
command:

./bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://<<hostname/ipaddress>>:portnumber --deploy-mode client ./examples/jars/spark-examples_2.11-2.3.1.jar 5

Again, 5 is the number of partitions. The only change required compared with cluster mode
is the --deploy-mode option, which is client in client mode and cluster in cluster mode.

Apache Spark installation Demo :-


Install Apache Spark on Windows
Step 1: Install Java 8. Apache Spark requires Java 8.
Step 2: Install Python.
Step 3: Download Apache Spark.
Step 4: Verify the Spark software file.
Step 5: Install Apache Spark.
Step 6: Add the winutils.exe file.
Step 7: Configure environment variables.
Step 8: Launch Spark.

This is a step-by-step guide to installing Apache Spark. Spark can be configured with
multiple cluster managers like YARN, Mesos etc. Along with that, it can be configured in
local mode and standalone mode.

 Standalone Deploy Mode

o Simplest way to deploy Spark on a private cluster. Both driver and worker nodes run on the same machine.

 Amazon EC2

o EC2 scripts are available.

o Very quick for launching a new cluster.

 Apache Mesos

o The driver runs on the master.

o Worker nodes run on separate machines.

 Hadoop YARN

o The underlying storage is HDFS.

o The driver runs inside an application master process which is managed by YARN on the cluster.

o Worker nodes run on each datanode.
Standalone mode is a good choice for developing applications in Spark. Spark processes run
in the JVM, so Java should be pre-installed on the machines on which we have to run Spark
jobs. Let’s install Java before we configure Spark.

Assignment 3

Understanding concept of data frame, Loading data in data frame, Operations on data frames

Answer :-

A) Concept of a data frame:


A DataFrame is the most common Structured API and simply represents a table of data
with rows and columns. The list of columns and the types in those columns is called the
schema. A simple analogy would be a spreadsheet with named columns. The fundamental
difference is that while a spreadsheet sits on one computer in one specific location, a Spark
DataFrame can span thousands of computers. The reason for putting the data on more than
one computer should be intuitive: either the data is too large to fit on one machine or it
would simply take too long to perform the computation on one machine.
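
Since the subject here is Scala & Spark, a minimal Scala sketch of creating and inspecting a
Spark DataFrame could look like the following; the file name people.csv, the option values
and the local master setting are illustrative, not part of the assignment:

import org.apache.spark.sql.SparkSession

object DataFrameConcept {
  def main(args: Array[String]): Unit = {
    // Entry point for the DataFrame API; local[*] runs Spark on the local machine
    val spark = SparkSession.builder()
      .appName("DataFrameConcept")
      .master("local[*]")
      .getOrCreate()

    // Load a CSV file into a DataFrame; header and schema inference are optional settings
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("people.csv")   // illustrative file name

    df.printSchema()   // column names and types, i.e. the schema
    df.show(5)         // preview the first 5 rows

    spark.stop()
  }
}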

A Pandas DataFrame is a 2-dimensional data structure, like a 2-dimensional array, or a table
with rows and columns.

In R, a data frame is used for storing data tables. It is a list of vectors of equal length.
For example, the following variable df is a data frame containing three vectors n, s, b:

> n = c(2, 3, 5)
> s = c("aa", "bb", "cc")
> b = c(TRUE, FALSE, TRUE)
> df = data.frame(n, s, b)

Data frames are widely used in data science, machine learning, scientific computing, and
many other data-intensive fields.

Retrieving Labels and Data

 Retrieve and modify row and column labels as sequences.
 Represent data as NumPy arrays.
 Check and adjust the data types.
 Analyze the size of DataFrame objects.

B) Loading data into a data frame:

Load CSV files into Python Pandas

# Load the Pandas library with alias 'pd'
import pandas as pd
# Read data from file 'filename.csv'
# (in the same directory that your python process is based)
# Control delimiters, rows, column names with read_csv (see later)
data = pd.read_csv("filename.csv")
# Preview the first 5 lines of the loaded data
data.head()

Importing a dataset into a DataFrame

Using the read_csv() function from the pandas package, you can import tabular data from
CSV files into a pandas DataFrame by specifying a parameter value for the file name
(e.g. pd.read_csv("filename.csv")). Remember that you gave pandas an alias (pd), so you
will use pd to call pandas functions.

Steps to import a CSV file into Python using Pandas
Step 1: Capture the file path. Firstly, capture the full path where your CSV file is stored.
Step 2: Apply the Python code.
Step 3: Run the code.
Optional step: Select a subset of columns.

Load Data Via the RStudio Menu Items

1. Text File or Web URL. As you can see in both of the "Import Dataset" menu items, you
can import a data set "From Text File" or "From Web URL".
2. Selecting the data format.
3. After the data is loaded.
4. Reading the file with a read function such as read.csv().
5. More read options.
6. Assigning the data set to a variable.
C) Operations on data frames.

Operations that can be performed on a DataFrame are:


 Creating a DataFrame.
 Accessing rows and columns.
 Selecting a subset of the data frame.
 Editing data frames.
 Adding extra rows and columns to the data frame.
 Adding new variables to the data frame based on existing ones.
 Deleting rows and columns in a data frame.
DataFrame Operations in R

DataFrames are generic data objects of R which are used to store tabular data. Data frames
are considered the most popular data objects in R programming because it is more
comfortable to analyze data in tabular form. Data frames can also be thought of as matrices
where each column can be of a different data type. A DataFrame is made up of three
principal components: the data, the rows, and the columns.
Creating a DataFrame:
In the real world, a DataFrame is usually created by loading a dataset from existing storage
such as a SQL database, a CSV file, or an Excel file. A DataFrame can also be created from
vectors in R. Creating a data frame using vectors: to create a data frame we use the
data.frame() function in R and pass each of the vectors you have created as arguments to
the function (a Spark equivalent is sketched below).
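
Since this course is Scala & Spark, a comparable construction in Spark builds a DataFrame
from a local collection with toDF. This is only a sketch, written for spark-shell (where a
SparkSession named spark already exists); the column names n, s, b mirror the R example
above:

// In spark-shell a SparkSession named spark is already available
import spark.implicits._   // enables toDF on local collections

// Each tuple combines one value from the "vectors" n, s and b of the R example
val df = Seq(
  (2, "aa", true),
  (3, "bb", false),
  (5, "cc", true)
).toDF("n", "s", "b")

df.show()   // displays the three rows with columns n, s, b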

Operations performed on a series or data frame:

We can perform basic operations on rows/columns like selecting, deleting, adding, and
renaming. Column selection: in order to select a column in a Pandas DataFrame, we can
access the column by calling it by its column name. The same operations on a Spark
DataFrame are sketched below.
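
As a hedged Scala sketch, the basic row/column operations look like this on a Spark
DataFrame, reusing the df with columns n, s, b built above:

val selected = df.select("n", "s")                  // column selection
val withDouble = df.withColumn("n2", df("n") * 2)   // add a new column based on an existing one
val renamed = df.withColumnRenamed("s", "label")    // rename a column
val dropped = df.drop("b")                          // delete a column
val filtered = df.filter(df("n") > 2)               // select a subset of rows

filtered.show()   // shows only the rows where n is greater than 2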

Data frame creation and operations in R, explained with an example:
A data frame is a table or a two-dimensional array-like structure in which each column
contains values of one variable and each row contains one set of values from each column.
The following are characteristics of a data frame: the column names should be non-empty,
and the row names should be unique.
