Name: Wable Snehal Mahesh   Subject: Scala & Spark   Div: MBA II   Roll No: 57   Guide Name: Prof. Archana Suryawanshi - Kadam
Implementation Of REPL
User code is wrapped in either an object or a class; the switch that controls this is -Yrepl-class-based.
Each line of input is compiled separately.
Dependencies on previous lines are made available through automatically generated imports.
The implicit import of scala.Predef can be controlled by entering an explicit import.
$ scala
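The class-based wrapping mentioned above is chosen when the REPL is launched; a minimal sketch of the two launch modes (the flag is the one named above):

$ scala                       # default: each input line is wrapped in a synthetic object
$ scala -Yrepl-class-based    # each input line is wrapped in a class instead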
Let’s understand how we can add two variables using the Scala REPL.
In the first line we initialize two variables, and the REPL prints them back; internally it creates two variables of type Int with the given values. We then evaluate an expression that adds the two variables, and the REPL prints the result of the expression on screen. Because the result is not assigned to a variable, the REPL binds it to a temporary variable with the prefix res. We can use these res variables just like variables we created ourselves, for example:
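A minimal REPL session illustrating this (the variable names and values are illustrative, and the echoed format can vary slightly between Scala versions):

scala> val a = 10
a: Int = 10

scala> val b = 20
b: Int = 20

scala> a + b
res0: Int = 30

scala> res0 + 5
res1: Int = 35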
We can get more information about these temporary variables by calling the getClass method on them, like below.
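Continuing the illustrative session above (the exact rendering depends on the Scala version):

scala> res0.getClass
res2: Class[Int] = int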
We can run many quick experiments like this in the Scala REPL at run time, which would be time-consuming if we were using an IDE. In the Scala 2 REPL we can also list all the functions that can be applied to a variable by pressing the TAB key.
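For example, pressing TAB after the variable name and a dot lists the available members (the exact suggestions depend on the variable's type and the Scala version):

scala> res0.<TAB>
!=   ==   +   -   toDouble   toFloat   toLong   toString   ...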
Answer :-
Apache Spark is a general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python, and R. Spark provides an optimized engine that supports general execution graphs. It also offers rich high-level tools for structured data processing, machine learning, graph processing, and streaming. Spark can either run on its own or on an existing cluster manager.
Spark ecosystem
The Apache Spark ecosystem consists of six components that empower Apache Spark: Spark Core, Spark SQL, Spark Streaming, Spark MLlib, Spark GraphX, and SparkR.
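As a minimal sketch of how these components are reached from Scala (the application name and the local master setting below are illustrative), Spark SQL and the DataFrame API are entered through a SparkSession, with Spark Core executing the work underneath:

import org.apache.spark.sql.SparkSession

// Entry point to the DataFrame / Spark SQL API; Spark Core runs the jobs underneath.
val spark = SparkSession.builder()
  .appName("EcosystemSketch")
  .master("local[*]")        // local mode; on a cluster this would point at the cluster manager
  .getOrCreate()

val df = spark.range(1, 6)                    // a small Dataset of the numbers 1 to 5
df.selectExpr("sum(id) as total").show()      // a simple Spark SQL aggregation

spark.stop()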
Modes of Spark
Cluster Mode
In cluster mode, the Spark driver (the application master) is started on one of the worker machines. The client that submits the application can therefore go away after initiating it, or continue with other work, so this mode works on a fire-and-forget basis.
./sbin/start-master.sh
./sbin/start-slave.sh spark://<hostname/ip address>:<port number>    # on worker1
./sbin/start-slave.sh spark://<hostname/ip address>:<port number>    # on worker2
NOTE: Your class name, Jar File and partition number could be different.
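A typical cluster-mode submission of the kind referenced in the note might look like the following sketch (the class name, master URL, jar path, and partition count are placeholders to be replaced with your own values):

./bin/spark-submit \
  --class <your.main.Class> \
  --master spark://<hostname/ip address>:<port number> \
  --deploy-mode cluster \
  <path/to/your-application.jar> <number of partitions>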
Client Mode
In client mode, the client that submits the Spark application starts the driver, and the driver maintains the Spark context. Until that particular job finishes, the driver manages the tasks, so the client must stay in touch with the cluster: it has to remain online until the job completes.
In this mode the client keeps receiving information about the status of the job and the changes happening to it, so if we want to keep monitoring a particular job, we can submit it in client mode. The entire application depends on the local machine, since the driver resides there; if anything goes wrong on the local machine, the driver goes down and the whole application goes down with it. Hence this mode is not suitable for production use cases. However, it is good for debugging or testing, since the outputs appear on the driver terminal, which is the local machine.
Switching between the two requires only a change to the deploy-mode setting, which is client in client mode and cluster in cluster mode, as in the sketch below.
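A short sketch of the same submission in both modes (everything other than the --deploy-mode value is a placeholder):

./bin/spark-submit --master spark://<hostname/ip address>:<port number> --deploy-mode client  --class <your.main.Class> <path/to/your-application.jar>
./bin/spark-submit --master spark://<hostname/ip address>:<port number> --deploy-mode cluster --class <your.main.Class> <path/to/your-application.jar>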
Installing Apache Spark is a step-by-step process. Spark can be configured with multiple cluster managers like YARN and Mesos; along with that, it can be configured in local mode and standalone mode. Supported deployment environments include:
Amazon EC2
Apache Mesos
Hadoop YARN
Understanding the concept of data frames, loading data into data frames, and operations on data frames
Answer :-
Loading a dataset in Python
Steps to import a CSV file into Python using Pandas (a Spark/Scala parallel is sketched after these steps):
Step 1: Capture the file path. First, capture the full path where your CSV file is stored.
Step 2: Apply the Python code.
Step 3: Run the code.
Optional step: Select a subset of columns.
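Since the subject of this course is Scala and Spark, the same kind of CSV load can be sketched with Spark's Scala DataFrame API (the file path, option values, and column names below are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("CsvLoad")
  .master("local[*]")
  .getOrCreate()

// Read the CSV file into a DataFrame; header and inferSchema are common options.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/path/to/your/file.csv")

df.printSchema()                          // inspect the inferred schema
df.show(5)                                // look at the first five rows
val subset = df.select("col1", "col2")    // optional: select a subset of columns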
DataFrames are generic data objects of R which are used to store tabular data. Data frames are considered to be the most popular data objects in R programming because it is more comfortable to analyze data in tabular form. Data frames can also be thought of as matrices where each column of the matrix can be of a different data type. A DataFrame is made up of three principal components: the data, the rows, and the columns.
Creating a DataFrame:
In the real world, a DataFrame is usually created by loading a dataset from existing storage, such as a SQL database, a CSV file, or an Excel file. A DataFrame can also be created from vectors in R. The following are some of the ways a DataFrame can be created:
Creating a data frame using vectors: to create a data frame we use the data.frame() function in R, passing each of the vectors you have created as arguments to the function. A Spark/Scala analogue is sketched below.
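By analogy, in the Scala/Spark setting of this course a DataFrame can be built from in-memory collections, much as R builds one from vectors (the column names and values below are illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("DfFromCollections")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._                  // enables toDF on local collections

// Two "vectors" (Scala sequences) zipped into rows and converted to a DataFrame.
val names  = Seq("Asha", "Ravi", "Meera")
val scores = Seq(85, 92, 78)
val df = names.zip(scores).toDF("name", "score")

df.show()                                 // prints the three rows as a small table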