
NAME: WABLE SNEHAL MAHESH

SUBJECT:- SCALA & SPARK

DIV :- MBA II

ROLL NO :- 57

GUIDANCE NAME :- PROF. ARCHANA SURYAWANSHI – KADAM


Assignment 1
Scala in other frameworks; Introduction to the Scala REPL.
Answer :-
Scala REPL
The Scala REPL is an interactive command-line interpreter shell, where REPL stands for
Read-Evaluate-Print-Loop. It works exactly as the name suggests: it first Reads an expression
typed at the Scala prompt, then Evaluates that expression, Prints the expression’s outcome on
the screen, and is then ready to Read again, so the cycle continues in a Loop. Previous results
are automatically imported into the scope of the current expression as required. In interactive
mode, the REPL reads expressions at the prompt, wraps them into an executable template,
and then compiles and executes the result.

Implementation Of REPL

 User code can be wrapped in either an object or a class; the switch used is -Yrepl-class-based.
 Each line of input is compiled separately.
 Dependencies on previous lines are included by automatically generated imports.
 The implicit import of scala.Predef can be controlled by giving an explicit import.

We can start the Scala REPL by typing the scala command in a console/terminal.

$ scala

Let’s understand how we can add two variables using the Scala REPL.
In the first two lines we initialize two variables, and the REPL prints each of them back; we
can see that internally it creates two variables of type Int with their values. We then evaluate
a sum expression using the two defined variables, and the REPL prints the result of the
expression on the screen. Since the result is not assigned to any variable, the REPL shows it
under a temporary variable with the prefix res. We can use these res variables just like
variables we created ourselves.
We can get more information about these temporary variables by calling the getClass
function on them, as in the sketch below.
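
A minimal sketch of such a session (the variable names a and b, their values, and the exact
res numbering are illustrative; the output shown is typical of a Scala 2.x REPL and may
differ slightly between versions):

scala> val a = 5
a: Int = 5

scala> val b = 7
b: Int = 7

scala> a + b
res0: Int = 12

scala> res0.getClass
res1: Class[Int] = int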
We can run many quick experiments like this in the Scala REPL at run time, which would be
time consuming if we were using an IDE. The REPL can also list all the member suggestions
that can be applied to a variable when we press the TAB key.

Some More Important Features of REPL

 The REPL’s IMain is bound to $intp.

 The tab key is used for completion.

 lastException binds the REPL’s last exception.

 :load is used to load a REPL input file.

 :javap is used to inspect class artifacts.

 -Yrepl-outdir is used to inspect class artifacts with external tools.

 :power enters power mode and imports compiler components.

 :help lists the commands available to the user.
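
For example, a few of these commands can be tried directly at the prompt; in this hedged
sketch the script path and the class name are only placeholders:

scala> :help
scala> :load /path/to/Script.scala
scala> :javap -p SomeClass

Here :help prints the full command list, :load compiles and evaluates the given file as REPL
input, and :javap shows class-file details for the named class (SomeClass must already be
defined in the session).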


Assignment 2
Spark Ecosystem, Modes of Spark, Spark installation demo.

Answer :-
Apache Spark is a general-purpose cluster computing system. It provides high-level APIs in
Java, Scala, Python, and R. Spark provides an optimized engine that supports general
execution graphs. It also has rich high-level tools for structured data processing, machine
learning, graph processing and streaming. Spark can either run alone or on an existing
cluster manager.
Spark ecosystem
The Apache Spark ecosystem is made up of 6 components which empower Apache Spark:
Spark Core, Spark SQL, Spark Streaming, Spark MLlib, Spark GraphX, and SparkR.

Modes of Spark

Cluster Mode
In cluster mode, the Spark driver (the application master) is started on one of the worker
machines. The client that submits the application can therefore go away after initiating the
application, or continue with some other work; cluster mode works on the concept of fire
and forget.

The question is: when should cluster mode be used? If we submit an application from a
machine that is far from the worker machines, for instance submitting from our laptop, it is
common to use cluster mode to minimize network latency between the driver and the
executors. Likewise, if the job is going to run for a long period of time and we don’t want to
wait for the result, we can submit the job in cluster mode; once the job is submitted, the
client does not need to stay online.

How to submit a Spark application in cluster mode

First, go to your Spark installation directory and start a master and any number of workers
on the cluster using the following commands:

./sbin/start-master.sh

./sbin/start-slave.sh spark://<<hostname/ipaddress>>:portnumber   # worker 1

./sbin/start-slave.sh spark://<<hostname/ipaddress>>:portnumber   # worker 2

Then, run the command:

./bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://<<hostname/ipaddress>>:portnumber --deploy-mode cluster ./examples/jars/spark-examples_2.11-2.3.1.jar 5

Here 5 is the number of partitions passed to the example.

NOTE: Your class name, jar file and partition number could be different.

Client Mode
In client mode, the client that submits the Spark application starts the driver, and the driver
maintains the Spark context. Until the job execution is over, the tasks are managed by the
driver, so the client has to stay in touch with the cluster and remain online until that
particular job gets completed.

In this mode, the client keeps receiving information about the status of the job and the
changes happening to it, so if we want to keep monitoring a particular job we can submit it
in client mode. In this mode, the entire application depends on the local machine, since the
driver resides there: if anything goes wrong on the local machine, the driver goes down and
the entire application goes down with it. Hence this mode is not suitable for production use
cases. However, it is good for debugging or testing, since we can see the outputs on the
driver terminal, which is the local machine.

How to submit a Spark application in client mode?

First, go to your Spark installation directory and start a master and any number of workers
on the cluster; the commands are mentioned above in cluster mode. Then run the following
command:

./bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://<<hostname/ipaddress>>:portnumber --deploy-mode client ./examples/jars/spark-examples_2.11-2.3.1.jar 5

Again, 5 is the number of partitions. The only change required compared with cluster mode
is the --deploy-mode option, which is client in client mode and cluster in cluster mode.

Apache Spark installation Demo :-


Install Apache Spark on Windows
Step 1: Install Java 8. Apache Spark requires Java 8.
Step 2: Install Python.
Step 3: Download Apache Spark.
Step 4: Verify the Spark software file.
Step 5: Install Apache Spark.
Step 6: Add the winutils.exe file.
Step 7: Configure environment variables.
Step 8: Launch Spark.

This is a step-by-step guide to installing Apache Spark. Spark can be configured with
multiple cluster managers like YARN, Mesos etc. Along with that, it can be configured in
local mode and standalone mode.

 Standalone Deploy Mode

o Simplest way to deploy Spark on a private cluster. Both driver and worker nodes run on the same machine.

 Amazon EC2

o EC2 scripts are available.

o Very quick for launching a new cluster.

 Apache Mesos

o The driver runs on the master.

o Worker nodes run on separate machines.

 Hadoop YARN

o The underlying storage is HDFS.

o The driver runs inside an application master process which is managed by YARN on the cluster.

o Worker nodes run on each datanode.
Standalone mode is a good choice for developing applications in Spark. Spark processes run
in the JVM, so Java should be pre-installed on the machines on which we have to run Spark
jobs. Let’s install Java before we configure Spark.

Assignment 3

Understanding concept of data frame, Loading data in data frame, Operations on data frames

Answer :-

A) Concept of a data frame:


A DataFrame is the most common Structured API and simply represents a table of data
with rows and columns. The list of columns and the types in those columns is called the
schema. A simple analogy would be a spreadsheet with named columns. The fundamental
difference is that while a spreadsheet sits on one computer in one specific location, a Spark
DataFrame can span thousands of computers. The reason for putting the data on more than
one computer should be intuitive: either the data is too large to fit on one machine or it
would simply take too long to perform the computation on one machine.
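
Since the subject here is Scala & Spark, a minimal Scala sketch of creating and inspecting a
Spark DataFrame could look like the following; the file name people.csv, the option values
and the local master setting are illustrative, not part of the assignment:

import org.apache.spark.sql.SparkSession

object DataFrameConcept {
  def main(args: Array[String]): Unit = {
    // Entry point for the DataFrame API; local[*] runs Spark on the local machine
    val spark = SparkSession.builder()
      .appName("DataFrameConcept")
      .master("local[*]")
      .getOrCreate()

    // Load a CSV file into a DataFrame; header and schema inference are optional settings
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("people.csv")   // illustrative file name

    df.printSchema()   // column names and types, i.e. the schema
    df.show(5)         // preview the first 5 rows

    spark.stop()
  }
}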

A Pandas DataFrame is a 2-dimensional data structure, like a 2-dimensional array, or a table
with rows and columns.

In R, a data frame is used for storing data tables. It is a list of vectors of equal length.
For example, the following variable df is a data frame containing three vectors n, s, b:

> n = c(2, 3, 5)
> s = c("aa", "bb", "cc")
> b = c(TRUE, FALSE, TRUE)
> df = data.frame(n, s, b)

Data frames are widely used in data science, machine learning, scientific computing, and
many other data-intensive fields.

Retrieving Labels and Data

 Retrieve and modify row and column labels as sequences.
 Represent data as NumPy arrays.
 Check and adjust the data types.
 Analyze the size of DataFrame objects.

B) Loading data into a data frame:

Load CSV files into Python Pandas

# Load the Pandas library with alias 'pd'
import pandas as pd
# Read data from file 'filename.csv'
# (in the same directory that your python process is based)
# Control delimiters, rows, column names with read_csv (see later)
data = pd.read_csv("filename.csv")
# Preview the first 5 lines of the loaded data
data.head()

Importing a dataset into a DataFrame

Using the read_csv() function from the pandas package, you can import tabular data from
CSV files into a pandas DataFrame by specifying a parameter value for the file name
(e.g. pd.read_csv("filename.csv")). Remember that you gave pandas an alias (pd), so you
will use pd to call pandas functions.

Steps to import a CSV file into Python using Pandas
Step 1: Capture the file path. Firstly, capture the full path where your CSV file is stored.
Step 2: Apply the Python code.
Step 3: Run the code.
Optional step: Select a subset of columns.

Load Data Via the RStudio Menu Items

1. Text File or Web URL. As you can see in both of the "Import Dataset" menu items, you
can import a data set "From Text File" or "From Web URL".
2. Selecting the data format.
3. After the data is loaded.
4. Reading the file with a read function such as read.csv().
5. More read options.
6. Assigning the data set to a variable.
C) Operations on data frames.

Operations that can be performed on a DataFrame are:


 Creating a DataFrame.
 Accessing rows and columns.
 Selecting a subset of the data frame.
 Editing data frames.
 Adding extra rows and columns to the data frame.
 Adding new variables to the data frame based on existing ones.
 Deleting rows and columns in a data frame.
DataFrame Operations in R

DataFrames are generic data objects of R which are used to store tabular data. Data frames
are considered the most popular data objects in R programming because it is more
comfortable to analyze data in tabular form. Data frames can also be thought of as matrices
where each column can be of a different data type. A DataFrame is made up of three
principal components: the data, the rows, and the columns.
Creating a DataFrame:
In the real world, a DataFrame is usually created by loading a dataset from existing storage
such as a SQL database, a CSV file, or an Excel file. A DataFrame can also be created from
vectors in R. Creating a data frame using vectors: to create a data frame we use the
data.frame() function in R and pass each of the vectors you have created as arguments to
the function (a Spark equivalent is sketched below).
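
Since this course is Scala & Spark, a comparable construction in Spark builds a DataFrame
from a local collection with toDF. This is only a sketch, written for spark-shell (where a
SparkSession named spark already exists); the column names n, s, b mirror the R example
above:

// In spark-shell a SparkSession named spark is already available
import spark.implicits._   // enables toDF on local collections

// Each tuple combines one value from the "vectors" n, s and b of the R example
val df = Seq(
  (2, "aa", true),
  (3, "bb", false),
  (5, "cc", true)
).toDF("n", "s", "b")

df.show()   // displays the three rows with columns n, s, b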

Operations performed on a series or data frame:

We can perform basic operations on rows/columns like selecting, deleting, adding, and
renaming. Column selection: in order to select a column in a Pandas DataFrame, we can
access the column by calling it by its column name. The same operations on a Spark
DataFrame are sketched below.
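
As a hedged Scala sketch, the basic row/column operations look like this on a Spark
DataFrame, reusing the df with columns n, s, b built above:

val selected = df.select("n", "s")                  // column selection
val withDouble = df.withColumn("n2", df("n") * 2)   // add a new column based on an existing one
val renamed = df.withColumnRenamed("s", "label")    // rename a column
val dropped = df.drop("b")                          // delete a column
val filtered = df.filter(df("n") > 2)               // select a subset of rows

filtered.show()   // shows only the rows where n is greater than 2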

Data frame creation and operations in R, explained with an example:
A data frame is a table or a two-dimensional array-like structure in which each column
contains values of one variable and each row contains one set of values from each column.
The following are characteristics of a data frame: the column names should be non-empty,
and the row names should be unique.
