Spark Lab


Practical Lecture on Pyspark

1. Running Pyspark in Colab


To run Spark in Colab, we first need to install all the dependencies in the Colab
environment, i.e. Apache Spark 3.0.0 with Hadoop 3.2 (the versions used in the
commands below), Java 8 and Findspark to locate Spark on the system. The tools
can be installed inside the Colab Jupyter notebook. One important note: if you are
new to Spark, it is better to avoid the Spark 2.4.0 version, since some users have
reported compatibility issues with Python. Follow these steps to install the
dependencies:
# install java
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

# install spark (change the version number if needed)


!wget -q https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz

# extract the spark archive into the current folder


!tar xf spark-3.0.0-bin-hadoop3.2.tgz

Now that you have installed Spark and Java in Colab, it is time to set the environment
paths that enable you to run Pyspark in your Colab environment. Set the locations
of Java and Spark by running the following code:

# set the Java and Spark folders in your system environment


import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-bin-hadoop3.2"

# install findspark using pip


!pip install -q findspark

Run a local spark session to test your installation:

import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
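
As an optional sanity check, you can print the version of the running Spark session; it should match the archive you downloaded above:

# optional check: the session should report the installed Spark version
print(spark.version)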

2. Analysis and Regression on Boston housing dataset


2.1 Download the dataset given by your professor and keep it somewhere on
your computer. Load the dataset into your Colab directory from your local
system:

from google.colab import files

files.upload()

Check that the dataset has been uploaded correctly with the following
command:
!ls

Now that you have uploaded the dataset, you can start analyzing it. For our linear
regression model we need to import two modules from Pyspark, namely
VectorAssembler and LinearRegression. VectorAssembler is a transformer that
assembles all the features from multiple columns of type double into a single
vector. We could have used StringIndexer if any of our columns had contained
string values, to convert them into numeric values; a hedged sketch is shown just below.
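
For illustration only, here is a minimal StringIndexer sketch; the column names 'town' and 'town_indexed' are hypothetical, since the Boston Housing dataset used here is entirely numeric:

from pyspark.ml.feature import StringIndexer

# hypothetical: 'town' is not a column of BostonHousing.csv, so this is only a sketch
indexer = StringIndexer(inputCol='town', outputCol='town_indexed')
# indexed_dataset = indexer.fit(dataset).transform(dataset)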
The goal of this exercise is to predict housing prices from the given features. Let's
predict the prices of the Boston Housing dataset by considering MEDV as the
output variable and all the other variables as inputs.

dataset = spark.read.csv('BostonHousing.csv', inferSchema=True, header=True)

Notice that we used inferSchema inside the read.csv method. inferSchema enables us to
automatically infer the data type of each column.
Let us look into the dataset to see the data type of each column:
dataset.printSchema()

2.2 Transformation. The next step is to gather all the features from the different
columns into a single column; let's call this new vector column 'Attributes', given
as the outputCol.

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

#Input all the features in one vector column
assembler = VectorAssembler(inputCols=['crim', 'zn', 'indus',
'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax', 'ptratio', 'b',
'lstat'], outputCol='Attributes')

output = assembler.transform(dataset)

#Input vs Output
finalized_data = output.select("Attributes", "medv")

finalized_data.show()
You should obtain a two-column table showing the assembled 'Attributes' vectors and the corresponding 'medv' values.

Here, 'Attributes' contains the input features gathered from all the columns and
'medv' is the target column.

2.3. Split the dataset

Next, we should split the data into training and testing sets according to our
dataset (0.8 and 0.2 in this case).
We will use the transformed data (finalized_data), with featuresCol set to
'Attributes' and the label set to 'medv'.

#Split training and testing data


train_data, test_data = finalized_data.randomSplit([0.8, 0.2])

2.4. Learn and predict the linear regression

regressor = LinearRegression(featuresCol='Attributes', labelCol='medv')

#Learn to fit the model from the training set
regressor = regressor.fit(train_data)

#To predict the prices on the testing set
pred = regressor.evaluate(test_data)

#Show the predictions
pred.predictions.show()

2.5. Print the regression coefficients

We can also print the coefficients and the intercept of the regression model by
using the following commands:

#coefficients of the regression model
coeff = regressor.coefficients

#intercept of the regression model
intr = regressor.intercept

print("The coefficients of the model are : %a" % coeff)
print("The intercept of the model is : %f" % intr)

2.6. Evaluation of the Model

Once we are done with the basic linear regression operation, we can go a bit
further and analyze our model statistically by importing the RegressionEvaluator
module from Pyspark.

from pyspark.ml.evaluation import RegressionEvaluator

eval = RegressionEvaluator(labelCol="medv", predictionCol="prediction", metricName="rmse")

# Root Mean Square Error
rmse = eval.evaluate(pred.predictions)
print("RMSE: %.3f" % rmse)

# Mean Square Error
mse = eval.evaluate(pred.predictions, {eval.metricName: "mse"})
print("MSE: %.3f" % mse)

# Mean Absolute Error
mae = eval.evaluate(pred.predictions, {eval.metricName: "mae"})
print("MAE: %.3f" % mae)

# r2 - coefficient of determination
r2 = eval.evaluate(pred.predictions, {eval.metricName: "r2"})
print("r2: %.3f" % r2)

2.7. Clustering the dataset

In this exercise, we are interested in clustering the houses into 2 groups (clusters)
using the k-means algorithm.
You can check the following API documentation:

https://spark.apache.org/docs/latest/ml-clustering.html

To cluster the Boston houses dataset, import the KMeans library and train the
model:
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

# Trains a k-means model.


kmeans = KMeans(featuresCol='Attributes').setK(2).setSeed(1)
model = kmeans.fit(finalized_data)

To predict, use:
# Make predictions
predictions = model.transform(finalized_data)

and to see the predictions:

predictions.show()
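
The ClusteringEvaluator imported above can then be used, as a minimal sketch, to compute the silhouette score of this clustering:

# silhouette score of the 2-cluster model (default metric of ClusteringEvaluator)
evaluator = ClusteringEvaluator(featuresCol='Attributes')
silhouette = evaluator.evaluate(predictions)
print("Silhouette = %.3f" % silhouette)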

3. Churn analysis in Spark


Churn (from the English "change" and "turn") expresses the rate at which a
company or product loses customers. The churn rate is the percentage of
customers lost over a given period (usually one year) relative to the total number
of customers at the beginning of that period. The term is used primarily in the
telecommunications and banking sectors to measure the proportion of lost
customers and the loyalty to a product offer.
The overall churn covers three causes of stopping the use of a good or service:
1. Abandonment: the client no longer uses this type of product or
service.
2. The transition to competition: the customer turns to a directly
competing product.
3. The move to another offer from the same company: the customer
switches to a different offer, sold by the same company, that also covers their needs.

The churn rate for mobile phones is around 20%. It is therefore critical for mobile
operators to detect customers who may terminate their subscription, as well as
the probable cause of such termination, in order to quickly provide a suitable
retention offer.

3.1 Load the following datasets (it can be interesting to combine the two datasets
once they are loaded; a sketch is given after the loading commands below):

datasetCalls = spark.read.csv('CallsData.csv', header=True)

datasetContract = spark.read.csv('ContractData.csv', header=True)

datasetCalls.show()

datasetContract.show()
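
One possible way to combine them is a join on a customer identifier present in both files; the column name 'Phone' below is an assumption, so replace it with the actual shared key of your CSV files:

# join the two customer tables on a shared key ('Phone' is a hypothetical column name)
datasetChurn = datasetCalls.join(datasetContract, on='Phone', how='inner')
datasetChurn.show()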

3.2 Partitioning of dataset describing customers


A possible first step for this kind of analysis is the automatic partitioning
(clustering) of the data describing customers. It is assumed that each customer is
described by a score of relevant numerical values, and that a follow-up study lets
us know which former customers "churned" and under what conditions.

a. Is this data univariate or multivariate?

b. One of the first steps here is to normalize the data before clustering.
Is normalization necessary here? Justify your answer.
c. Apply the k-means algorithm with Spark, choosing different numbers of
clusters. Can you deduce the best number of clusters for grouping similar
customers? (A minimal sketch is given after this list.)
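
As a starting point for questions b and c, here is a minimal sketch; it assumes the customer attributes have already been assembled into a vector column called 'features' of a DataFrame called customer_data (for instance with VectorAssembler, as in section 2.2). Both names are hypothetical, not part of the provided datasets:

from pyspark.ml.feature import StandardScaler
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

# normalize the (assumed) assembled 'features' column before clustering
scaler = StandardScaler(inputCol='features', outputCol='scaledFeatures',
                        withMean=True, withStd=True)
scaled_data = scaler.fit(customer_data).transform(customer_data)

# try several values of k and compare the silhouette scores
evaluator = ClusteringEvaluator(featuresCol='scaledFeatures')
for k in range(2, 7):
    model = KMeans(featuresCol='scaledFeatures', k=k, seed=1).fit(scaled_data)
    silhouette = evaluator.evaluate(model.transform(scaled_data))
    print("k = %d, silhouette = %.3f" % (k, silhouette))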

3.3 Visualization and analysis of obtained clustering


The second step for this problem would be to visualize and analyze the
partitions.

d. Why is a dimensionality reduction method well suited for this task, and how
can a visualization help in understanding this problem?
e. Choose any reduction method that allows you to visualize the data in 2
dimensions (a PCA sketch is given after the link below):
https://spark.apache.org/docs/latest/mllib-dimensionality-reduction
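
For instance, here is a minimal PCA sketch with pyspark.ml, reusing the hypothetical scaled_data DataFrame from the previous sketch; the resulting 2-dimensional 'pcaFeatures' column can then be collected and plotted (e.g. with matplotlib):

from pyspark.ml.feature import PCA

# project the scaled customer features (assumed column 'scaledFeatures') onto 2 components
pca = PCA(k=2, inputCol='scaledFeatures', outputCol='pcaFeatures')
projected = pca.fit(scaled_data).transform(scaled_data)
projected.select('pcaFeatures').show(truncate=False)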

