Spark Lab


Practical Lecture on Pyspark

1. Running Pyspark in Colab


To run Spark in Colab, we first need to install all the dependencies in the Colab
environment, i.e. Apache Spark 3.0.0 with Hadoop 3.2 (the versions used in the
commands below), Java 8 and Findspark to locate Spark on the system. The tools
can be installed inside the Colab Jupyter notebook. One important note: if you are
new to Spark, it is better to avoid the Spark 2.4.0 version, since some users have
reported compatibility issues with Python. Follow these steps to install the
dependencies:
# install java
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

# install spark (change the version number if needed)


!wget -q https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz

# extract the spark archive into the current folder


!tar xf spark-3.0.0-bin-hadoop3.2.tgz

Now that you have installed Spark and Java in Colab, it is time to set the environment
paths that enable you to run Pyspark in your Colab environment. Set the locations
of Java and Spark by running the following code:

# set the Java and Spark folders in your system environment


import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-bin-hadoop3.2"

# install findspark using pip


!pip install -q findspark

Run a local spark session to test your installation:

import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
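
As an optional sanity check, you can print the version of the running Spark session; it should match the archive you downloaded above:

# optional check: the session should report the installed Spark version
print(spark.version)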

2. Analysis and Regression on Boston housing dataset


2.1 Download the dataset given by your professor and keep it somewhere on
your computer. Load the dataset into your Colab directory from your local
system:

from google.colab import files

files.upload()

Check that the dataset has been uploaded correctly with the following
command:
!ls

Now that you have uploaded the dataset, you can start analyzing it. For our linear
regression model we need to import two modules from Pyspark, namely
VectorAssembler and LinearRegression. VectorAssembler is a transformer that
assembles all the features from multiple columns of type double into a single
vector. We could have used StringIndexer if any of our columns had contained
string values, to convert them into numeric values; a hedged sketch is shown just below.
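
For illustration only, here is a minimal StringIndexer sketch; the column names 'town' and 'town_indexed' are hypothetical, since the Boston Housing dataset used here is entirely numeric:

from pyspark.ml.feature import StringIndexer

# hypothetical: 'town' is not a column of BostonHousing.csv, so this is only a sketch
indexer = StringIndexer(inputCol='town', outputCol='town_indexed')
# indexed_dataset = indexer.fit(dataset).transform(dataset)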
The goal of this exercise is to predict housing prices from the given features. Let's
predict the prices of the Boston Housing dataset by considering MEDV as the
output variable and all the other variables as inputs.

dataset = spark.read.csv('BostonHousing.csv', inferSchema=True, header=True)

Notice that we used inferSchema inside the read.csv method. inferSchema enables us to
automatically infer the data type of each column.
Let us look into the dataset to see the data type of each column:
dataset.printSchema()

2.2 Transformation. The next step is to gather all the features from the different
columns into a single column; let's call this new vector column 'Attributes', given
as the outputCol.

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

#Input all the features in one vector column
assembler = VectorAssembler(inputCols=['crim', 'zn', 'indus',
'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax', 'ptratio', 'b',
'lstat'], outputCol='Attributes')

output = assembler.transform(dataset)

#Input vs Output
finalized_data = output.select("Attributes", "medv")

finalized_data.show()
You should obtain a two-column table showing the assembled 'Attributes' vectors and the corresponding 'medv' values.

Here, 'Attributes' contains the input features gathered from all the columns and
'medv' is the target column.

2.3. Split the dataset

Next, we should split the data into training and testing sets according to our
dataset (0.8 and 0.2 in this case).
We will use the transformed data (finalized_data), with featuresCol set to
'Attributes' and the label set to 'medv'.

#Split training and testing data


train_data, test_data = finalized_data.randomSplit([0.8, 0.2])

2.4. Learn and predict the linear regression

regressor = LinearRegression(featuresCol='Attributes', labelCol='medv')

#Learn to fit the model from the training set
regressor = regressor.fit(train_data)

#To predict the prices on the testing set
pred = regressor.evaluate(test_data)

#Show the predictions
pred.predictions.show()

2.5. Print the regression coefficients

We can also print the coefficients and the intercept of the regression model by
using the following commands:

#coefficients of the regression model
coeff = regressor.coefficients

#intercept of the regression model
intr = regressor.intercept

print("The coefficients of the model are : %a" % coeff)
print("The intercept of the model is : %f" % intr)

2.6. Evaluation of the Model

Once we are done with the basic linear regression operation, we can go a bit
further and analyze our model statistically by importing the RegressionEvaluator
module from Pyspark.

from pyspark.ml.evaluation import RegressionEvaluator

eval = RegressionEvaluator(labelCol="medv", predictionCol="prediction", metricName="rmse")

# Root Mean Square Error
rmse = eval.evaluate(pred.predictions)
print("RMSE: %.3f" % rmse)

# Mean Square Error
mse = eval.evaluate(pred.predictions, {eval.metricName: "mse"})
print("MSE: %.3f" % mse)

# Mean Absolute Error
mae = eval.evaluate(pred.predictions, {eval.metricName: "mae"})
print("MAE: %.3f" % mae)

# r2 - coefficient of determination
r2 = eval.evaluate(pred.predictions, {eval.metricName: "r2"})
print("r2: %.3f" % r2)

2.7. Clustering the dataset

In this exercise, we are interested in clustering the houses into 2 groups (clusters)
using the k-means algorithm.
You can check the following API documentation:

https://spark.apache.org/docs/latest/ml-clustering.html

To cluster the Boston houses dataset, import the KMeans library and train the
model:
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

# Trains a k-means model.


kmeans = KMeans(featuresCol='Attributes').setK(2).setSeed(1)
model = kmeans.fit(finalized_data)

To predict, use:
# Make predictions
predictions = model.transform(finalized_data)

and to see the predictions:

predictions.show()
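
The ClusteringEvaluator imported above can then be used, as a minimal sketch, to compute the silhouette score of this clustering:

# silhouette score of the 2-cluster model (default metric of ClusteringEvaluator)
evaluator = ClusteringEvaluator(featuresCol='Attributes')
silhouette = evaluator.evaluate(predictions)
print("Silhouette = %.3f" % silhouette)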

3. Churn analysis in Spark


Churn (from the English "change" and "turn") expresses the rate at which a
company or product loses customers. The churn rate is the percentage of
customers lost over a given period (usually one year) relative to the total number
of customers at the beginning of that period. The term is used primarily in the
telecommunications and banking sectors to measure the proportion of lost
customers and the loyalty to a product offer.
The overall churn covers three causes of stopping the use of a good or service:
1. Abandonment: the client no longer uses this type of product or
service.
2. The transition to competition: the customer turns to a directly
competing product.
3. The move to another offer from the same company: the customer
switches to a different offer, sold by the same company, that also covers their needs.

The churn rate for mobile phones is around 20%. It is therefore critical for mobile
operators to detect customers who may terminate their subscription, as well as
the probable cause of such termination, in order to quickly provide a suitable
retention offer.

3.1 Load the following datasets (it can be interesting to combine the two datasets
once they are loaded; a sketch is given after the loading commands below):

datasetCalls = spark.read.csv('CallsData.csv', header=True)

datasetContract = spark.read.csv('ContractData.csv', header=True)

datasetCalls.show()

datasetContract.show()
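
One possible way to combine them is a join on a customer identifier present in both files; the column name 'Phone' below is an assumption, so replace it with the actual shared key of your CSV files:

# join the two customer tables on a shared key ('Phone' is a hypothetical column name)
datasetChurn = datasetCalls.join(datasetContract, on='Phone', how='inner')
datasetChurn.show()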

3.2 Partitioning of dataset describing customers


A possible first step for this kind of analysis is the automatic partitioning
(clustering) of the data describing customers. It is assumed that each customer is
described by a score of relevant numerical values, and that a follow-up study lets
us know which former customers "churned" and under what conditions.

a. Is this data univariate or multivariate?

b. One of the first steps here is to normalize the data before clustering.
Is normalization necessary here? Justify your answer.
c. Apply the k-means algorithm with Spark, choosing different numbers of
clusters. Can you deduce the best number of clusters for grouping similar
customers? (A minimal sketch is given after this list.)
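
As a starting point for questions b and c, here is a minimal sketch; it assumes the customer attributes have already been assembled into a vector column called 'features' of a DataFrame called customer_data (for instance with VectorAssembler, as in section 2.2). Both names are hypothetical, not part of the provided datasets:

from pyspark.ml.feature import StandardScaler
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

# normalize the (assumed) assembled 'features' column before clustering
scaler = StandardScaler(inputCol='features', outputCol='scaledFeatures',
                        withMean=True, withStd=True)
scaled_data = scaler.fit(customer_data).transform(customer_data)

# try several values of k and compare the silhouette scores
evaluator = ClusteringEvaluator(featuresCol='scaledFeatures')
for k in range(2, 7):
    model = KMeans(featuresCol='scaledFeatures', k=k, seed=1).fit(scaled_data)
    silhouette = evaluator.evaluate(model.transform(scaled_data))
    print("k = %d, silhouette = %.3f" % (k, silhouette))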

3.3 Visualization and analysis of obtained clustering


The second step for this problem would be to visualize and analyze the
partitions.

d. Why is a dimensionality reduction method well suited for this task, and how
can a visualization help in understanding this problem?
e. Choose any reduction method that allows you to visualize the data in 2
dimensions (a PCA sketch is given after the link below):
https://spark.apache.org/docs/latest/mllib-dimensionality-reduction
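
For instance, here is a minimal PCA sketch with pyspark.ml, reusing the hypothetical scaled_data DataFrame from the previous sketch; the resulting 2-dimensional 'pcaFeatures' column can then be collected and plotted (e.g. with matplotlib):

from pyspark.ml.feature import PCA

# project the scaled customer features (assumed column 'scaledFeatures') onto 2 components
pca = PCA(k=2, inputCol='scaledFeatures', outputCol='pcaFeatures')
projected = pca.fit(scaled_data).transform(scaled_data)
projected.select('pcaFeatures').show(truncate=False)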

