Spark Lab
Now that you have installed Spark and Java in Colab, it is time to set the environment path so that you can run PySpark in your Colab environment. Set the location of Java and Spark by running the following code:
import findspark
findspark.init()

from pyspark.sql import SparkSession

# Create a local SparkSession using all available cores
spark = SparkSession.builder.master("local[*]").getOrCreate()
Check that the dataset is uploaded correctly to the system with the following command:

!ls
Now that you have uploaded the dataset, you can start analyzing it. For our linear regression model we need to import two modules from PySpark: VectorAssembler and LinearRegression. VectorAssembler is a transformer that assembles all the features from multiple columns of type double into one vector. If any of our columns contained string values, we could have used StringIndexer to convert them into numeric values.
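Both classes live in the pyspark.ml package; the imports are:

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression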
The goal of this exercise is to predict housing prices from the given features. Let's predict the prices of the Boston Housing dataset by considering MEDV as the output variable and all the other variables as inputs.
dataset = spark.read.csv('BostonHousing.csv', inferSchema=True, header=True)
Notice that we used the inferSchema option inside read.csv. It automatically infers the appropriate data type for each column.
Let us look into the dataset to see the data types of each column:

dataset.printSchema()
2.2 Transformation. The next step is to combine all the features from the different columns into a single column; let's call this new vector column 'Attributes' in the outputCol (see the sketch below).
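A minimal sketch of the assembler, assuming every column except the target 'medv' is a feature:

# Assemble all feature columns (everything except the target 'medv') into one vector
feature_cols = [c for c in dataset.columns if c != 'medv']
assembler = VectorAssembler(inputCols=feature_cols, outputCol='Attributes')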
output = assembler.transform(dataset)

# Input vs output
finalized_data = output.select("Attributes", "medv")
finalized_data.show()
You should obtain a two-column result: 'Attributes' contains the input features assembled from all the columns, and 'medv' is the target column.
Next, we should split the data into training and testing sets according to our dataset (0.8 and 0.2 in this case). We will use this transformed data (finalized_data) by indicating the featuresCol as 'Attributes' and the labelCol as 'medv', as sketched below.
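A minimal sketch of the split, fit, and prediction steps, assuming an 80/20 random split:

# 80/20 train/test split
train_data, test_data = finalized_data.randomSplit([0.8, 0.2])

# Fit the linear regression on the training set
regressor = LinearRegression(featuresCol='Attributes', labelCol='medv')
regressor = regressor.fit(train_data)

# Evaluate on the test set; pred.predictions is a DataFrame with a 'prediction' column
pred = regressor.evaluate(test_data)
pred.predictions.show()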
We can also print the coefficients and intercept of the regression model by using the following commands:

# Coefficients and intercept of the regression model
coeff = regressor.coefficients
intr = regressor.intercept
print("Coefficients: %s, intercept: %.3f" % (coeff, intr))
Once we are done with the basic linear regression operation, we can go a bit further and analyze our model statistically by importing the RegressionEvaluator module from PySpark:

from pyspark.ml.evaluation import RegressionEvaluator
evaluator = RegressionEvaluator(labelCol="medv")

# r2 - coefficient of determination
r2 = evaluator.evaluate(pred.predictions, {evaluator.metricName: "r2"})
print("r2: %.3f" % r2)
https://spark.apache.org/docs/latest/ml-clustering.html
To cluster the Boston houses dataset, import the KMeans library and train the model:

from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
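A minimal training sketch over the assembled 'Attributes' column, assuming k = 3 clusters (an arbitrary choice here):

# Train a k-means model with an assumed k of 3
kmeans = KMeans(featuresCol='Attributes', k=3)
model = kmeans.fit(finalized_data)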
To predict, use:

# Make predictions
predictions = model.transform(finalized_data)
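Since ClusteringEvaluator was imported above, one way to assess the clustering quality is the silhouette score; a sketch:

# Evaluate clustering quality with the silhouette score
evaluator = ClusteringEvaluator(featuresCol='Attributes')
silhouette = evaluator.evaluate(predictions)
print("Silhouette = %.3f" % silhouette)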
The churn rate of mobile phone customers is around 20%. It is therefore critical for mobile operators to detect customers who may terminate their subscription, as well as the probable cause of such termination, in order to quickly make a suitable retention offer.
3.1 Load the following datasets (it can be interesting to concatenate both datasets before loading them):
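A minimal loading sketch, assuming two CSV files with the hypothetical names 'calls.csv' and 'contract.csv':

# Hypothetical file names -- replace with the actual dataset files
datasetCalls = spark.read.csv('calls.csv', inferSchema=True, header=True)
datasetContract = spark.read.csv('contract.csv', inferSchema=True, header=True)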
datasetCalls.show()
datasetContract.show()