Clustering-Kprototype Code

The document outlines a process for clustering customer data using K-Means in PySpark. It includes steps for reading data, assembling features, scaling the data, applying K-Means clustering, and evaluating the model using silhouette scores. Finally, it visualizes the silhouette scores to determine the optimal number of clusters.


In [ ]:

from pyspark.sql import SparkSession

# Start a Spark session and load the customer data from CSV
spark = SparkSession.builder.appName('Clustering using K-Means').getOrCreate()
data_customer = spark.read.csv('prodintdb.csv', header=True, inferSchema=True)
data_customer.printSchema()

In [ ]:
from pyspark.ml.feature import VectorAssembler

data_customer.columns
# Combine the input columns into a single 'features' vector column
assemble = VectorAssembler(inputCols=['PDPcountperday', 'CheckoutHistory', 'Booked Revnue',
                                      'Brandname', 'Styletype'],
                           outputCol='features')
assembled_data = assemble.transform(data_customer)
assembled_data.show(2)

In [ ]:

from pyspark.ml.feature import StandardScaler

# Scale the assembled feature vector so each dimension has unit standard deviation
scale = StandardScaler(inputCol='features', outputCol='standardized')
data_scale = scale.fit(assembled_data)
data_scale_output = data_scale.transform(assembled_data)
data_scale_output.show(2)
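
By default, Spark's StandardScaler divides each feature by its standard deviation but does not subtract the mean. If centred features are wanted as well, a minimal variant (not part of the original notebook) sets both options explicitly:

In [ ]:

# Variant (assumption, not in the original): centre and scale each feature.
# withMean=True subtracts the column mean; withStd=True divides by the standard deviation.
scale_centered = StandardScaler(inputCol='features', outputCol='standardized',
                                withMean=True, withStd=True)
data_scale_output = scale_centered.fit(assembled_data).transform(assembled_data)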

In [ ]:
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

silhouette_score = []
evaluator = ClusteringEvaluator(predictionCol='prediction', featuresCol='standardized',
                                metricName='silhouette', distanceMeasure='squaredEuclidean')

# Fit K-Means for k = 2..9 and record the silhouette score of each model
for i in range(2, 10):
    KMeans_algo = KMeans(featuresCol='standardized', k=i)
    KMeans_fit = KMeans_algo.fit(data_scale_output)
    output = KMeans_fit.transform(data_scale_output)
    score = evaluator.evaluate(output)
    silhouette_score.append(score)
    print("Silhouette Score:", score)

In [ ]:

# Visualizing the silhouette scores for each k in a plot

import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 1, figsize=(8, 6))
ax.plot(range(2, 10), silhouette_score)
ax.set_xlabel('k')
ax.set_ylabel('silhouette score')
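
Once a k has been chosen from the plot, the clusters can be interpreted by averaging each input column per cluster. A minimal sketch (assuming the clustered DataFrame from the earlier sketch):

In [ ]:

# Sketch (assumption): average each input feature per cluster to profile the segments
from pyspark.sql import functions as F

input_cols = ['PDPcountperday', 'CheckoutHistory', 'Booked Revnue', 'Brandname', 'Styletype']
clustered.groupBy('prediction').agg(
    *[F.avg(c).alias('avg_' + c) for c in input_cols]
).orderBy('prediction').show()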
