Final Project
Task 1
YC1_1: Illustration
Solving:
• First, create a VectorAssembler() instance with the input columns specified by the name_columns list, and set the name of the output column to "features". VectorAssembler combines multiple columns into a single vector column, packing the 784 pixel columns of each image into one 784-dimensional feature vector (the 28 x 28 reshape for display comes later).
• Then apply the assembler to the df DataFrame to create a new DataFrame, df_features.
• Then use select() to pick specific columns from the DataFrame.
• Call show() to display the resulting df_features DataFrame.
• To display the images, create a reshape_image_vector function that takes a vector of image data and, using NumPy, reshapes it into a 28x28 matrix. The resulting matrix is converted to a list and returned.
• df.take(15) takes the first 15 rows of features and labels from the DataFrame.
• Finally, create a figure with 3 rows and 5 columns of subplots using Matplotlib's plt.subplots function, and display each of the first 15 images in the DataFrame in grayscale with its corresponding label as the title.
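The reshape-and-display steps above can be sketched as follows (random pixel data stands in for the real DataFrame rows):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")                     # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

def reshape_image_vector(vec):
    """Reshape a flat 784-value image vector into a 28x28 matrix (as a list)."""
    return np.array(vec, dtype=float).reshape(28, 28).tolist()

# Stand-in for df.take(15): 15 (features, label) pairs of random pixels.
rng = np.random.default_rng(0)
rows = [(rng.random(784), i % 10) for i in range(15)]

# 3 x 5 grid of subplots, each image in grayscale with its label as the title.
fig, axes = plt.subplots(3, 5, figsize=(10, 6))
for ax, (features, label) in zip(axes.flat, rows):
    ax.imshow(reshape_image_vector(features), cmap="gray")
    ax.set_title(str(label))
    ax.axis("off")
```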
• Output:
YC1_2: Clustering
• Define a list set_clusters containing the number of clusters to use in
the k-means clustering algorithm.
• Define an empty list set_model_load and set the count variable back to 0.
• The resulting set_model_load list contains the trained k-means models loaded from saved files.
YC1_3: Result Visualization
Create a bar chart of the sum of distances for each value of k:
• Use Matplotlib's bar function to create a bar chart with the set_clusters list as
the x-axis and the sumOfDis list as the y-axis.
• Use Matplotlib's xticks function to set the x-tick labels to the values in
set_clusters.
• Use Matplotlib's yticks function to set the y-tick labels to the values in sumOfDis.
• Use Matplotlib's xlabel function to set the x-axis label to "k value".
• Use Matplotlib's ylabel function to set the y-axis label to "Summation of
Euclidean distances".
• Use Matplotlib's show function to display the resulting bar chart.
• Output:
Task 2
• YC2_1: Dimensionality reduction in the training set
• Load the MNIST training dataset from a CSV file into a PySpark DataFrame.
a. Use PySpark's read.csv function to read the CSV file.
b. Specify that the CSV file has no header row and let Spark infer the column types (inferSchema).
• Output:
• CSV files of the train set and test set
Task 3
Recommendation with Collaborative Filtering
• Load the ratings dataset from a CSV file into a PySpark DataFrame.
• a. Use PySpark's read.csv function to read the CSV file.
• b. Specify that the CSV file has a header row and let Spark infer the column types.
• Filter the predictions DataFrame to include only the relevant user and item IDs:
• a. Use PySpark's filter method to keep only records where the "user" column is
between 71 and 74 and the "item" column is between 401 and 467.
• b. Use PySpark's select method to keep only the "user", "item", and "prediction"
columns.
• Join the two DataFrames together:
• a. Use PySpark's join method to join the two DataFrames on the "user" and "item"
columns.
• Evaluate the performance of the model on the relevant subset of the data using mean
squared error:
• a. Create an instance of the RegressionEvaluator class for computing mean squared error.
• b. Specify the column names for the label, prediction, and metric.
• c. Use the evaluate method of the evaluator instance to compute mean squared error on
the joined DataFrame.
• d. Print the mean squared error.
Task 4
• Load the stock price data from a CSV file into a PySpark DataFrame:
• a. Use PySpark's read.csv function to read the CSV file.
• b. Specify that the CSV file has a header row and let Spark infer the column types.
• The select function is used to select the "features" and "_c0" or "_c196" columns from the DataFrames, depending on whether the dataset comes from the original data or the SVD-preprocessed data. The "features" column contains the combined feature vectors, while the "_c0" or "_c196" column contains the corresponding label for each data point.
• Finally, the withColumnRenamed function is used to rename the "_c0" or "_c196" column to
"label" for both the training and test datasets, in order to conform to the standard format for
classification problems.
Task 5
• Multi-layer Perceptron
• Trains a Multilayer Perceptron (MLP) classifier on the MNIST handwritten digits dataset using PySpark's MultilayerPerceptronClassifier class. The layers parameter specifies the number of neurons in each layer of the MLP, while maxIter specifies the maximum number of iterations for training the model.
• The fit function is called on the training data to train the model, and then the trained model is used to make
predictions on the training and test datasets, as well as the SVD-preprocessed versions of these datasets.
• The show function is used to display the first five rows of the predictions for each of the datasets.
• The MulticlassClassificationEvaluator class is used to evaluate the accuracy of the model on the test and SVD-
preprocessed test datasets. The metricName parameter is set to "accuracy" to compute the accuracy score.
• Finally, the accuracy scores for the training and test datasets, as well as the SVD-preprocessed versions of these
datasets, are printed using the print function.
• Random Forest
• Trains a Random Forest classifier on the MNIST handwritten digits dataset using PySpark's RandomForestClassifier class. The numTrees parameter specifies the number of decision trees in the forest, while maxDepth specifies the maximum depth of each decision tree.
• The fit function is called on the training data to train the model, and then the trained model is used to make
predictions on the training and test datasets, as well as the SVD-preprocessed versions of these datasets.
• The show function is used to display the first five rows of the predictions for each of the datasets.
• The MulticlassClassificationEvaluator class is used to evaluate the accuracy of the model on the test and SVD-
preprocessed test datasets. The metricName parameter is set to "accuracy" to compute the accuracy score.
• Finally, the accuracy scores for the training and test datasets, as well as the SVD-preprocessed versions of these
datasets, are printed using the print function.
• Support Vector Machine
• Trains a Linear Support Vector Machine (LinearSVM) classifier on the MNIST handwritten digits dataset using PySpark's LinearSVC class, and uses the One-vs-Rest (OvR) strategy for multi-class classification.
• The maxIter parameter specifies the maximum number of iterations for training the model, while
regParam specifies the regularization parameter.
• A Pipeline is created to encapsulate the OvR classifier and fit the model to the training data.
• The trained model is then used to make predictions on the training and test datasets, as well as the
SVD-preprocessed versions of these datasets.
• The MulticlassClassificationEvaluator class is used to evaluate the accuracy of the model on the test
and SVD-preprocessed test datasets. The metricName parameter is set to "accuracy" to compute
the accuracy score.
• Finally, the accuracy scores for the training and test datasets, as well as the SVD-preprocessed
versions of these datasets, are printed using the print function.
Assignment Sheet
Trung Kiên Thành Lộc
Task 1 HT
Task 2 HT
Task 3 HT
Task 4 HT
Task 5 HT
Task 6 HT