
FINAL PROJECT

Course: Mining Massive Datasets


Members

Student ID | Full name | Email | Assigned tasks | Completion
519H0310 | Trần Lê Thành Lộc | [email protected] | Tasks 1, 3, 5 | 100%
519H0306 | Trần Trung Kiên | [email protected] | Tasks 2, 4, 6 | 100%



Task 1
YC1_1: Illustration
Solution:
• First, create a VectorAssembler() instance with the input columns specified by the name_columns list and the output column named "features". VectorAssembler combines multiple columns into a single vector column, so each image becomes one 784-element feature vector (later reshaped into a 28 x 28 matrix for display).
• Apply the assembler to the df DataFrame to create a new DataFrame df_features (a sketch follows).
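A minimal sketch of these two steps, assuming an existing DataFrame df and the column-name list name_columns mentioned above:

from pyspark.ml.feature import VectorAssembler

# Combine the individual pixel columns into a single "features" vector column
assembler = VectorAssembler(inputCols=name_columns, outputCol="features")
df_features = assembler.transform(df)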
Task 1
• Then use select() to pick the required columns from the DataFrame.
• A show() call displays the resulting df_features DataFrame.
• To display an image, the reshape_image_vector function takes a vector of image data, reshapes it into a 28 x 28 matrix with NumPy, converts the resulting matrix to a list, and returns it.
• df.take(15) takes the first 15 rows of features and labels from the DataFrame.
Task 1
• Finally, create a figure with 3 rows and 5 columns of subplots using Matplotlib's plt.subplots, display the first 15 images of the DataFrame in grayscale, and title each subplot with the corresponding label (see the sketch below).
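A sketch of the reshape-and-display steps described above; the "label" column name is an assumption:

import numpy as np
import matplotlib.pyplot as plt

def reshape_image_vector(vec):
    # Reshape a 784-element image vector into a 28 x 28 matrix and return it as a list
    return np.array(vec.toArray()).reshape(28, 28).tolist()

rows = df_features.select("features", "label").take(15)

# 3 x 5 grid of grayscale images, each titled with its label
fig, axes = plt.subplots(3, 5, figsize=(10, 6))
for ax, row in zip(axes.flat, rows):
    ax.imshow(reshape_image_vector(row["features"]), cmap="gray")
    ax.set_title(str(row["label"]))
    ax.axis("off")
plt.show()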
Task 1
• Output:
Task 1
YC1_2: Clustering
• Define a list set_clusters containing the number of clusters to use in
the k-means clustering algorithm.

• Define an empty list set_model and an empty list sumOfDis.


Task 1
• For each value of k in set_clusters (a sketch of the loop follows the list):
- Create a KMeans instance with k clusters and a specified random seed.
- Fit the KMeans model to the df_features DataFrame.
- Print the distance measure used by the model.
- Add a new column to the DataFrame with the predicted cluster labels.
- Append the trained KMeans model to the set_model list.
- Print the number of images in each cluster.
- For each cluster, print the first 5 samples in the cluster.
- Get the centroids of the model.
- Compute the distance from each sample to its assigned centroid, and sum the resulting
distances.
- Append the resulting sum to the sumOfDis list.
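A sketch of this loop, assuming df_features from YC1_1; the k values in set_clusters and the seed are placeholders:

import numpy as np
from pyspark.ml.clustering import KMeans
from pyspark.sql.functions import col

set_clusters = [5, 10, 15]   # placeholder k values
set_model, sumOfDis = [], []

for k in set_clusters:
    kmeans = KMeans(k=k, seed=1, featuresCol="features")
    print(kmeans.getDistanceMeasure())           # distance measure used ("euclidean" by default)
    model = kmeans.fit(df_features)
    predictions = model.transform(df_features)   # adds a "prediction" column with cluster labels
    set_model.append(model)

    predictions.groupBy("prediction").count().show()            # number of images per cluster
    for c in range(k):
        predictions.filter(col("prediction") == c).show(5)      # first 5 samples of each cluster

    # Sum of Euclidean distances from each sample to its assigned centroid
    centers = model.clusterCenters()
    total = (predictions.select("features", "prediction").rdd
             .map(lambda r: float(np.linalg.norm(r["features"].toArray() - centers[r["prediction"]])))
             .sum())
    sumOfDis.append(total)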
Task 1

• When running the code:


• The resulting set_model list contains the trained k-means models for
different values of k.
• The resulting sumOfDis list contains the sum of distances between
each sample and its assigned centroid for different values of k.
Task 1
Save and load model
• After that, set a variable count to 0.

• For each model in the set_model list:


- Save the model to a file named "model" + the current value of set_clusters[count].
- Increment the count variable by 1.

• Define an empty list set_model_load and set the count variable back to 0.

• For each value of k in set_clusters:


- Load the saved model from the file "model" + the current value of k, then append the loaded model to the set_model_load list and increment the count variable by 1.

• The resulting set_model_load list contains the trained k-means models loaded from saved files.
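A sketch of the save/load step, continuing from the loop above; the "model" + k file-name pattern follows the description:

from pyspark.ml.clustering import KMeansModel

# Save each trained model under a name that encodes its k value
count = 0
for model in set_model:
    model.save("model" + str(set_clusters[count]))
    count += 1

# Load the models back from disk in the same order
set_model_load = []
count = 0
for k in set_clusters:
    set_model_load.append(KMeansModel.load("model" + str(k)))
    count += 1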
Task 1
YC1_3: Result Visualization
Create a bar chart of the sum of distances for each value of k:
• Use Matplotlib's bar function to create a bar chart with the set_clusters list as
the x-axis and the sumOfDis list as the y-axis.
• Use Matplotlib's xticks function to set the x-tick labels to the values in
set_clusters.
• Use Matplotlib's yticks function to set the y-tick labels to the values in sumOfDis.
• Use Matplotlib's xlabel function to set the x-axis label to "k value".
• Use Matplotlib's ylabel function to set the y-axis label to "Summation of
Euclidean distances".
• Use Matplotlib's show function to display the resulting bar chart.
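A sketch of the chart described above, assuming set_clusters and sumOfDis from YC1_2:

import matplotlib.pyplot as plt

plt.bar(set_clusters, sumOfDis)
plt.xticks(set_clusters)
plt.yticks(sumOfDis)
plt.xlabel("k value")
plt.ylabel("Summation of Euclidean distances")
plt.show()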
Task 1
• Outputs:
Task 2
• YC2_1: Dimensionality reduction in the training set
• Load the MNIST training dataset from a CSV file into a PySpark DataFrame.
a. Use PySpark's read.csv function to read the CSV file.
b. Specify that the CSV file has no header row and let Spark infer the column types.

• Convert the DataFrame into an RDD of dense vectors:


a. Use the rdd method to convert the DataFrame to an RDD.
b. Use a lambda function to extract the pixel values from each row and convert them to a dense
vector.

• Create a RowMatrix from the RDD of dense vectors:


a. Use the RowMatrix class to create a RowMatrix from the RDD.
Task 2
• Compute the singular value decomposition (SVD) of the RowMatrix:
a. Use the computeSVD method of the RowMatrix to compute the SVD.
b. Specify the number of singular values to keep and whether or not to compute the
right singular vectors.

• Transform the RowMatrix using the SVD:


a. Use the U attribute of the SVD, which holds the transformed rows as a distributed RowMatrix.
b. Use the rows attribute of that RowMatrix to extract the rows as an RDD of DenseVectors.
c. Use a lambda function to convert each DenseVector to a list of its values.
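A minimal sketch of the steps so far (loading, building the RowMatrix, and computing the SVD). The file name is a placeholder, and k = 196 singular values is an assumption suggested by the "_c196" label column mentioned in Task 5:

from pyspark.sql import SparkSession
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix

spark = SparkSession.builder.getOrCreate()

# Read the MNIST training CSV (no header), letting Spark infer column types
df_train = spark.read.csv("mnist_train.csv", header=False, inferSchema=True)

# Assuming the first column is the label, keep the 784 pixel values as dense vectors
pixels = df_train.rdd.map(lambda row: Vectors.dense(row[1:]))

mat = RowMatrix(pixels)

# Keep k singular values and compute U so the rows can be projected
svd = mat.computeSVD(196, computeU=True)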
Task 2
• Convert the RDD of transformed rows to a DataFrame:
a. Use PySpark's createDataFrame function to create a DataFrame from the RDD.
b. Use a list comprehension to create column names for the DataFrame.

• Add a row index to the DataFrame:


a. Use PySpark's withColumn method to add a new column to the DataFrame containing a
monotonically increasing ID.

• Save the transformed DataFrame to a CSV file:


a. Use the coalesce method to reduce the number of partitions to 1.
b. Use PySpark's write.csv method to save the DataFrame to a CSV file.
c. Specify that the output file should include a header row.
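Continuing the sketch above: turn the projected rows into a DataFrame, add a row index, and write a single CSV file (column and file names are placeholders):

from pyspark.sql.functions import monotonically_increasing_id

# svd.U is a distributed RowMatrix; convert each DenseVector row to a plain list
proj_rows = svd.U.rows.map(lambda v: v.toArray().tolist())

col_names = ["c_" + str(i) for i in range(196)]
df_svd = spark.createDataFrame(proj_rows, col_names)

# Add a monotonically increasing row index
df_svd = df_svd.withColumn("id", monotonically_increasing_id())

# Reduce to one partition and save with a header row
df_svd.coalesce(1).write.csv("mnist_train_svd", header=True)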
Task 2
• YC2_2: Dimensionality reduction in the test set
• Do the same as YC2_1, but using the test set.

• Output:
• CSV files for the train and test sets
Task 3
Recommendation with Collaborative Filtering
• Load the ratings dataset from a CSV file into a PySpark DataFrame.
• a. Use PySpark's read.csv function to read the CSV file.
• b. Specify that the CSV file has a header row and let Spark infer the column types.

• Filter the DataFrame to remove records with missing ratings:


• a. Use PySpark's select method to select only the "user", "item", and "rating"
columns.
• b. Use PySpark's filter method to remove records with missing ratings.
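A sketch of the loading and filtering steps, assuming a ratings file with "user", "item", and "rating" columns (the file name is a placeholder):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

ratings = spark.read.csv("ratings.csv", header=True, inferSchema=True)

# Keep only the three relevant columns and drop rows with missing ratings
ratings = (ratings.select("user", "item", "rating")
                  .filter(col("rating").isNotNull()))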
Task 3
• Filter the DataFrame to keep only the first 75 users and 468 items
• Pivot the filtered DataFrame to create a matrix of ratings by user and item:
• a. Use PySpark's groupBy method to group the DataFrame by "user".
• b. Use PySpark's pivot method to pivot the "item" column and aggregate the
"rating" column using the "first" function.
• c. Use PySpark's orderBy method to sort the resulting DataFrame by "user".

• Split the filtered DataFrame into training and test sets:


• a. Use PySpark's rdd method to convert the DataFrame to an RDD.
• b. Use the filter method of the RDD to split the dataset by user ID.
• c. Convert the resulting RDD back to a DataFrame.
Task 3
• Create an instance of the ALS class for collaborative filtering:
• a. Specify the column names for the user ID, item ID, and rating.

• Fit the ALS model to the training data:


• a. Use the fit method of the ALS instance to fit the model to the training data.

• Compute the mean rating of the dataset:


• a. Use PySpark's select method to select the "rating" column and compute its mean
using mean.
• b. Use the first method to extract the mean rating value from the resulting DataFrame.
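A sketch of the ALS fit and the mean-rating computation, assuming train_df is the training split described above:

from pyspark.ml.recommendation import ALS
from pyspark.sql.functions import mean

als = ALS(userCol="user", itemCol="item", ratingCol="rating")
als_model = als.fit(train_df)

# Global mean rating, used later to fill missing predictions
mean_rating = ratings.select(mean("rating")).first()[0]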
Task 3
• Use the trained model to make predictions on the test data:
• a. Use the transform method of the ALS model to generate predictions for the test data.
• b. Cast the "prediction" column to a FloatType.
• c. Use the na.fill method to replace missing predictions with the mean rating value.

• Evaluate the performance of the model using mean squared error:


• a. Create an instance of the RegressionEvaluator class for computing mean squared error.
• b. Specify the column names for the label, prediction, and metric.
• c. Use the evaluate method of the evaluator instance to compute mean squared error on
the predictions.
• d. Print the mean squared error.
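A sketch of the prediction and evaluation steps, assuming test_df is the test split and mean_rating comes from the previous sketch:

from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql.functions import col
from pyspark.sql.types import FloatType

preds = als_model.transform(test_df)
preds = preds.withColumn("prediction", col("prediction").cast(FloatType()))

# Replace missing (NaN) predictions with the global mean rating
preds = preds.na.fill({"prediction": float(mean_rating)})

evaluator = RegressionEvaluator(labelCol="rating", predictionCol="prediction",
                                metricName="mse")
print("Test MSE:", evaluator.evaluate(preds))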
Task 3
• Filter the original DataFrame to include only the relevant user and item IDs:
• a. Use PySpark's filter method to keep only records where the "user" column is
between 71 and 74 and the "item" column is between 401 and 467.
• b. Use PySpark's select method to keep only the "user", "item", and "rating"
columns.

• Filter the predictions DataFrame to include only the relevant user and item IDs:
• a. Use PySpark's filter method to keep only records where the "user" column is
between 71 and 74 and the "item" column is between 401 and 467.
• b. Use PySpark's select method to keep only the "user", "item", and "prediction"
columns.
Task 3
• Join the two DataFrames together:
• a. Use PySpark's join method to join the two DataFrames on the "user" and "item"
columns.

• Evaluate the performance of the model on the relevant subset of the data using mean
squared error:
• a. Create an instance of the RegressionEvaluator class for computing mean squared error.
• b. Specify the column names for the label, prediction, and metric.
• c. Use the evaluate method of the evaluator instance to compute mean squared error on
the joined DataFrame.
• d. Print the mean squared error.
Task 4
• Load the stock price data from a CSV file into a PySpark DataFrame:
• a. Use PySpark's read.csv function to read the CSV file.
• b. Specify that the CSV file has a header row and let Spark infer the column types.

• Convert the "Ngay" column to a date type:


• a. Use PySpark's withColumn method to create a new column with the same name as the "Ngay"
column.
• b. Use PySpark's to_date function to convert the "Ngay" column to a date type.

• Create lagged columns for the HVN stock price:


• a. Use PySpark's Window function to order the DataFrame by date.
• b. Use PySpark's withColumn method to create new columns for the HVN stock price lagged by 1, 2, 3,
4, and 5 days using the lag function.
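A sketch of the loading and lag-feature steps; the file name, the "HVN" column name, and the date format are assumptions based on the description above:

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import to_date, lag, col

spark = SparkSession.builder.getOrCreate()

prices = spark.read.csv("stock_prices.csv", header=True, inferSchema=True)

# Parse the "Ngay" (date) column; the date pattern is an assumption
prices = prices.withColumn("Ngay", to_date(col("Ngay"), "dd/MM/yyyy"))

# Lag the HVN closing price by 1..5 days, ordered by date
w = Window.orderBy("Ngay")
for i in range(1, 6):
    prices = prices.withColumn("HVN_lag_" + str(i), lag(col("HVN"), i).over(w))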
Task 4
• Create a training DataFrame:
• a. Use PySpark's filter method to keep only records where the month of
"Ngay" is between January and June.
• b. Use PySpark's groupBy method to group the DataFrame by "Ngay".
• c. Use PySpark's agg method to compute the first lagged HVN stock prices
and the last HVN stock price for each group.
• d. Use PySpark's dropna method to remove any records with missing values.
• e. Use PySpark's VectorAssembler class to assemble the lagged HVN stock
prices into a single vector column.
• For the test DataFrame, do the same as for train_df but change the time period.
Task 4
• Scale the features using StandardScaler:
• a. Create an instance of the StandardScaler class with the input and output column names specified.
• b. Use the fit method of the scaler instance to fit the scaler to the training data.
• c. Use the transform method of the scaler instance to transform the training and test data.

• Define a linear regression model and its parameters:


• a. Create an instance of the LinearRegression class with the input and output column names
specified.
• b. Set the maximum number of iterations, regularization parameter, and elastic net parameter.

• Train the linear regression model on the training data:


• a. Use the fit method of the linear regression instance to fit the model to the training data.
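A sketch of the scaling and model-definition steps, assuming train_df and test_df already contain a "features" vector column and a "label" column; the hyperparameter values are placeholders:

from pyspark.ml.feature import StandardScaler
from pyspark.ml.regression import LinearRegression

scaler = StandardScaler(inputCol="features", outputCol="scaled_features")
scaler_model = scaler.fit(train_df)
train_scaled = scaler_model.transform(train_df)
test_scaled = scaler_model.transform(test_df)

lr = LinearRegression(featuresCol="scaled_features", labelCol="label",
                      maxIter=100, regParam=0.1, elasticNetParam=0.5)
lr_model = lr.fit(train_scaled)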
Task 4
• Make predictions on the test data:
• a. Use the transform method of the linear regression instance to generate predictions for the test data.
• b. Use the transform method of the linear regression instance to generate predictions for the training
data.

• Evaluate the performance of the model on the test data:


• a. Create an instance of the RegressionEvaluator class for computing mean squared error.
• b. Specify the column names for the label, prediction, and metric.
• c. Use the evaluate method of the evaluator instance to compute mean squared error on the
predictions.
• d. Print the mean squared error for both the test and train data.
• Finally, save the model to a file for later use (a sketch follows).
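A sketch of the prediction, evaluation, and save steps, continuing from the previous sketch (the model path is a placeholder):

from pyspark.ml.evaluation import RegressionEvaluator

test_pred = lr_model.transform(test_scaled)
train_pred = lr_model.transform(train_scaled)

evaluator = RegressionEvaluator(labelCol="label", predictionCol="prediction",
                                metricName="mse")
print("Test MSE:", evaluator.evaluate(test_pred))
print("Train MSE:", evaluator.evaluate(train_pred))

# Persist the trained model for later use
lr_model.save("hvn_linear_regression_model")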
Task 4
• Display the results with a plotting function:
Task 5
• Data processing
• The VectorAssembler function is used to combine the input columns into a single feature
vector column. The name_columns variable contains a list of column names that are used as
input to the VectorAssembler function.

• The select function is used to select the "features" and "_c0" or "_c196" columns from the
dataframes, depending on whether the dataset is from the original or SVD-preprocessed data.
The "features" column contains the combined feature vectors, while the "_c0" or "_c196"
column contains the corresponding labels for each data point.

• Finally, the withColumnRenamed function is used to rename the "_c0" or "_c196" column to
"label" for both the training and test datasets, in order to conform to the standard format for
classification problems.
Task 5
• Multi-layer Perceptron
• This step trains a Multilayer Perceptron (MLP) classifier on the MNIST handwritten digits dataset using PySpark's
MultilayerPerceptronClassifier class. The layers parameter specifies the number of neurons in each layer of the MLP,
while maxIter specifies the maximum number of iterations for training the model.

• The fit function is called on the training data to train the model, and then the trained model is used to make
predictions on the training and test datasets, as well as the SVD-preprocessed versions of these datasets.

• The show function is used to display the first five rows of the predictions for each of the datasets.

• The MulticlassClassificationEvaluator class is used to evaluate the accuracy of the model on the test and SVD-
preprocessed test datasets. The metricName parameter is set to "accuracy" to compute the accuracy score.

• Finally, the accuracy scores for the training and test datasets, as well as the SVD-preprocessed versions of these
datasets, are printed using the print function.
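A sketch of the MLP training and evaluation on one train/test pair, assuming "features" and "label" columns; the hidden-layer size and maxIter are placeholders:

from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# 784 inputs, one hidden layer, 10 output classes (digits 0-9)
mlp = MultilayerPerceptronClassifier(layers=[784, 128, 10], maxIter=100,
                                     featuresCol="features", labelCol="label")
mlp_model = mlp.fit(train_df)

pred_test = mlp_model.transform(test_df)
pred_test.show(5)

evaluator = MulticlassClassificationEvaluator(labelCol="label",
                                              predictionCol="prediction",
                                              metricName="accuracy")
print("Test accuracy:", evaluator.evaluate(pred_test))

For the SVD-preprocessed datasets, the input layer size would match the reduced dimensionality instead of 784.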
Task 5
• Random Forest
• This step trains a Random Forest classifier on the MNIST handwritten digits dataset using PySpark's RandomForestClassifier
class. The numTrees parameter specifies the number of decision trees in the forest, while maxDepth specifies the
maximum depth of each decision tree.

• The fit function is called on the training data to train the model, and then the trained model is used to make
predictions on the training and test datasets, as well as the SVD-preprocessed versions of these datasets.

• The show function is used to display the first five rows of the predictions for each of the datasets.

• The MulticlassClassificationEvaluator class is used to evaluate the accuracy of the model on the test and SVD-
preprocessed test datasets. The metricName parameter is set to "accuracy" to compute the accuracy score.

• Finally, the accuracy scores for the training and test datasets, as well as the SVD-preprocessed versions of these
datasets, are printed using the print function.
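A sketch of the Random Forest step under the same assumptions; the numTrees and maxDepth values are placeholders:

from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

rf = RandomForestClassifier(numTrees=50, maxDepth=10,
                            featuresCol="features", labelCol="label")
rf_model = rf.fit(train_df)

pred_test = rf_model.transform(test_df)
pred_test.show(5)

evaluator = MulticlassClassificationEvaluator(labelCol="label",
                                              predictionCol="prediction",
                                              metricName="accuracy")
print("Test accuracy:", evaluator.evaluate(pred_test))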
Task 5
• Support Vector Machine
• This step trains a Linear Support Vector Machine (LinearSVM) classifier on the MNIST handwritten digits
dataset using PySpark's LinearSVC class, and uses the One-vs-Rest (OvR) strategy for multi-class
classification.
• The maxIter parameter specifies the maximum number of iterations for training the model, while
regParam specifies the regularization parameter.
• A Pipeline is created to encapsulate the OvR classifier and fit the model to the training data.
• The trained model is then used to make predictions on the training and test datasets, as well as the
SVD-preprocessed versions of these datasets.
• The MulticlassClassificationEvaluator class is used to evaluate the accuracy of the model on the test
and SVD-preprocessed test datasets. The metricName parameter is set to "accuracy" to compute
the accuracy score.

• Finally, the accuracy scores for the training and test datasets, as well as the SVD-preprocessed
versions of these datasets, are printed using the print function.
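A sketch of the LinearSVC + One-vs-Rest pipeline under the same assumptions; maxIter and regParam are placeholders:

from pyspark.ml import Pipeline
from pyspark.ml.classification import LinearSVC, OneVsRest
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

svm = LinearSVC(maxIter=50, regParam=0.01)
ovr = OneVsRest(classifier=svm, featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[ovr])
svm_model = pipeline.fit(train_df)

pred_test = svm_model.transform(test_df)
evaluator = MulticlassClassificationEvaluator(labelCol="label",
                                              predictionCol="prediction",
                                              metricName="accuracy")
print("Test accuracy:", evaluator.evaluate(pred_test))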
Assignment Sheet

Task | Trung Kiên | Thành Lộc
Task 1 | | Completed
Task 2 | Completed |
Task 3 | | Completed
Task 4 | Completed |
Task 5 | | Completed
Task 6 | Completed |
