
FINAL PROJECT

Course: Mining Massive Datasets


Members

Student ID | Full name | Email | Assigned tasks | Completion
519H0310 | Trần Lê Thành Lộc | [email protected] | Tasks 1, 3, 5 | 100%
519H0306 | Trần Trung Kiên | [email protected] | Tasks 2, 4, 6 | 100%



Task 1
YC1_1: Illustration
Solution:
• First, create a VectorAssembler() instance with the input columns specified by the name_columns list and the output column named "features". VectorAssembler combines multiple columns into a single vector column, so each image becomes one 784-element feature vector (later reshaped into a 28 x 28 matrix for display).
• Apply the assembler to the df DataFrame to create a new DataFrame df_features (a sketch follows).
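A minimal sketch of these two steps, assuming an existing DataFrame df and the column-name list name_columns mentioned above:

from pyspark.ml.feature import VectorAssembler

# Combine the individual pixel columns into a single "features" vector column
assembler = VectorAssembler(inputCols=name_columns, outputCol="features")
df_features = assembler.transform(df)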
Task 1
• Then use select() to pick the required columns from the DataFrame.
• A show() call displays the resulting df_features DataFrame.
• To display an image, the reshape_image_vector function takes a vector of image data, reshapes it into a 28 x 28 matrix with NumPy, converts the resulting matrix to a list, and returns it.
• df.take(15) takes the first 15 rows of features and labels from the DataFrame.
Task 1
• Finally, create a figure with 3 rows and 5 columns of subplots using Matplotlib's plt.subplots, display the first 15 images of the DataFrame in grayscale, and title each subplot with the corresponding label (see the sketch below).
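A sketch of the reshape-and-display steps described above; the "label" column name is an assumption:

import numpy as np
import matplotlib.pyplot as plt

def reshape_image_vector(vec):
    # Reshape a 784-element image vector into a 28 x 28 matrix and return it as a list
    return np.array(vec.toArray()).reshape(28, 28).tolist()

rows = df_features.select("features", "label").take(15)

# 3 x 5 grid of grayscale images, each titled with its label
fig, axes = plt.subplots(3, 5, figsize=(10, 6))
for ax, row in zip(axes.flat, rows):
    ax.imshow(reshape_image_vector(row["features"]), cmap="gray")
    ax.set_title(str(row["label"]))
    ax.axis("off")
plt.show()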
Task 1
• Output:
Task 1
YC1_2: Clustering
• Define a list set_clusters containing the number of clusters to use in
the k-means clustering algorithm.

• Define an empty list set_model and an empty list sumOfDis.


Task 1
• For each value of k in set_clusters (a sketch of the loop follows the list):
- Create a KMeans instance with k clusters and a specified random seed.
- Fit the KMeans model to the df_features DataFrame.
- Print the distance measure used by the model.
- Add a new column to the DataFrame with the predicted cluster labels.
- Append the trained KMeans model to the set_model list.
- Print the number of images in each cluster.
- For each cluster, print the first 5 samples in the cluster.
- Get the centroids of the model.
- Compute the distance from each sample to its assigned centroid, and sum the resulting
distances.
- Append the resulting sum to the sumOfDis list.
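A sketch of this loop, assuming df_features from YC1_1; the k values in set_clusters and the seed are placeholders:

import numpy as np
from pyspark.ml.clustering import KMeans
from pyspark.sql.functions import col

set_clusters = [5, 10, 15]   # placeholder k values
set_model, sumOfDis = [], []

for k in set_clusters:
    kmeans = KMeans(k=k, seed=1, featuresCol="features")
    print(kmeans.getDistanceMeasure())           # distance measure used ("euclidean" by default)
    model = kmeans.fit(df_features)
    predictions = model.transform(df_features)   # adds a "prediction" column with cluster labels
    set_model.append(model)

    predictions.groupBy("prediction").count().show()            # number of images per cluster
    for c in range(k):
        predictions.filter(col("prediction") == c).show(5)      # first 5 samples of each cluster

    # Sum of Euclidean distances from each sample to its assigned centroid
    centers = model.clusterCenters()
    total = (predictions.select("features", "prediction").rdd
             .map(lambda r: float(np.linalg.norm(r["features"].toArray() - centers[r["prediction"]])))
             .sum())
    sumOfDis.append(total)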
Task 1

• When running the code:


• The resulting set_model list contains the trained k-means models for
different values of k.
• The resulting sumOfDis list contains the sum of distances between
each sample and its assigned centroid for different values of k.
Task 1
Save and load model
• After that, set a variable count to 0.

• For each model in the set_model list:


- Save the model to a file named "model" + the current value of set_clusters[count].
- Increment the count variable by 1.

• Define an empty list set_model_load and set the count variable back to 0.

• For each value of k in set_clusters:


- Load the saved model from the file "model" + the current value of k, then append the loaded model to the set_model_load list and increment the count variable by 1.

• The resulting set_model_load list contains the trained k-means models loaded from saved files.
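A sketch of the save/load step, continuing from the loop above; the "model" + k file-name pattern follows the description:

from pyspark.ml.clustering import KMeansModel

# Save each trained model under a name that encodes its k value
count = 0
for model in set_model:
    model.save("model" + str(set_clusters[count]))
    count += 1

# Load the models back from disk in the same order
set_model_load = []
count = 0
for k in set_clusters:
    set_model_load.append(KMeansModel.load("model" + str(k)))
    count += 1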
Task 1
YC1_3: Result Visualization
Create a bar chart of the sum of distances for each value of k:
• Use Matplotlib's bar function to create a bar chart with the set_clusters list as
the x-axis and the sumOfDis list as the y-axis.
• Use Matplotlib's xticks function to set the x-tick labels to the values in
set_clusters.
• Use Matplotlib's yticks function to set the y-tick labels to the values in sumOfDis.
• Use Matplotlib's xlabel function to set the x-axis label to "k value".
• Use Matplotlib's ylabel function to set the y-axis label to "Summation of
Euclidean distances".
• Use Matplotlib's show function to display the resulting bar chart.
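A sketch of the chart described above, assuming set_clusters and sumOfDis from YC1_2:

import matplotlib.pyplot as plt

plt.bar(set_clusters, sumOfDis)
plt.xticks(set_clusters)
plt.yticks(sumOfDis)
plt.xlabel("k value")
plt.ylabel("Summation of Euclidean distances")
plt.show()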
Task 1
• Outputs:
Task 2
• YC2_1: Dimensionality reduction in the training set
• Load the MNIST training dataset from a CSV file into a PySpark DataFrame.
a. Use PySpark's read.csv function to read the CSV file.
b. Specify that the CSV file has no header row and let Spark infer the column types.

• Convert the DataFrame into an RDD of dense vectors:


a. Use the rdd method to convert the DataFrame to an RDD.
b. Use a lambda function to extract the pixel values from each row and convert them to a dense
vector.

• Create a RowMatrix from the RDD of dense vectors:


a. Use the RowMatrix class to create a RowMatrix from the RDD.
Task 2
• Compute the singular value decomposition (SVD) of the RowMatrix:
a. Use the computeSVD method of the RowMatrix to compute the SVD.
b. Specify the number of singular values to keep and whether or not to compute the
right singular vectors.

• Transform the RowMatrix using the SVD:


a. Use the U attribute of the SVD, which holds the transformed rows as a distributed RowMatrix.
b. Use the rows attribute of that RowMatrix to extract the rows as an RDD of DenseVectors.
c. Use a lambda function to convert each DenseVector to a list of its values.
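A minimal sketch of the steps so far (loading, building the RowMatrix, and computing the SVD). The file name is a placeholder, and k = 196 singular values is an assumption suggested by the "_c196" label column mentioned in Task 5:

from pyspark.sql import SparkSession
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix

spark = SparkSession.builder.getOrCreate()

# Read the MNIST training CSV (no header), letting Spark infer column types
df_train = spark.read.csv("mnist_train.csv", header=False, inferSchema=True)

# Assuming the first column is the label, keep the 784 pixel values as dense vectors
pixels = df_train.rdd.map(lambda row: Vectors.dense(row[1:]))

mat = RowMatrix(pixels)

# Keep k singular values and compute U so the rows can be projected
svd = mat.computeSVD(196, computeU=True)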
Task 2
• Convert the RDD of transformed rows to a DataFrame:
a. Use PySpark's createDataFrame function to create a DataFrame from the RDD.
b. Use a list comprehension to create column names for the DataFrame.

• Add a row index to the DataFrame:


a. Use PySpark's withColumn method to add a new column to the DataFrame containing a
monotonically increasing ID.

• Save the transformed DataFrame to a CSV file:


a. Use the coalesce method to reduce the number of partitions to 1.
b. Use PySpark's write.csv method to save the DataFrame to a CSV file.
c. Specify that the output file should include a header row.
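Continuing the sketch above: turn the projected rows into a DataFrame, add a row index, and write a single CSV file (column and file names are placeholders):

from pyspark.sql.functions import monotonically_increasing_id

# svd.U is a distributed RowMatrix; convert each DenseVector row to a plain list
proj_rows = svd.U.rows.map(lambda v: v.toArray().tolist())

col_names = ["c_" + str(i) for i in range(196)]
df_svd = spark.createDataFrame(proj_rows, col_names)

# Add a monotonically increasing row index
df_svd = df_svd.withColumn("id", monotonically_increasing_id())

# Reduce to one partition and save with a header row
df_svd.coalesce(1).write.csv("mnist_train_svd", header=True)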
Task 2
• YC2_2: Dimensionality reduction in the test set
• Do the same as YC2_1, but using the test set.

• Output:
• CSV files for the train and test sets
Task 3
Recommendation with Collaborative Filtering
• Load the ratings dataset from a CSV file into a PySpark DataFrame.
• a. Use PySpark's read.csv function to read the CSV file.
• b. Specify that the CSV file has a header row and let Spark infer the column types.

• Filter the DataFrame to remove records with missing ratings:


• a. Use PySpark's select method to select only the "user", "item", and "rating"
columns.
• b. Use PySpark's filter method to remove records with missing ratings.
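A sketch of the loading and filtering steps, assuming a ratings file with "user", "item", and "rating" columns (the file name is a placeholder):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

ratings = spark.read.csv("ratings.csv", header=True, inferSchema=True)

# Keep only the three relevant columns and drop rows with missing ratings
ratings = (ratings.select("user", "item", "rating")
                  .filter(col("rating").isNotNull()))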
Task 3
• Filter the DataFrame to keep only the first 75 users and 468 items
• Pivot the filtered DataFrame to create a matrix of ratings by user and item:
• a. Use PySpark's groupBy method to group the DataFrame by "user".
• b. Use PySpark's pivot method to pivot the "item" column and aggregate the
"rating" column using the "first" function.
• c. Use PySpark's orderBy method to sort the resulting DataFrame by "user".

• Split the filtered DataFrame into training and test sets:


• a. Use PySpark's rdd method to convert the DataFrame to an RDD.
• b. Use the filter method of the RDD to split the dataset by user ID.
• c. Convert the resulting RDD back to a DataFrame.
Task 3
• Create an instance of the ALS class for collaborative filtering:
• a. Specify the column names for the user ID, item ID, and rating.

• Fit the ALS model to the training data:


• a. Use the fit method of the ALS instance to fit the model to the training data.

• Compute the mean rating of the dataset:


• a. Use PySpark's select method to select the "rating" column and compute its mean
using mean.
• b. Use the first method to extract the mean rating value from the resulting DataFrame.
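A sketch of the ALS fit and the mean-rating computation, assuming train_df is the training split described above:

from pyspark.ml.recommendation import ALS
from pyspark.sql.functions import mean

als = ALS(userCol="user", itemCol="item", ratingCol="rating")
als_model = als.fit(train_df)

# Global mean rating, used later to fill missing predictions
mean_rating = ratings.select(mean("rating")).first()[0]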
Task 3
• Use the trained model to make predictions on the test data:
• a. Use the transform method of the ALS model to generate predictions for the test data.
• b. Cast the "prediction" column to a FloatType.
• c. Use the na.fill method to replace missing predictions with the mean rating value.

• Evaluate the performance of the model using mean squared error:


• a. Create an instance of the RegressionEvaluator class for computing mean squared error.
• b. Specify the column names for the label, prediction, and metric.
• c. Use the evaluate method of the evaluator instance to compute mean squared error on
the predictions.
• d. Print the mean squared error.
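A sketch of the prediction and evaluation steps, assuming test_df is the test split and mean_rating comes from the previous sketch:

from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql.functions import col
from pyspark.sql.types import FloatType

preds = als_model.transform(test_df)
preds = preds.withColumn("prediction", col("prediction").cast(FloatType()))

# Replace missing (NaN) predictions with the global mean rating
preds = preds.na.fill({"prediction": float(mean_rating)})

evaluator = RegressionEvaluator(labelCol="rating", predictionCol="prediction",
                                metricName="mse")
print("Test MSE:", evaluator.evaluate(preds))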
Task 3
• Filter the original DataFrame to include only the relevant user and item IDs:
• a. Use PySpark's filter method to keep only records where the "user" column is
between 71 and 74 and the "item" column is between 401 and 467.
• b. Use PySpark's select method to keep only the "user", "item", and "rating"
columns.

• Filter the predictions DataFrame to include only the relevant user and item IDs:
• a. Use PySpark's filter method to keep only records where the "user" column is
between 71 and 74 and the "item" column is between 401 and 467.
• b. Use PySpark's select method to keep only the "user", "item", and "prediction"
columns.
Task 3
• Join the two DataFrames together:
• a. Use PySpark's join method to join the two DataFrames on the "user" and "item"
columns.

• Evaluate the performance of the model on the relevant subset of the data using mean
squared error:
• a. Create an instance of the RegressionEvaluator class for computing mean squared error.
• b. Specify the column names for the label, prediction, and metric.
• c. Use the evaluate method of the evaluator instance to compute mean squared error on
the joined DataFrame.
• d. Print the mean squared error.
Task 4
• Load the stock price data from a CSV file into a PySpark DataFrame:
• a. Use PySpark's read.csv function to read the CSV file.
• b. Specify that the CSV file has a header row and let Spark infer the column types.

• Convert the "Ngay" column to a date type:


• a. Use PySpark's withColumn method to create a new column with the same name as the "Ngay"
column.
• b. Use PySpark's to_date function to convert the "Ngay" column to a date type.

• Create lagged columns for the HVN stock price:


• a. Use PySpark's Window function to order the DataFrame by date.
• b. Use PySpark's withColumn method to create new columns for the HVN stock price lagged by 1, 2, 3,
4, and 5 days using the lag function.
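A sketch of the loading and lag-feature steps; the file name, the "HVN" column name, and the date format are assumptions based on the description above:

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import to_date, lag, col

spark = SparkSession.builder.getOrCreate()

prices = spark.read.csv("stock_prices.csv", header=True, inferSchema=True)

# Parse the "Ngay" (date) column; the date pattern is an assumption
prices = prices.withColumn("Ngay", to_date(col("Ngay"), "dd/MM/yyyy"))

# Lag the HVN closing price by 1..5 days, ordered by date
w = Window.orderBy("Ngay")
for i in range(1, 6):
    prices = prices.withColumn("HVN_lag_" + str(i), lag(col("HVN"), i).over(w))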
Task 4
• Create a training DataFrame:
• a. Use PySpark's filter method to keep only records where the month of
"Ngay" is between January and June.
• b. Use PySpark's groupBy method to group the DataFrame by "Ngay".
• c. Use PySpark's agg method to compute the first lagged HVN stock prices
and the last HVN stock price for each group.
• d. Use PySpark's dropna method to remove any records with missing values.
• e. Use PySpark's VectorAssembler class to assemble the lagged HVN stock
prices into a single vector column.
• For the test DataFrame, do the same as for train_df but change the time period.
Task 4
• Scale the features using StandardScaler:
• a. Create an instance of the StandardScaler class with the input and output column names specified.
• b. Use the fit method of the scaler instance to fit the scaler to the training data.
• c. Use the transform method of the scaler instance to transform the training and test data.

• Define a linear regression model and its parameters:


• a. Create an instance of the LinearRegression class with the input and output column names
specified.
• b. Set the maximum number of iterations, regularization parameter, and elastic net parameter.

• Train the linear regression model on the training data:


• a. Use the fit method of the linear regression instance to fit the model to the training data.
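A sketch of the scaling and model-definition steps, assuming train_df and test_df already contain a "features" vector column and a "label" column; the hyperparameter values are placeholders:

from pyspark.ml.feature import StandardScaler
from pyspark.ml.regression import LinearRegression

scaler = StandardScaler(inputCol="features", outputCol="scaled_features")
scaler_model = scaler.fit(train_df)
train_scaled = scaler_model.transform(train_df)
test_scaled = scaler_model.transform(test_df)

lr = LinearRegression(featuresCol="scaled_features", labelCol="label",
                      maxIter=100, regParam=0.1, elasticNetParam=0.5)
lr_model = lr.fit(train_scaled)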
Task 4
• Make predictions on the test data:
• a. Use the transform method of the linear regression instance to generate predictions for the test data.
• b. Use the transform method of the linear regression instance to generate predictions for the training
data.

• Evaluate the performance of the model on the test data:


• a. Create an instance of the RegressionEvaluator class for computing mean squared error.
• b. Specify the column names for the label, prediction, and metric.
• c. Use the evaluate method of the evaluator instance to compute mean squared error on the
predictions.
• d. Print the mean squared error for both the test and train data.
• Finally, save the model to a file for later use (a sketch follows).
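A sketch of the prediction, evaluation, and save steps, continuing from the previous sketch (the model path is a placeholder):

from pyspark.ml.evaluation import RegressionEvaluator

test_pred = lr_model.transform(test_scaled)
train_pred = lr_model.transform(train_scaled)

evaluator = RegressionEvaluator(labelCol="label", predictionCol="prediction",
                                metricName="mse")
print("Test MSE:", evaluator.evaluate(test_pred))
print("Train MSE:", evaluator.evaluate(train_pred))

# Persist the trained model for later use
lr_model.save("hvn_linear_regression_model")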
Task 4
• Display the results with a plotting function:
Task 5
• Data processing
• The VectorAssembler function is used to combine the input columns into a single feature
vector column. The name_columns variable contains a list of column names that are used as
input to the VectorAssembler function.

• The select function is used to select the "features" and "_c0" or "_c196" columns from the
dataframes, depending on whether the dataset is from the original or SVD-preprocessed data.
The "features" column contains the combined feature vectors, while the "_c0" or "_c196"
column contains the corresponding labels for each data point.

• Finally, the withColumnRenamed function is used to rename the "_c0" or "_c196" column to
"label" for both the training and test datasets, in order to conform to the standard format for
classification problems.
Task 5
• Multi-layer Perceptron
• This step trains a Multilayer Perceptron (MLP) classifier on the MNIST handwritten digits dataset using PySpark's
MultilayerPerceptronClassifier class. The layers parameter specifies the number of neurons in each layer of the MLP,
while maxIter specifies the maximum number of iterations for training the model.

• The fit function is called on the training data to train the model, and then the trained model is used to make
predictions on the training and test datasets, as well as the SVD-preprocessed versions of these datasets.

• The show function is used to display the first five rows of the predictions for each of the datasets.

• The MulticlassClassificationEvaluator class is used to evaluate the accuracy of the model on the test and SVD-
preprocessed test datasets. The metricName parameter is set to "accuracy" to compute the accuracy score.

• Finally, the accuracy scores for the training and test datasets, as well as the SVD-preprocessed versions of these
datasets, are printed using the print function.
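A sketch of the MLP training and evaluation on one train/test pair, assuming "features" and "label" columns; the hidden-layer size and maxIter are placeholders:

from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# 784 inputs, one hidden layer, 10 output classes (digits 0-9)
mlp = MultilayerPerceptronClassifier(layers=[784, 128, 10], maxIter=100,
                                     featuresCol="features", labelCol="label")
mlp_model = mlp.fit(train_df)

pred_test = mlp_model.transform(test_df)
pred_test.show(5)

evaluator = MulticlassClassificationEvaluator(labelCol="label",
                                              predictionCol="prediction",
                                              metricName="accuracy")
print("Test accuracy:", evaluator.evaluate(pred_test))

For the SVD-preprocessed datasets, the input layer size would match the reduced dimensionality instead of 784.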
Task 5
• Random Forest
• This step trains a Random Forest classifier on the MNIST handwritten digits dataset using PySpark's RandomForestClassifier
class. The numTrees parameter specifies the number of decision trees in the forest, while maxDepth specifies the
maximum depth of each decision tree.

• The fit function is called on the training data to train the model, and then the trained model is used to make
predictions on the training and test datasets, as well as the SVD-preprocessed versions of these datasets.

• The show function is used to display the first five rows of the predictions for each of the datasets.

• The MulticlassClassificationEvaluator class is used to evaluate the accuracy of the model on the test and SVD-
preprocessed test datasets. The metricName parameter is set to "accuracy" to compute the accuracy score.

• Finally, the accuracy scores for the training and test datasets, as well as the SVD-preprocessed versions of these
datasets, are printed using the print function.
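A sketch of the Random Forest step under the same assumptions; the numTrees and maxDepth values are placeholders:

from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

rf = RandomForestClassifier(numTrees=50, maxDepth=10,
                            featuresCol="features", labelCol="label")
rf_model = rf.fit(train_df)

pred_test = rf_model.transform(test_df)
pred_test.show(5)

evaluator = MulticlassClassificationEvaluator(labelCol="label",
                                              predictionCol="prediction",
                                              metricName="accuracy")
print("Test accuracy:", evaluator.evaluate(pred_test))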
Task 5
• Support Vector Machine
• This step trains a Linear Support Vector Machine (LinearSVM) classifier on the MNIST handwritten digits
dataset using PySpark's LinearSVC class, and uses the One-vs-Rest (OvR) strategy for multi-class
classification.
• The maxIter parameter specifies the maximum number of iterations for training the model, while
regParam specifies the regularization parameter.
• A Pipeline is created to encapsulate the OvR classifier and fit the model to the training data.
• The trained model is then used to make predictions on the training and test datasets, as well as the
SVD-preprocessed versions of these datasets.
• The MulticlassClassificationEvaluator class is used to evaluate the accuracy of the model on the test
and SVD-preprocessed test datasets. The metricName parameter is set to "accuracy" to compute
the accuracy score.

• Finally, the accuracy scores for the training and test datasets, as well as the SVD-preprocessed
versions of these datasets, are printed using the print function.
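A sketch of the LinearSVC + One-vs-Rest pipeline under the same assumptions; maxIter and regParam are placeholders:

from pyspark.ml import Pipeline
from pyspark.ml.classification import LinearSVC, OneVsRest
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

svm = LinearSVC(maxIter=50, regParam=0.01)
ovr = OneVsRest(classifier=svm, featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[ovr])
svm_model = pipeline.fit(train_df)

pred_test = svm_model.transform(test_df)
evaluator = MulticlassClassificationEvaluator(labelCol="label",
                                              predictionCol="prediction",
                                              metricName="accuracy")
print("Test accuracy:", evaluator.evaluate(pred_test))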
Assignment Sheet

Task | Trung Kiên | Thành Lộc
Task 1 | | Completed
Task 2 | Completed |
Task 3 | | Completed
Task 4 | Completed |
Task 5 | | Completed
Task 6 | Completed |
