Setup: Step-by-Step Instructions for How to Create the Solutions File from the Code Provided

This document provides step-by-step instructions for building the machine learning models used to produce the final predictions. It involves preprocessing the data to generate feature sets, training individual models using libraries such as liblinear and libsvm, combining the models into ensembles, and generating predictions on the test data to create the final submission file. The process uses Java and R code to extract features, split the data, train models, make predictions, and evaluate results across multiple folds of cross-validation.


Contents:

  Setup
  Prerequisites
  Compile Java Code
  Generate Top Features Using the Mutual Information Criterion
  Generate the seven data sets required for liblinear/libsvm models
  Build Individual Models and Predict for Test Data
    Liblinear Models
    Libsvm Models
    Naive Bayes model #10
    Weighted k-NN models (11) and (12)
    Multinomial Naive Bayes model #20
    Weighted k-NN models (21) and (22)
  Train the Two Ensembles
  Predict for Test Data Using the Two Ensembles
  Building the Final Submission File

Setup
EMC_ROOT refers to the directory containing the Java/R code and data.
(i) Extract emc.zip to <EMC_ROOT>.
(ii) Copy all data files (the files provided by Kaggle/EMC) to the <EMC_ROOT>/data directory.
(iii) In com.emc.util.Config.java, set the value of the EMC_ROOT field appropriately.
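As an illustration, on a Unix-like system steps (i) and (ii) might look like the following sketch; the source path for the Kaggle/EMC files is a placeholder, and the extraction target may need adjusting depending on how emc.zip is structured:

$ unzip emc.zip -d <EMC_ROOT>
$ mkdir -p <EMC_ROOT>/data
$ cp /path/to/kaggle_emc_files/* <EMC_ROOT>/data/    # copy the files provided by Kaggle/EMC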

Prerequisites
(1) JDK 1.6 or higher
(2) Java's Ant build tool
(3) R version 2.14 or higher
(4) R's glmnet package
(5) liblinear (installed)
(6) libsvm (installed)
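A rough sketch of setting up prerequisites (4)-(6) on a Linux machine is shown below; the source-tree paths are placeholders, and liblinear/libsvm are built from their source distributions with make:

# R's glmnet package
$ Rscript -e "install.packages('glmnet', repos='https://cran.r-project.org')"
# liblinear: make produces the train and predict binaries used later
$ cd /path/to/liblinear && make
# libsvm: make produces the svm-train and svm-predict binaries used later
$ cd /path/to/libsvm && make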

Compile Java Code


$ cd <EMC_ROOT>/java
$ ant

After running ant, the <EMC_ROOT>/java/classes directory will contain the compiled Java classes.

Generate Top Features Using the Mutual Information Criterion


Run the Java class com.emc.featureSelection.MutualInformationBasedFeatureSelector:

$ cd <EMC_ROOT>/java
$ java -Xmx1024m -cp classes:lib/commons-math-2.2.jar com.emc.featureSelection.MutualInformationBasedFeatureSelector

Generate the seven data sets required for liblinear/libsvm models:


(1) Run the following Java classes to generate the training and cross-validation files:
    i.   com.emc.liblinear.LiblinearFileGen
    ii.  com.emc.liblinear.LibLinear_TF_IDF__minTermCount
    iii. com.emc.liblinear.LiblinearFileGen__minTermCount
    iv.  com.emc.liblinear.LibLinear_TF_IDF__TfEqualOne_minTermCount
    v.   com.emc.liblinear.LibLinear_TF_IDF__featureSelection
    vi.  com.emc.liblinear.LibLinear_raw__featureSelection
    vii. com.emc.liblinear.LibLinear_TF_IDF__TfEqualOne__featureSelection

    Example:
    $ cd <EMC_ROOT>/java
    $ java -Xmx1024m -cp classes:lib/commons-math-2.2.jar com.emc.liblinear.LiblinearFileGen

    It will generate 11 files:
    -> five files for training [one for each fold of the 5-fold cross-validation];
       each file contains 80% of the training data; file name: *_tr_[1-5].csv
    -> five files for cross-validation [one for each fold of the 5-fold cross-validation];
       each file contains 20% of the training data; file name: *_cv_[1-5].csv
    -> one file containing the entire training data; file name: *_tr_-1.csv

(2) Run the following Java classes to generate the test files:
    i.   com.emc.liblinear.LiblinearFileGen_test
    ii.  com.emc.liblinear.LibLinear_TF_IDF__minTermCount_test
    iii. com.emc.liblinear.LiblinearFileGen__minTermCount_test
    iv.  com.emc.liblinear.LibLinear_TF_IDF__TfEqualOne_minTermCount_test
    v.   com.emc.liblinear.LibLinear_TF_IDF__featureSelection__test
    vi.  com.emc.liblinear.LibLinear_raw__featureSelection__test
    vii. com.emc.liblinear.LibLinear_TF_IDF__TfEqualOne__featureSelection__test

    Example:
    $ cd <EMC_ROOT>/java
    $ java -Xmx1024m -cp classes:lib/commons-math-2.2.jar com.emc.liblinear.LiblinearFileGen_test

    It will generate one file:
    -> one file for the test data; file name: *_test.csv
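If you want to run all fourteen generator classes in one pass, a small shell loop such as the sketch below simply repeats the example command once per class name listed above; it adds nothing beyond the commands already shown:

$ cd <EMC_ROOT>/java
$ for cls in \
    LiblinearFileGen \
    LibLinear_TF_IDF__minTermCount \
    LiblinearFileGen__minTermCount \
    LibLinear_TF_IDF__TfEqualOne_minTermCount \
    LibLinear_TF_IDF__featureSelection \
    LibLinear_raw__featureSelection \
    LibLinear_TF_IDF__TfEqualOne__featureSelection \
    LiblinearFileGen_test \
    LibLinear_TF_IDF__minTermCount_test \
    LiblinearFileGen__minTermCount_test \
    LibLinear_TF_IDF__TfEqualOne_minTermCount_test \
    LibLinear_TF_IDF__featureSelection__test \
    LibLinear_raw__featureSelection__test \
    LibLinear_TF_IDF__TfEqualOne__featureSelection__test; do
      # each class writes its training/cross-validation or test CSV files as described above
      java -Xmx1024m -cp classes:lib/commons-math-2.2.jar com.emc.liblinear.$cls
  done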

Build Individual Models and Predict for Test Data


Liblinear Models
Build each liblinear model described in the Model table in the emc_my_solution document by following the steps outlined below.

Concrete steps for building models:

Step (1): Train using liblinear's train command.
Example:
./train -s 1 -c 1 <EMC_ROOT>/outputFiles/liblinear/liblinear_tr_<cvf>.csv liblinear_tr_<cvf>__s-1_c-1.model
-> Use the training files corresponding to the model's data set. Each model's data set can be found in the Model table in the emc_my_solution document. (These files were generated in the prior section "Generate the seven data sets required for liblinear/libsvm models".)
-> Use the model parameters from the Model table in the emc_my_solution document.
-> Perform this step 5 times [once for each fold (<cvf> = 1 to 5)].

Step (2): Predict for the cross-validation data, using liblinear's predict command.
Example:
./predict -b 1 <EMC_ROOT>/outputFiles/liblinear/liblinear_cv_<cvf>.csv liblinear_tr_<cvf>__s-1_c-1.model liblinear_tr_<cvf>__s-1_c-1.out
-> Use the cross-validation files corresponding to the model's data set. Each model's data set can be found in the Model table in the emc_my_solution document. (These files were generated in the prior section "Generate the seven data sets required for liblinear/libsvm models".)
-> Perform this step 5 times [once for each fold (<cvf> = 1 to 5)].

Step (3): Run the com.emc.liblinear.LiblinearCalibration Java class to generate the output CSV file. This class reads the prediction files from step (2) and generates the output CSV. Before running this class, make the following two changes to LiblinearCalibration.java:
(i) Modify line 107 to point to the liblinear prediction files from step (2):
    BufferedReader br = new BufferedReader(new FileReader("..."));
(ii) Modify line 37 so that the file name matches the model's file name in emc_train_e9b.r [one of the data<model_number> variables]:
    fileWriter = new FileWriter("...");
$ cd <EMC_ROOT>/java
$ ant
$ java -Xmx1024m -cp classes:lib/commons-math-2.2.jar com.emc.liblinear.LiblinearCalibration

Concrete steps for generating predictions for test data:

Step (1): Train using liblinear's train command.
Example:
./train -s 1 -c 1 <EMC_ROOT>/outputFiles/liblinear/liblinear_tr_-1.csv liblinear_tr_-1__s-1_c-1.model
-> Use the training file containing the entire training data (*_tr_-1.csv) corresponding to the model's data set. Each model's data set can be found in the Model table in the emc_my_solution document. (These files were generated in the prior section "Generate the seven data sets required for liblinear/libsvm models".)
-> Use the model parameters from the Model table in the emc_my_solution document.

Step (2): Predict for the test data, using liblinear's predict command.
Example:
./predict -b 1 <EMC_ROOT>/outputFiles/liblinear/liblinear_test.csv liblinear_tr_-1__s-1_c-1.model liblinear_tr_-1__s-1_c-1.out
-> Use the test file corresponding to the model's data set. Each model's data set can be found in the Model table in the emc_my_solution document. (These files were generated in the prior section "Generate the seven data sets required for liblinear/libsvm models".)

Step (3): Run the com.emc.liblinear.LiblinearExp_subm Java class to generate the output CSV file. This class reads the prediction files from step (2) and generates the output CSV. Before running this class, make the following two changes:
(i) Modify line 24 to point to the liblinear prediction files from step (2):
    BufferedReader br = new BufferedReader(new FileReader("..."));
(ii) Modify line 34 so that the file name matches the model's file name in either emc_sub_e9b.r or emc_sub_e9a_FS-10k-all-15.r [one of the data<model_number> variables]:
    FileWriter fileWriter = new FileWriter("...");
$ cd <EMC_ROOT>/java
$ ant
$ java -Xmx1024m -cp classes:lib/commons-math-2.2.jar com.emc.liblinear.LiblinearExp_subm
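As a convenience, steps (1) and (2) of the cross-validation procedure for a single liblinear model can be wrapped in a fold loop. The sketch below uses the -s 1 -c 1 parameters and file names from the example above; replace them with the values from the Model table for each model, and note that the liblinear directory is a placeholder:

$ cd /path/to/liblinear    # placeholder: directory containing liblinear's train and predict binaries
$ for cvf in 1 2 3 4 5; do
      ./train -s 1 -c 1 <EMC_ROOT>/outputFiles/liblinear/liblinear_tr_${cvf}.csv liblinear_tr_${cvf}__s-1_c-1.model
      ./predict -b 1 <EMC_ROOT>/outputFiles/liblinear/liblinear_cv_${cvf}.csv liblinear_tr_${cvf}__s-1_c-1.model liblinear_tr_${cvf}__s-1_c-1.out
  done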

Libsvm Models
The steps for libsvm are the same as for liblinear, except for the different commands used for training and prediction. For completeness, the libsvm steps are documented here. Build each libsvm model described in the Model table in the emc_my_solution document by following the steps outlined below.

Concrete steps for building models:

Step (1): Train using libsvm's svm-train command.
Example:
./svm-train -m 2000 -c 8 -g 0.1 -b 1 <EMC_ROOT>/outputFiles/liblinear/raw_minTermCount-3_tr_<cvf>.csv raw_minTermCount-3_tr_<cvf>__c-8_g-0.1.model
-> Use the training files corresponding to the model's data set. Each model's data set can be found in the Model table in the emc_my_solution document. (These files were generated in the prior section "Generate the seven data sets required for liblinear/libsvm models".)
-> Use the model parameters from the Model table in the emc_my_solution document.
-> Perform this step 5 times [once for each fold (<cvf> = 1 to 5)].

Step (2): Predict for the cross-validation data, using libsvm's svm-predict command.
Example:
./svm-predict -b 1 <EMC_ROOT>/outputFiles/liblinear/raw_minTermCount-3_cv_<cvf>.csv raw_minTermCount-3_tr_<cvf>__c-8_g-0.1.model raw_minTermCount-3_tr_<cvf>__c-8_g-0.1.out
-> Use the cross-validation files corresponding to the model's data set. Each model's data set can be found in the Model table in the emc_my_solution document. (These files were generated in the prior section "Generate the seven data sets required for liblinear/libsvm models".)
-> Perform this step 5 times [once for each fold (<cvf> = 1 to 5)].

Step (3): Same as for liblinear.

Concrete steps for generating predictions for test data:

Step (1): Train using libsvm's svm-train command.
Example:
./svm-train -m 2000 -c 8 -g 0.1 -b 1 <EMC_ROOT>/outputFiles/liblinear/raw_minTermCount-3_tr_-1.csv raw_minTermCount-3_tr_-1__c-8_g-0.1.model
-> Use the training file containing the entire training data (*_tr_-1.csv) corresponding to the model's data set. Each model's data set can be found in the Model table in the emc_my_solution document. (These files were generated in the prior section "Generate the seven data sets required for liblinear/libsvm models".)
-> Use the model parameters from the Model table in the emc_my_solution document.

Step (2): Predict for the test data, using libsvm's svm-predict command.
Example:
./svm-predict -b 1 <EMC_ROOT>/outputFiles/liblinear/raw_minTermCount-3_test.csv raw_minTermCount-3_tr_-1__c-8_g-0.1.model raw_minTermCount-3_tr_-1__c-8_g-0.1.out
-> Use the test file corresponding to the model's data set. Each model's data set can be found in the Model table in the emc_my_solution document. (These files were generated in the prior section "Generate the seven data sets required for liblinear/libsvm models".)

Step (3): Same as for liblinear.
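The same kind of fold loop can be used for libsvm's cross-validation steps (1) and (2); again, the -m/-c/-g parameters and file names below are those of the example and should come from the Model table for each model, and the libsvm directory is a placeholder:

$ cd /path/to/libsvm       # placeholder: directory containing libsvm's svm-train and svm-predict binaries
$ for cvf in 1 2 3 4 5; do
      ./svm-train -m 2000 -c 8 -g 0.1 -b 1 <EMC_ROOT>/outputFiles/liblinear/raw_minTermCount-3_tr_${cvf}.csv raw_minTermCount-3_tr_${cvf}__c-8_g-0.1.model
      ./svm-predict -b 1 <EMC_ROOT>/outputFiles/liblinear/raw_minTermCount-3_cv_${cvf}.csv raw_minTermCount-3_tr_${cvf}__c-8_g-0.1.model raw_minTermCount-3_tr_${cvf}__c-8_g-0.1.out
  done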

Naive Bayes model #10


Concrete steps for building the model:
Steps (1), (2), and (3): Run the Java class com.emc.nb.NBExp1
$ cd <EMC_ROOT>/java
$ java -Xmx1024m -cp classes:lib/commons-math-2.2.jar com.emc.nb.NBExp1

Concrete steps for generating predictions for test data:
Steps (1), (2), and (3): Run the Java class com.emc.nb.NBExp1_Subm
$ cd <EMC_ROOT>/java
$ java -Xmx1024m -cp classes:lib/commons-math-2.2.jar com.emc.nb.NBExp1_Subm

Weighted k-NN models (11) and (12)


Concrete steps for building models:
Steps (1), (2), and (3): Run the Java class com.emc.knn.KnnExp1. This class automatically builds both models 11 and 12.
Command-line arguments:
  argument 1: number of threads [set as high as possible, i.e. the number of cores]
  argument 2: sleepIntervalInSeconds [set to 60] [used for printing progress]
$ cd <EMC_ROOT>/java
$ java -Xmx6000m -cp classes:lib/commons-math-2.2.jar com.emc.knn.KnnExp1 4 60

Concrete steps for generating predictions for test data:
Steps (1), (2), and (3): Run the Java class com.emc.knn.KnnExp1_Subm. This class automatically generates test predictions for both models 11 and 12.
Command-line arguments:
  argument 1: number of threads [set as high as possible, i.e. the number of cores]
  argument 2: sleepIntervalInSeconds [set to 60] [used for printing progress]
$ cd <EMC_ROOT>/java
$ java -Xmx6000m -cp classes:lib/commons-math-2.2.jar com.emc.knn.KnnExp1_Subm 4 60

Multinomial Naive Bayes model #20


Concrete steps for building the model:
Steps (1), (2), and (3): Run the Java class com.emc.nb.exp1_withTF.NBExp1__withTF
$ cd <EMC_ROOT>/java
$ java -Xmx1024m -cp classes:lib/commons-math-2.2.jar com.emc.nb.exp1_withTF.NBExp1__withTF

Concrete steps for generating predictions for test data:
Steps (1), (2), and (3): Run the Java class com.emc.nb.exp1_withTF.NBExp1__withTF_Subm
$ cd <EMC_ROOT>/java
$ java -Xmx1024m -cp classes:lib/commons-math-2.2.jar com.emc.nb.exp1_withTF.NBExp1__withTF_Subm

Weighted k-NN models (21) and (22)


Concrete steps for building models:
Steps (1), (2), and (3): Run the Java class com.emc.knn.KnnExp2_win. This class automatically builds both models 21 and 22.
Command-line arguments:
  argument 1: number of threads [set as high as possible, i.e. the number of cores]
  argument 2: sleepIntervalInSeconds [set to 60] [used for printing progress]
$ cd <EMC_ROOT>/java
$ java -Xmx6000m -cp classes:lib/commons-math-2.2.jar com.emc.knn.KnnExp2_win 4 60

Concrete steps for generating predictions for test data:
Steps (1), (2), and (3): Run the Java class com.emc.knn.KnnExp2_win_Subm. This class automatically generates test predictions for both models 21 and 22.
Command-line arguments:
  argument 1: number of threads [set as high as possible, i.e. the number of cores]
  argument 2: sleepIntervalInSeconds [set to 60] [used for printing progress]
$ cd <EMC_ROOT>/java
$ java -Xmx6000m -cp classes:lib/commons-math-2.2.jar com.emc.knn.KnnExp2_win_Subm 4 60

Train the Two Ensembles


Ensemble (1): e9a_FS_10k_all_15
Run the following commands in R:
  setwd('<EMC_ROOT>/R')
  source('emc_train_e9a_FS-10k-all-15.r')
  ens.glmnetTest(cvFold=5, alpha=0.5)

Ensemble (2): e9b
Run the following commands in R:
  setwd('<EMC_ROOT>/R')
  source('emc_train_e9b.r')
  ens.glmnetTest(cvFold=5, alpha=0.5)
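To run the ensemble training non-interactively from the shell instead of inside an R session, the same calls can be wrapped with Rscript -e (a minimal sketch; it only repeats the R commands above):

$ Rscript -e "setwd('<EMC_ROOT>/R'); source('emc_train_e9a_FS-10k-all-15.r'); ens.glmnetTest(cvFold=5, alpha=0.5)"
$ Rscript -e "setwd('<EMC_ROOT>/R'); source('emc_train_e9b.r'); ens.glmnetTest(cvFold=5, alpha=0.5)"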

Predict for Test Data Using the Two Ensembles


Ensemble (1): e9a_FS_10k_all_15
Run the following commands in R:
  setwd('<EMC_ROOT>/R')
  source('emc_sub_e9a_FS-10k-all-15.r')
  build_sub()
Predictions will be saved in the file e9a_FS-10k-all-15.csv.

Ensemble (2): e9b
Run the following commands in R:
  setwd('<EMC_ROOT>/R')
  source('emc_sub_e9b.r')
  build_sub()
Predictions will be saved in the file e9b.csv.

Building the Final Submission File


Take the weighted average of the predictions from the two ensembles using the following R code snippet:

e9a_FS_10k_all_15 = read.csv('e9a_FS-10k-all-15.csv')
e9b = read.csv('e9b.csv')
tmp = (0.3 * e9a_FS_10k_all_15) + (0.7 * e9b)
weighted_average = e9b
weighted_average[,2:98] = tmp[,2:98]
write.table(weighted_average, 'final_submission.csv', row.names=F, col.names=T, sep=',')
