Setup: Step-By-Step Instructions For How To Create The Solutions File From The Code Provided
Java Code
Generate Top Features Using MutualInformation Criterion
Generate the seven data sets required for liblinear/libsvm models
Build Individual Models and Predict for Test Data
Liblinear Models
Libsvm Models
Naive Bayes model #10
Weighted k-NN models (11) and (12)
Multinomial Naive Bayes model #20
Weighted k-NN models (21) and (22)
Train the Two Ensembles
Predict for Test Data Using the Two Ensembles
Building the Final Submission File
Setup
EMC_ROOT refers to the directory containing the Java/R code and data.
(i) Extract emc.zip to <EMC_ROOT>.
(ii) Copy all data files (the files provided by Kaggle/EMC) to the <EMC_ROOT>/data directory.
(iii) In com.emc.util.Config.java, set the value of the EMC_ROOT field to the path of this directory.
Prerequisites
(1) JDK 1.6 or higher
(2) Apache Ant (Java build tool)
(3) R version 2.14 or higher
(4) R's glmnet package
(5) liblinear
(6) libsvm
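Before starting, it can help to confirm that the command-line tools listed above are on the PATH. The following is a minimal sketch, not part of the original solution: the function name missing_tools is hypothetical, and it only checks that a command exists, not its version.

```shell
#!/bin/sh
# Print every command from the argument list that is not found on PATH.
# Returns non-zero if at least one command is missing.
# (missing_tools is an illustrative helper, not part of the original code.)
missing_tools() {
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            status=1
        fi
    done
    return $status
}
```

For example, "missing_tools java ant Rscript" prints nothing when the JDK, Ant, and R are all installed. The liblinear and libsvm binaries (predict, svm-train, svm-predict) are run from their own build directories in the examples in this document, so they are not expected to be on the PATH.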
-> Use the training file containing the entire training data (*_tr_-1.csv) corresponding to the model's data set. Each model's data set can be found in the Model table in the emc_my_solution document. (These files were generated in the prior section "Generate the seven data sets required for liblinear/libsvm models".)
-> Use the model parameters from the Model table in the emc_my_solution document.

Step (2): Predict for test data, using liblinear's predict command.
Example:
    ./predict -b 1 <EMC_ROOT>/outputFiles/liblinear/liblinear_test.csv liblinear_tr_-1__s-1_c-1.model liblinear_tr_-1__s-1_c-1.out
-> Use the test file corresponding to the model's data set. Each model's data set can be found in the Model table in the emc_my_solution document. (These files were generated in the prior section "Generate the seven data sets required for liblinear/libsvm models".)

Step (3): Run the com.emc.liblinear.LiblinearExp_subm Java class to generate the output CSV file. This class reads the prediction files from Step (2) and generates the output CSV. Before running it, make the following two changes:
(i) Modify line 24 to point to the liblinear prediction files from Step (2):
    BufferedReader br = new BufferedReader(new FileReader("..."));
(ii) Modify line 34 so that the file name matches the model's file name in either emc_sub_e9b.r or emc_sub_e9a_FS-10k-all-15.r [one of the data<model_number> variables]:
    FileWriter fileWriter = new FileWriter("...");

$ cd <EMC_ROOT>/java
$ ant
$ java -Xmx1024m -cp classes:lib/commons-math-2.2.jar com.emc.liblinear.LiblinearExp_subm
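To make the input to Step (3) concrete, the sketch below shows how a prediction file written with -b 1 can be turned into CSV. It assumes the usual layout of such files: a first line "labels <l1> <l2> ..." followed by one line per instance holding the predicted label and one probability per label. The function name predictions_to_csv is hypothetical, and this is not the logic of LiblinearExp_subm, whose output layout may differ.

```shell
#!/bin/sh
# Convert a "-b 1" prediction file to CSV on standard output.
# Assumed input layout:
#   line 1:      labels <l1> <l2> ...
#   other lines: <predicted_label> <p(l1)> <p(l2)> ...
# (predictions_to_csv is an illustrative helper, not part of the original code.)
predictions_to_csv() {
    awk '
        NR == 1 && $1 == "labels" {
            # Header row: "predicted" followed by the class labels.
            printf "predicted"
            for (i = 2; i <= NF; i++) printf ",%s", $i
            printf "\n"
            next
        }
        NF > 0 {
            # Data row: predicted label, then the per-class probabilities.
            printf "%s", $1
            for (i = 2; i <= NF; i++) printf ",%s", $i
            printf "\n"
        }
    ' "$1"
}
```

For example, "predictions_to_csv liblinear_tr_-1__s-1_c-1.out" writes the CSV to standard output, which is convenient for checking a prediction file by eye before running the Java class.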
Libsvm Models
The steps for libsvm are the same as for liblinear, except that different commands are used for training and prediction. For completeness, the libsvm steps are documented here. Build each libsvm model described in the Model table in the emc_my_solution document by following the steps outlined below.

Concrete steps for building models:

Step (1): Train using libsvm's svm-train command.
Example:
    ./svm-train -m 2000 -c 8 -g 0.1 -b 1 <EMC_ROOT>/outputFiles/liblinear/raw_minTermCount-3_tr_<cvf>.csv raw_minTermCount-3_tr_<cvf>__c-8_g-0.1.model
-> Use the training files corresponding to the model's data set. Each model's data set can be found in the Model table in the emc_my_solution document. (These files were generated in the prior section "Generate the seven data sets required for liblinear/libsvm models".)
-> Use the model parameters from the Model table in the emc_my_solution document.
-> Perform this step 5 times [once for each fold (<cvf> = 1 to 5)].

Step (2): Predict for cross validation data, using libsvm's svm-predict command.
Example:
    ./svm-predict -b 1 <EMC_ROOT>/outputFiles/liblinear/raw_minTermCount-3_cv_<cvf>.csv raw_minTermCount-3_tr_<cvf>__c-8_g-0.1.model raw_minTermCount-3_tr_<cvf>__c-8_g-0.1.out
-> Use the cross validation files corresponding to the model's data set. Each model's data set can be found in the Model table in the emc_my_solution document. (These files were generated in the prior section "Generate the seven data sets required for liblinear/libsvm models".)
-> Perform this step 5 times [once for each fold (<cvf> = 1 to 5)].

Step (3): Same as that for liblinear.

Concrete steps for generating predictions for test data:

Step (1): Train using libsvm's svm-train command.
Example:
    ./svm-train -m 2000 -c 8 -g 0.1 -b 1 <EMC_ROOT>/outputFiles/liblinear/raw_minTermCount-3_tr_-1.csv raw_minTermCount-3_tr_-1__c-8_g-0.1.model
-> Use the training file containing the entire training data (*_tr_-1.csv) corresponding to the model's data set. Each model's data set can be found in the Model table in the emc_my_solution document. (These files were generated in the prior section "Generate the seven data sets required for liblinear/libsvm models".)
-> Use the model parameters from the Model table in the emc_my_solution document.

Step (2): Predict for test data, using libsvm's svm-predict command.
Example:
    ./svm-predict -b 1 <EMC_ROOT>/outputFiles/liblinear/raw_minTermCount-3_test.csv raw_minTermCount-3_tr_-1__c-8_g-0.1.model raw_minTermCount-3_tr_-1__c-8_g-0.1.out
-> Use the test file corresponding to the model's data set. Each model's data set can be found in the Model table in the emc_my_solution document. (These files were generated in the prior section "Generate the seven data sets required for liblinear/libsvm models".)

Step (3): Same as that for liblinear.
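The five-fold training and cross validation prediction steps for libsvm described above repeat the same two commands with <cvf> running from 1 to 5. The following dry-run sketch prints those ten commands so they can be reviewed before being executed. The function name and the DATA_DIR, PREFIX, COST, and GAMMA variables are hypothetical conveniences, not part of the original solution; their defaults simply mirror the example in the text, and the <EMC_ROOT> placeholder is left unexpanded.

```shell
#!/bin/sh
# Dry run: print the svm-train and svm-predict commands for folds 1 to 5.
# DATA_DIR, PREFIX, COST and GAMMA are illustrative variables; the defaults
# reproduce the example values used in this document.
print_fold_commands() {
    data_dir=${DATA_DIR:-"<EMC_ROOT>/outputFiles/liblinear"}
    prefix=${PREFIX:-raw_minTermCount-3}
    cost=${COST:-8}
    gamma=${GAMMA:-0.1}
    for cvf in 1 2 3 4 5; do
        model="${prefix}_tr_${cvf}__c-${cost}_g-${gamma}.model"
        out="${prefix}_tr_${cvf}__c-${cost}_g-${gamma}.out"
        # Step (1): train on the training split of this fold.
        echo "./svm-train -m 2000 -c ${cost} -g ${gamma} -b 1 ${data_dir}/${prefix}_tr_${cvf}.csv ${model}"
        # Step (2): predict on the cross validation split of the same fold.
        echo "./svm-predict -b 1 ${data_dir}/${prefix}_cv_${cvf}.csv ${model} ${out}"
    done
}
```

Calling print_fold_commands with DATA_DIR set to the real data directory yields commands that can be pasted into a terminal in the libsvm directory. Other models from the Model table would need their own PREFIX, COST, and GAMMA values.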
$ java -Xmx6000m -cp classes:lib/commons-math-2.2.jar com.emc.knn.KnnExp1 4 60

Concrete steps for generating predictions for test data:

Steps (1), (2), and (3): Run the Java class com.emc.knn.KnnExp1_Subm. This class automatically generates test predictions for both models 11 and 12.
Command-line arguments:
    argument 1: number of threads [set as high as possible: the number of cores]
    argument 2: sleepIntervalInSeconds [set to 60] [used for printing progress]

$ cd <EMC_ROOT>/java
$ java -Xmx6000m -cp classes:lib/commons-math-2.2.jar com.emc.knn.KnnExp1_Subm 4 60
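Since the first argument should match the number of cores, the thread count can be detected rather than hard-coded. The sketch below is one way to do that, assuming nproc or getconf is available on the machine. Both function names are hypothetical, and the fallback of 4 simply repeats the value used in the example command.

```shell
#!/bin/sh
# Print the number of available cores, falling back to 4 (the value used in
# the example command) when it cannot be detected.
# (detect_cores and print_knn_command are illustrative helpers.)
detect_cores() {
    if command -v nproc >/dev/null 2>&1; then
        nproc
    elif n=$(getconf _NPROCESSORS_ONLN 2>/dev/null); then
        echo "$n"
    else
        echo 4
    fi
}

# Print (without running) the k-NN command for the given class name,
# using the detected core count and a 60-second progress interval.
print_knn_command() {
    echo "java -Xmx6000m -cp classes:lib/commons-math-2.2.jar $1 $(detect_cores) 60"
}
```

For example, "print_knn_command com.emc.knn.KnnExp1_Subm" prints the command to run from <EMC_ROOT>/java. The -Xmx6000m heap setting is kept from the original command and may need adjusting to the memory available.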