
L & T Mahout Practice Examples

These exercises walk through logistic regression and LDA topic modeling with the Mahout machine learning library on Hadoop: downloading and pre-processing a banking dataset for logistic regression (removing headers and splitting into training and test sets), training and evaluating a logistic regression model, pre-processing a news dataset into sequence and vector formats, and running LDA to extract 20 topics from the news data.

Logistic Regression
Create a folder in your home directory with the following commands:
cd $HOME
mkdir bank_data
cd bank_data
Download the data into the bank_data directory:
wget http://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip
Unzip the file:
unzip bank-additional.zip
cd bank-additional
Change into the bank-additional directory and inspect the files using the ls and gedit commands.
sed is a powerful stream editor on Linux that can be used as a data pre-processing tool. The general syntax is:
sed -e 's/STRING_TO_REPLACE/STRING_TO_REPLACE_IT/g' fileName > Output_fileName
sed -i 's/STRING_TO_REPLACE/STRING_TO_REPLACE_IT/g' fileName
The -e form writes the result to a new file, while -i edits the file in place.
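For example, to replace every occurrence of foo with bar in a hypothetical file sample.txt (the file name is just an illustration, not part of the exercise data):
sed -e 's/foo/bar/g' sample.txt > sample_fixed.txt
sed -i 's/foo/bar/g' sample.txt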
The commands to replace ; with , and to remove the " characters are as follows:
sed -e 's/;/,/g' bank-additional-full.csv > input_bank_data.csv
sed -i 's/"//g' input_bank_data.csv
Remove the header line from the dataset:
sed -i '1d' input_bank_data.csv
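To confirm the pre-processing worked, you can inspect the first few lines (a quick sanity check, not part of the original steps); the output should show comma-separated values with no quotes and no header row:
head -3 input_bank_data.csv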
Create a new directory and copy the file into it:
mkdir input_bank
cp input_bank_data.csv input_bank
Set Mahout to run in local mode rather than distributed mode:
export MAHOUT_LOCAL=TRUE
Split the dataset into training and test datasets using the Mahout split command:
mahout split --input input_bank --trainingOutput train_data --testOutput test_data -xm sequential --randomSelectionPct 30
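As an optional sanity check, compare the line counts of the two splits; with --randomSelectionPct 30, roughly 30% of the rows should land in the test set:
wc -l train_data/input_bank_data.csv test_data/input_bank_data.csv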

Restore the header line in the training and test datasets:
sed -i '1s/^/age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y\n/' train_data/input_bank_data.csv
sed -i '1s/^/age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y\n/' test_data/input_bank_data.csv
Train the model:
mahout trainlogistic --input train_data/input_bank_data.csv --output model --target y --predictors age job marital education default housing loan contact month day_of_week duration campaign pdays previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed --types n w w w w w w w w w n n n n w n n n n n --features 20 --passes 100 --rate 50 --categories 2
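The --types string maps one code to each predictor in order (n for numeric, w for word/categorical). To double-check that the codes line up with the column order, you can number the fields of the header you restored (an optional check):
head -1 train_data/input_bank_data.csv | tr ',' '\n' | nl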
Get the resubstitution error (the model's performance on the training data it was fit on):
mahout runlogistic --auc --confusion --input train_data/input_bank_data.csv --model model
To get the scores for each instance, we use the --scores option as follows:
mahout runlogistic --scores --input train_data/input_bank_data.csv --model model
To test the model on the test data, we pass in the test file created during the split process:
mahout runlogistic --auc --confusion --input test_data/input_bank_data.csv --model model
The --scores option can be used in the same way to get per-instance scores on the test data, as shown below.
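By analogy with the training-data command above, this would be:
mahout runlogistic --scores --input test_data/input_bank_data.csv --model model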
LDA
On the command line, first set up the working directory as follows:
mkdir /tmp/lda
export WORK_DIR=/tmp/lda
Then we download the data and extract it into the working directory (the tar -C option requires the target directory to exist, so create it first):

wget http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz
mkdir $WORK_DIR/input
tar xvzf reuters21578.tar.gz -C $WORK_DIR/input
We will use the Mahout class ExtractReuters to extract the articles into individual files:
mahout org.apache.lucene.benchmark.utils.ExtractReuters $WORK_DIR/input $WORK_DIR/reutersfinal
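To verify that the extraction produced the individual article files (a quick local check, not part of the original steps):
ls $WORK_DIR/reutersfinal | head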
Set Mahout to run on the Hadoop cluster instead of locally:
export MAHOUT_LOCAL=FALSE
Close and reopen the terminal. (Mahout's launcher treats any non-empty MAHOUT_LOCAL value as local mode, so restarting the shell ensures the variable set earlier is actually cleared.)
Transfer the reutersfinal directory from the local filesystem to the Hadoop cluster:
hadoop fs -put /tmp/lda/reutersfinal reutersfinal
Check that the files were transferred to Hadoop:
hadoop fs -ls reutersfinal
The next step is to convert the files to the sequence format. We will use the Mahout
command seqdirectory for that:
mahout seqdirectory -i reutersfinal -o sequencefiles -c UTF-8 -chunk 5
To view one of the sequence files, we will use the seqdumper utility:
mahout seqdumper -i sequencefiles/part-m-00000 -o part-m-00000.txt
gedit part-m-00000.txt
The next step is to convert the sequence files into a term frequency matrix using the Mahout utility seq2sparse; this matrix can then be used to perform topic modeling. Note that -wt tf is used because cvb expects raw term-frequency counts rather than tf-idf weights:
mahout seq2sparse -i sequencefiles/ -o vectors/ -wt tf --namedVector
Check the files created:
hadoop fs -ls vectors
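You can also peek at the dictionary that seq2sparse generated, which maps each term to an integer id (an optional check using the same seqdumper utility as above):
mahout seqdumper -i vectors/dictionary.file-0 | head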
Use rowid to convert the sparse vectors into the form needed for cvb clustering (i.e., to change the Text key to an Integer key):
mahout rowid -i vectors/tf-vectors -o reuters-out-matrix
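The rowid job writes both the renumbered matrix and a docIndex mapping back to the original document keys; you can verify that both were created (an optional check):
hadoop fs -ls reuters-out-matrix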
We execute the Mahout cvb command to perform topic modeling on the input dataset:
mahout cvb -i reuters-out-matrix/matrix -o reuterslda -k 20 -ow -x 20 -dict vectors/dictionary.file-0 -dt reuters-lda-topics -mt reuters-lda-model
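For reference, a brief gloss of the cvb options used here, based on Mahout's cvb usage text: -k 20 sets the number of latent topics, -x 20 the maximum number of iterations, -ow overwrites existing output, -dict points at the dictionary produced by seq2sparse, -dt is the output path for the per-document topic distributions, and -mt holds the intermediate topic-model state.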


Verify the output directories:
hadoop fs -ls reuters-lda-topics
To view the results, we will use the Mahout vectordump utility:
mahout vectordump -i reuterslda/part-m-00000 -o reutersldaop/vectordump -vs 10 -p true -d vectors/dictionary.file-0 -dt sequencefile -sort reuterslda/part-m-00000
Check the dump created:
ls reutersldaop
gedit reutersldaop/vectordump
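The per-document topic distributions written to reuters-lda-topics can be dumped the same way; the part-m-00000 file name and the local output file name below follow the pattern above and are assumptions, not part of the original exercise:
mahout vectordump -i reuters-lda-topics/part-m-00000 -o doc_topics_dump.txt -p true
gedit doc_topics_dump.txt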
