
L & T Mahout Practice Examples

These exercises walk through logistic regression and LDA topic modeling with the Mahout machine learning library on Hadoop: downloading and pre-processing a banking dataset for logistic regression (removing headers and splitting into training and test sets), training and evaluating a logistic regression model, pre-processing a news dataset into sequence and vector formats, and running LDA to extract 20 topics from the news data.

Logistic Regression
Create a folder in your home directory with the following commands:
cd $HOME
mkdir bank_data
cd bank_data
Download the data into the bank_data directory:
wget http://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip
Unzip the file:
unzip bank-additional.zip
cd bank-additional
Change into the bank-additional directory and inspect the files using the ls and gedit commands.
sed is a powerful stream editor on Linux that can be used as a data pre-processing tool. The general syntax is:
sed -e 's/STRING_TO_REPLACE/STRING_TO_REPLACE_IT/g' fileName > Output_fileName
sed -i 's/STRING_TO_REPLACE/STRING_TO_REPLACE_IT/g' fileName
The -e form writes the result to a new file, while -i edits the file in place.
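For example, to replace every occurrence of foo with bar in a hypothetical file sample.txt (the file name is just an illustration, not part of the exercise data):
sed -e 's/foo/bar/g' sample.txt > sample_fixed.txt
sed -i 's/foo/bar/g' sample.txt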
The commands to replace ; with , and to remove the " characters are as follows:
sed -e 's/;/,/g' bank-additional-full.csv > input_bank_data.csv
sed -i 's/"//g' input_bank_data.csv
Remove the header line from the dataset:
sed -i '1d' input_bank_data.csv
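To confirm the pre-processing worked, you can inspect the first few lines (a quick sanity check, not part of the original steps); the output should show comma-separated values with no quotes and no header row:
head -3 input_bank_data.csv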
Create a new directory and copy the file into it:
mkdir input_bank
cp input_bank_data.csv input_bank
Set Mahout to run in local mode rather than distributed mode:
export MAHOUT_LOCAL=TRUE
Split the dataset into training and test datasets using the Mahout split command:
mahout split --input input_bank --trainingOutput train_data --testOutput test_data -xm sequential --randomSelectionPct 30
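As an optional sanity check, compare the line counts of the two splits; with --randomSelectionPct 30, roughly 30% of the rows should land in the test set:
wc -l train_data/input_bank_data.csv test_data/input_bank_data.csv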

Restore the header line in the training and test datasets:
sed -i '1s/^/age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y\n/' train_data/input_bank_data.csv
sed -i '1s/^/age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y\n/' test_data/input_bank_data.csv
Train the model:
mahout trainlogistic --input train_data/input_bank_data.csv --output model --target y --predictors age job marital education default housing loan contact month day_of_week duration campaign pdays previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed --types n w w w w w w w w w n n n n w n n n n n --features 20 --passes 100 --rate 50 --categories 2
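The --types string maps one code to each predictor in order (n for numeric, w for word/categorical). To double-check that the codes line up with the column order, you can number the fields of the header you restored (an optional check):
head -1 train_data/input_bank_data.csv | tr ',' '\n' | nl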
Get the resubstitution error (the model's performance on the training data it was fit on):
mahout runlogistic --auc --confusion --input train_data/input_bank_data.csv --model model
To get the scores for each instance, we use the --scores option as follows:
mahout runlogistic --scores --input train_data/input_bank_data.csv --model model
To test the model on the test data, we pass in the test file created during the split process:
mahout runlogistic --auc --confusion --input test_data/input_bank_data.csv --model model
The --scores option can be used in the same way to get per-instance scores on the test data, as shown below.
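By analogy with the training-data command above, this would be:
mahout runlogistic --scores --input test_data/input_bank_data.csv --model model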
LDA
On the command line, first set up the working directory as follows:
mkdir /tmp/lda
export WORK_DIR=/tmp/lda
Then we download the data and extract it into the working directory (the tar -C option requires the target directory to exist, so create it first):

wget http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz
mkdir $WORK_DIR/input
tar xvzf reuters21578.tar.gz -C $WORK_DIR/input
We will use the Mahout class ExtractReuters to extract the articles into individual files:
mahout org.apache.lucene.benchmark.utils.ExtractReuters $WORK_DIR/input $WORK_DIR/reutersfinal
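To verify that the extraction produced the individual article files (a quick local check, not part of the original steps):
ls $WORK_DIR/reutersfinal | head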
Set Mahout to run on the Hadoop cluster instead of locally:
export MAHOUT_LOCAL=FALSE
Close and reopen the terminal. (Mahout's launcher treats any non-empty MAHOUT_LOCAL value as local mode, so restarting the shell ensures the variable set earlier is actually cleared.)
Transfer the reutersfinal directory from the local filesystem to the Hadoop cluster:
hadoop fs -put /tmp/lda/reutersfinal reutersfinal
Check that the files were transferred to Hadoop:
hadoop fs -ls reutersfinal
The next step is to convert the files to the sequence format. We will use the Mahout
command seqdirectory for that:
mahout seqdirectory -i reutersfinal -o sequencefiles -c UTF-8 -chunk 5
To view one of the sequence files, we will use the seqdumper utility:
mahout seqdumper -i sequencefiles/part-m-00000 -o part-m-00000.txt
gedit part-m-00000.txt
The next step is to convert the sequence files into a term frequency matrix using the Mahout utility seq2sparse; this matrix can then be used to perform topic modeling. Note that -wt tf is used because cvb expects raw term-frequency counts rather than tf-idf weights:
mahout seq2sparse -i sequencefiles/ -o vectors/ -wt tf --namedVector
Check the files created:
hadoop fs -ls vectors
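You can also peek at the dictionary that seq2sparse generated, which maps each term to an integer id (an optional check using the same seqdumper utility as above):
mahout seqdumper -i vectors/dictionary.file-0 | head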
Use rowid to convert the sparse vectors into the form needed for cvb clustering (i.e., to change the Text key to an Integer key):
mahout rowid -i vectors/tf-vectors -o reuters-out-matrix
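The rowid job writes both the renumbered matrix and a docIndex mapping back to the original document keys; you can verify that both were created (an optional check):
hadoop fs -ls reuters-out-matrix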
We execute the Mahout cvb command to perform topic modeling on the input dataset:
mahout cvb -i reuters-out-matrix/matrix -o reuterslda -k 20 -ow -x 20 -dict vectors/dictionary.file-0 -dt reuters-lda-topics -mt reuters-lda-model
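For reference, a brief gloss of the cvb options used here, based on Mahout's cvb usage text: -k 20 sets the number of latent topics, -x 20 the maximum number of iterations, -ow overwrites existing output, -dict points at the dictionary produced by seq2sparse, -dt is the output path for the per-document topic distributions, and -mt holds the intermediate topic-model state.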


Verify the output directories:
hadoop fs -ls reuters-lda-topics
To view the results, we will use the Mahout vectordump utility:
mahout vectordump -i reuterslda/part-m-00000 -o reutersldaop/vectordump -vs 10 -p true -d vectors/dictionary.file-0 -dt sequencefile -sort reuterslda/part-m-00000
Check the dump created:
ls reutersldaop
gedit reutersldaop/vectordump
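The per-document topic distributions written to reuters-lda-topics can be dumped the same way; the part-m-00000 file name and the local output file name below follow the pattern above and are assumptions, not part of the original exercise:
mahout vectordump -i reuters-lda-topics/part-m-00000 -o doc_topics_dump.txt -p true
gedit doc_topics_dump.txt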
