Map-Reduce Implementation, Using In-Map Aggregation and Other Features
Author: Cristiano Ruschel Marques Dias
Description
The indexing algorithm, implemented using the MapReduce architecture, allows whoever has access to the output data to make queries that take into account the position of each word in the document and the number of occurrences of each word in each document.
The features implemented were:
Case folding: A context-based capitalization algorithm was used to decide when to leave a word capitalized. Essentially, whenever a word starts a sentence it is presumed that it would not normally be capitalized, and it is therefore lowercased. Considerable thought was given to whether a casing algorithm was worth implementing at all, given that in general ignoring case produces results that are good enough and arguably just as good. Since this step does not have a great impact on performance, and in line with the opinion expressed in [1], the algorithm uses it, but results are similar without it.
Punctuation treatment: Instead of removing all punctuation, we trim the punctuation from the ends of words, since punctuation in the middle of a word sometimes carries meaning, for example in the number 8.5 or in "in-mapper". To do this we also had to separately trim the reference tags generated by Wikipedia, owing to their peculiar form [number]; mechanisms to modify the importance of words that appear inside a reference in the document could easily be implemented from this point.
Stop words and stemming: After the aforementioned steps, stop words (words that do not add information to the text) are removed using the algorithm provided. After this, words are stemmed, also using a provided algorithm.
In-mapper combining: The MapReduce pattern called in-mapper combining was implemented. This means that, instead of being written directly into the context to be treated by the reducer, the key-value pairs of each mapper are pre-aggregated, so as to lessen the amount of information that is sent to the reducers and increase the overall speed of the MapReduce run. It was implemented in such a way that the pre-combined key-value pairs are written to the context as soon as the mapper finishes, or once the map containing them has used too much memory (a constant threshold can be specified). It is similar to the last implementation found in [2], though no code was copied. A minimal sketch of the pattern is shown after this list.
Positional indexing: The position of each occurrence of each token emitted by the mapper (a simplified version of a word, produced by the aforementioned operations) is kept and propagated to the output, so that queries can take the position of the word in the document into account; an example of such a query is sketched after the Sample Output section.
Flagging of important items: The modifications needed to propagate the flagging of important items to the output were not made; therefore, even though this verification is performed at some points, the information is not sent to the output.
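
To make the in-mapper combining pattern concrete, the following is a minimal, self-contained sketch of the idea rather than the submitted code (which appears in full in BasicInvertedIndex.java below): token positions are buffered in a per-mapper HashMap and are only written to the context when the buffer grows past a threshold or when the mapper finishes. The class name PositionalInMapperCombiner and the constant MAX_BUFFER_SIZE are illustrative.

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.StringTokenizer;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative mapper: buffers (token -> positions) in memory and flushes
// the buffer when it grows too large or when the mapper finishes.
public class PositionalInMapperCombiner extends Mapper<Object, Text, Text, Text> {

    private static final int MAX_BUFFER_SIZE = 100000; // illustrative threshold
    private final HashMap<String, List<Integer>> buffer = new HashMap<String, List<Integer>>();
    private final Text outKey = new Text();
    private final Text outValue = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        for (int position = 0; itr.hasMoreTokens(); position++) {
            String token = itr.nextToken();
            List<Integer> positions = buffer.get(token);
            if (positions == null) {
                positions = new ArrayList<Integer>();
                buffer.put(token, positions);
            }
            positions.add(position);

            // flush early if the in-memory buffer grows past the threshold
            if (buffer.size() > MAX_BUFFER_SIZE) {
                flush(context);
            }
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // emit whatever is still buffered when the mapper finishes
        flush(context);
    }

    private void flush(Context context) throws IOException, InterruptedException {
        for (java.util.Map.Entry<String, List<Integer>> entry : buffer.entrySet()) {
            outKey.set(entry.getKey());
            outValue.set(entry.getValue().toString());
            context.write(outKey, outValue);
        }
        buffer.clear();
    }
}

The submitted implementation follows the same buffering and flushing logic, but emits PairOfWritables values that also carry the name of the input file.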
Performance
All the operations performed have a runtime complexity of O(n) in the length of the input, which helps guarantee the speed and scalability of the implemented algorithm. The algorithm takes some time to run due to the overheads involved in the MapReduce architecture, though as the input grows the overhead becomes comparatively insignificant. The use of the in-mapper combining pattern helps avoid bottlenecks, such as the algorithm running slowly due to the excessive (and normally costly) memory operations that would be caused by the mappers sending an unnecessarily large amount of data to the reducers; this makes the algorithm more scalable. The pattern also does not overload memory, and the fact that the amount of memory used by the in-map combiner can be changed enables the algorithm to be tuned for different users or situations. The bottlenecks of the algorithm as implemented are the amount of memory on the machine (though it would take a very large input to have a real impact on performance) and the number of cores, since these limit the number of map and reduce operations that can be run in parallel.
Sample Output
Man(Bart_the_Fink.txt.gz,[101,1950])
Man(Bart_the_Mother.txt.gz,[178,2268])
Manhattan(Bart_the_Murderer.txt.gz,[492])
Marg(Bart_the_Murderer.txt.gz,[134,517,2199])
Marg(Bart_the_Genius.txt.gz,[372,402])
Marg(Bart_the_Fink.txt.gz,[130,460,639,1978])
Marg(Bart_the_General.txt.gz,[257,403])
Marg(Bart_the_Lover.txt.gz,[110,625,627,2480])
Mark(Bart_the_Murderer.txt.gz,[1760])
Marri(Bart_the_Murderer.txt.gz,[133,2198])
Marri(Bart_the_Lover.txt.gz,[109,1573,2479])
Marri(Bart_the_Fink.txt.gz,[1379])
Martin(Bart_the_Genius.txt.gz,[349,466,1034])
Martyn(Bart_the_Genius.txt.gz,[1257,1686])
Martyn(Bart_the_Fink.txt.gz,[1461,1619])
Martyn(Bart_the_Mother.txt.gz,[1681])
Martyn(Bart_the_Murderer.txt.gz,[1492,1850])
Martyn(Bart_the_Lover.txt.gz,[1864,2040])
Martyn(Bart_the_General.txt.gz,[860,1350])
Mason(Bart_the_Lover.txt.gz,[1632])
Massachusett(Bart_the_Mother.txt.gz,[1433])
Masterpiec(Bart_the_Genius.txt.gz,[1986])
Masterpiec(Bart_the_Fink.txt.gz,[1855])
Matt(Bart_the_Fink.txt.gz,[73,1704])
Matt(Bart_the_Genius.txt.gz,[27,71,874,1722,1789])
Matt(Bart_the_Lover.txt.gz,[53,2102])
Matt(Bart_the_Mother.txt.gz,[78,957,1966])
Matt(Bart_the_General.txt.gz,[27,39,974,1327])
Matt(Bart_the_Murderer.txt.gz,[78,1926])
Max(Bart_the_Mother.txt.gz,[153,2243])
Maximum(Bart_the_Mother.txt.gz,[167,2257])
Mayor(Bart_the_Mother.txt.gz,[135,858,2225])
McClure(Bart_the_Mother.txt.gz,[75,317,1529,1574])
McClure(Bart_the_Fink.txt.gz,[70,1195])
McClure(Bart_the_Murderer.txt.gz,[59])
Me(Bart_the_Mother.txt.gz,[186,2276])
Melissa(Bart_the_Genius.txt.gz,[1529])
Melros(Bart_the_Fink.txt.gz,[1371])
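
As an illustration of the kind of positional query that this output enables (not part of the submitted code), the hypothetical sketch below parses lines in the token(file,[p1,p2,...]) format shown above and checks whether one token is immediately followed by another in the same document, i.e. a simple two-word phrase query.

import java.util.ArrayList;
import java.util.List;

// Hypothetical consumer of the index output: parses one posting line, e.g.
//   Marg(Bart_the_Fink.txt.gz,[130,460,639,1978])
// and supports a simple adjacency (phrase) check between two postings.
public class PostingLine {
    final String token;
    final String document;
    final List<Integer> positions = new ArrayList<Integer>();

    PostingLine(String line) {
        int open = line.indexOf('(');
        int bracket = line.indexOf('[');
        token = line.substring(0, open);
        document = line.substring(open + 1, line.indexOf(',', open));
        for (String n : line.substring(bracket + 1, line.indexOf(']')).split(",")) {
            positions.add(Integer.parseInt(n.trim()));
        }
    }

    // True if 'other' occurs in the same document at the position immediately
    // after one of this token's positions, i.e. the two tokens form a phrase.
    boolean followedBy(PostingLine other) {
        if (!document.equals(other.document))
            return false;
        for (int p : positions) {
            if (other.positions.contains(p + 1))
                return true;
        }
        return false;
    }

    public static void main(String[] args) {
        PostingLine a = new PostingLine("Man(Bart_the_Fink.txt.gz,[101,1950])");
        PostingLine b = new PostingLine("Marg(Bart_the_Fink.txt.gz,[130,460,639,1978])");
        // prints false for these sample postings, since no position of "Marg"
        // immediately follows a position of "Man" in Bart_the_Fink.txt.gz
        System.out.println(a.token + " followed by " + b.token + "? " + a.followedBy(b));
    }
}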
BasicInvertedIndex.java
/**
 * BasicInvertedIndex
 *
 * This MapReduce program should build an Inverted Index from a set of files.
 * Each token (the key) in a given file should reference the file it was found
 * in.
 *
 * The output of the program should look like this:
 * sometoken [file001, file002, ... ]
 *
 * @author Kristian Epps
 */
package uk.ac.man.cs.comp38120.exercise;

import java.io.*;
import java.util.*;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.CommandLineParser;
import org.apache.commons.cli.HelpFormatter;
import org.apache.commons.cli.OptionBuilder;
import org.apache.commons.cli.Options;
import org.apache.commons.cli.ParseException;

import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.log4j.Logger;

import uk.ac.man.cs.comp38120.io.array.ArrayListWritable;
import uk.ac.man.cs.comp38120.io.pair.PairOfStringFloat;
import uk.ac.man.cs.comp38120.io.pair.PairOfWritables;
import uk.ac.man.cs.comp38120.util.XParser;
import uk.ac.man.cs.comp38120.ir.StopAnalyser;
import uk.ac.man.cs.comp38120.ir.Stemmer;

import static java.lang.System.out;
public class BasicInvertedIndex extends Configured implements Tool
{
    private static final Logger LOG = Logger
            .getLogger(BasicInvertedIndex.class);

    public static class Map
            extends
            Mapper<Object, Text, Text, PairOfWritables<Text, ArrayListWritable<IntWritable>>>
    {
        // In-map aggregator: maps each token to the list of its positions in the current document
        java.util.Map<String, ArrayListWritable<IntWritable>> aggregator;
        final int MAX_AGGREGATOR_SIZE = 300000;

        // lazy initialization
        private java.util.Map<String, ArrayListWritable<IntWritable>> getAggregator()
        {
            if (aggregator == null)
                aggregator = new HashMap<String, ArrayListWritable<IntWritable>>();
            return aggregator;
        }
        // Writes all the data in the aggregator to the context, then clears it
        private void dump(Context context) throws IOException, InterruptedException
        {
            Iterator<java.util.Map.Entry<String, ArrayListWritable<IntWritable>>> iter;
            iter = getAggregator().entrySet().iterator();
            while (iter.hasNext())
            {
                java.util.Map.Entry<String, ArrayListWritable<IntWritable>> aux = iter.next();
                WORD.set(aux.getKey());
                context.write(WORD,
                        new PairOfWritables<Text, ArrayListWritable<IntWritable>>(INPUTFILE, aux.getValue()));
            }
            aggregator = null;
        }
        // Flushes the aggregator should it use too much memory
        private void flush(Context context) throws IOException, InterruptedException
        {
            if (getAggregator().size() > MAX_AGGREGATOR_SIZE)
                dump(context);
        }
        // Adds the given (token, position) pair to the aggregator instead of writing it to the context
        private void aggregate(String token, int position, Context context)
                throws IOException, InterruptedException
        {
            if (getAggregator().containsKey(token))
            {
                ArrayListWritable<IntWritable> l = getAggregator().get(token);
                l.add(new IntWritable(position));
                getAggregator().put(token, l);
            }
            else
            {
                ArrayListWritable<IntWritable> l = new ArrayListWritable<IntWritable>();
                l.add(new IntWritable(position));
                getAggregator().put(token, l);
            }
            flush(context);
        }
        // INPUTFILE holds the name of the current file
        private final static Text INPUTFILE = new Text();

        // TOKEN should be set to the current token rather than creating a
        // new Text object for each one
        @SuppressWarnings("unused")
        private final static Text TOKEN = new Text();

        // The StopAnalyser class helps remove stop words
        @SuppressWarnings("unused")
        private StopAnalyser stopAnalyser = new StopAnalyser();

        // The stem method wraps the functionality of the Stemmer
        // class, which trims extra characters from English words
        // Please refer to the Stemmer class for more comments
        @SuppressWarnings("unused")
        private String stem(String word)
        {
            Stemmer s = new Stemmer();

            // A char[] word is added to the stemmer with its length,
            // then stemmed
            s.add(word.toCharArray(), word.length());
            s.stem();

            // return the stemmed char[] word as a string
            return s.toString();
        }
        // This method gets the name of the file the current Mapper is working
        // on
        @Override
        public void setup(Context context)
        {
            String inputFilePath = ((FileSplit) context.getInputSplit()).getPath().toString();
            String[] pathComponents = inputFilePath.split("/");
            INPUTFILE.set(pathComponents[pathComponents.length - 1]);
        }
        // Lowercases the first word of each sentence, leaving other capitalized words intact
        private String caseFolding(String text)
        {
            String result = new String(text);

            // for each sentence
            for (String sentence : text.split("\\."))
            {
                for (String word : sentence.split(" "))
                {
                    // cleans the word of punctuation
                    String aux = trimPunctuation(word);

                    // gets the first word that was not only punctuation
                    if (aux == null)
                        continue;
                    if (aux.length() <= 0)
                        continue;

                    // makes it lowercase
                    if (Character.isUpperCase(aux.codePointAt(0)))
                        // TODO: IF NOT ACRONYM
                        result = result.replace(word, word.toLowerCase());
                    break;
                }
            }
            return result;
        }
        // Trims punctuation from the start and end of a string. Returns null if the string is
        // only punctuation, else returns the trimmed string
        private String trimPunctuation(String str)
        {
            if (str.length() == 0)
                return null;

            String punct = new String("!\"#$%&\'*+,./:'\\'<=>?@[]^_`{|}~()\t\n\f\r");

            // removes punctuation and other symbols from beginning and end of string
            int i = 0;

            // removes punctuation from the beginning
            while (i < str.length() && punct.contains(str.substring(i, i + 1)))
                i++;
            str = str.substring(i);

            if (str.length() == 0)
                return null;

            int j = str.length() - 1;

            // removes punctuation from the end
            while (j > 0 && punct.contains(str.substring(j, j + 1)))
                j--;

            return str.substring(0, j + 1);
        }
        // Removes tags of the form [number], which occasionally remain after tokenization
        private String trimTags(String str)
        {
            if (str.length() == 0)
                return null;

            // find the opening bracket, if any
            int i;
            for (i = 0; i < str.length() && str.codePointAt(i) != '['; i++);
            if (i == str.length())
                return str;

            // advance past the digits and the closing bracket
            int j;
            for (j = i + 1; j < str.length()
                    && (Character.isDigit(str.codePointAt(j)) || str.codePointAt(j) == ']'); j++);

            // keep everything before and after the tag
            String result = str.substring(0, i);
            if (j < str.length())
                result += str.substring(j);
            return result;
        }
        private final static Text WORD = new Text();

        // This Mapper reads in a line, converts it to a set of tokens and outputs
        // each modified token with the position of its occurrence in the document
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException
        {
            String line = value.toString();

            // tokenizes the text after case folding
            StringTokenizer itr = new StringTokenizer(caseFolding(line));

            for (int position = 0; itr.hasMoreTokens(); position++)
            {
                String str = itr.nextToken();

                // trims the tags of the form [number]
                str = trimTags(str);

                // trims punctuation
                str = trimPunctuation(str);

                // does not add words that became null after being stripped of punctuation
                if (str == null)
                    continue;

                // disregards stop words
                if (StopAnalyser.isStopWord(str))
                    continue;

                // stems words
                str = stem(str);

                // combines this output with the other output given by this mapper.
                // Implements the pattern of in-map combining (in-map aggregation)
                aggregate(str, position, context);
            }

            // guarantees that no information remains without being forwarded to the reducer
            dump(context);
        }
    }
    public static class Reduce
            extends
            Reducer<Text, PairOfWritables<Text, ArrayListWritable<IntWritable>>, Text, PairOfWritables<Text, ArrayListWritable<IntWritable>>>
    {
        private final static Text WORD = new Text();

        // This Reduce job takes in a key and an iterable of (filename, positions) pairs,
        // merges the position lists per file and outputs them along with the key
        public void reduce(
                Text key,
                Iterable<PairOfWritables<Text, ArrayListWritable<IntWritable>>> values,
                Context context) throws IOException, InterruptedException
        {
            Iterator<PairOfWritables<Text, ArrayListWritable<IntWritable>>> iter =
                    values.iterator();

            java.util.Map<Text, ArrayListWritable<IntWritable>> combine =
                    new HashMap<Text, ArrayListWritable<IntWritable>>();

            // for each value given by the mappers
            while (iter.hasNext())
            {
                PairOfWritables<Text, ArrayListWritable<IntWritable>> pair = iter.next();

                // concatenates the position arrays for each document in which the token appears
                if (!combine.containsKey(pair.getLeftElement()))
                {
                    combine.put(pair.getLeftElement(), pair.getRightElement());
                }
                else
                {
                    ArrayListWritable<IntWritable> auxList = new ArrayListWritable<IntWritable>();
                    auxList.addAll(pair.getRightElement());
                    auxList.addAll(combine.get(pair.getLeftElement()));
                    combine.put(pair.getLeftElement(), auxList);
                }
            }

            Iterator<java.util.Map.Entry<Text, ArrayListWritable<IntWritable>>> iter2 =
                    combine.entrySet().iterator();

            // writes the output
            while (iter2.hasNext())
            {
                java.util.Map.Entry<Text, ArrayListWritable<IntWritable>> entry = iter2.next();
                WORD.set(key);
                context.write(WORD,
                        new PairOfWritables<Text, ArrayListWritable<IntWritable>>(entry.getKey(), entry.getValue()));
            }
        }
    }
    // Lets create an object! :)
    public BasicInvertedIndex()
    {
    }

    // Variables to hold cmd line args
    private static final String INPUT = "input";
    private static final String OUTPUT = "output";
    private static final String NUM_REDUCERS = "numReducers";
    @SuppressWarnings({ "static-access" })
    public int run(String[] args) throws Exception
    {
        // Handle command line args
        Options options = new Options();
        options.addOption(OptionBuilder.withArgName("path").hasArg()
                .withDescription("input path").create(INPUT));
        options.addOption(OptionBuilder.withArgName("path").hasArg()
                .withDescription("output path").create(OUTPUT));
        options.addOption(OptionBuilder.withArgName("num").hasArg()
                .withDescription("number of reducers").create(NUM_REDUCERS));

        CommandLine cmdline = null;
        CommandLineParser parser = new XParser(true);

        try
        {
            cmdline = parser.parse(options, args);
        }
        catch (ParseException exp)
        {
            System.err.println("Error parsing command line: "
                    + exp.getMessage());
            System.err.println(cmdline);
            return 1;
        }

        // If we are missing the input or output flag, let the user know
        if (!cmdline.hasOption(INPUT) || !cmdline.hasOption(OUTPUT))
        {
            System.out.println("args: " + Arrays.toString(args));
            HelpFormatter formatter = new HelpFormatter();
            formatter.setWidth(120);
            formatter.printHelp(this.getClass().getName(), options);
            ToolRunner.printGenericCommandUsage(System.out);
            return 1;
        }

        // Create a new Map Reduce Job
        Configuration conf = new Configuration();
        Job job = new Job(conf);
        String inputPath = cmdline.getOptionValue(INPUT);
        String outputPath = cmdline.getOptionValue(OUTPUT);

        int reduceTasks = cmdline.hasOption(NUM_REDUCERS) ? Integer
                .parseInt(cmdline.getOptionValue(NUM_REDUCERS)) : 1;

        // Set the name of the Job and the class it is in
        job.setJobName("BasicInvertedIndex");
        job.setJarByClass(BasicInvertedIndex.class);
        job.setNumReduceTasks(reduceTasks);

        // Set the Mapper and Reducer class (no need for combiner here)
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        // Set the Output Classes
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(PairOfWritables.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(PairOfWritables.class);

        // Set the input and output file paths
        FileInputFormat.setInputPaths(job, new Path(inputPath));
        FileOutputFormat.setOutputPath(job, new Path(outputPath));

        // Time the job whilst it is running
        long startTime = System.currentTimeMillis();
        job.waitForCompletion(true);
        LOG.info("Job Finished in " + (System.currentTimeMillis() - startTime)
                / 1000.0 + " seconds");

        // Returning 0 lets everyone know the job was successful
        return 0;
    }
    public static void main(String[] args) throws Exception
    {
        ToolRunner.run(new BasicInvertedIndex(), args);
    }
}