Map-Reduce Implementation, Using In-Map Aggregation and Other Features


COMP38120: Documents, Services and Data on the Web

Laboratory Exercise 1.3

Author: Cristiano Ruschel Marques Dias

Description

The indexing algorithm, implemented using the MapReduce architecture, allows whoever has access to the output data to make queries that take into account the position of each word in the document and the number of occurrences of each word in each document. The features implemented were:

Case Folding: A context-based capitalization algorithm was used to decide when to leave a word capitalized. Essentially, whenever a word is at the start of a sentence, it is presupposed that it normally would not be capitalized, and it is therefore lowercased. A lot of thought was given to this, especially to whether it would be worth implementing a full casing algorithm, given that in general not using case matching gives results that are good enough, and arguably as good. Since using it does not have a great impact on performance, and following the opinion given in [1], the algorithm uses it, but results are similar without it.
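As a rough stand-alone illustration of this heuristic (simplified from the caseFolding method in the source below; the class name CaseFoldDemo is hypothetical), a word is lowercased only when it begins a sentence:

```java
public class CaseFoldDemo {
    // Lowercase the first word of each sentence, on the assumption that its
    // capital letter comes from sentence position rather than a proper noun.
    static String caseFold(String text) {
        StringBuilder out = new StringBuilder();
        boolean sentenceStart = true;
        for (String word : text.split(" ")) {
            if (sentenceStart && !word.isEmpty()
                    && Character.isUpperCase(word.charAt(0))) {
                word = word.toLowerCase();
            }
            // A word ending in '.' closes the sentence, so the next word
            // is treated as a sentence start.
            sentenceStart = word.endsWith(".");
            out.append(word).append(" ");
        }
        return out.toString().trim();
    }
}
```

Note that proper nouns at sentence starts ("Bart", say) are also lowercased; the report accepts this trade-off since full case matching adds little.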

Punctuation Treatment: Instead of removing all punctuation, we trim the punctuation from the ends of words, since punctuation in the middle of a word sometimes has meaning, for example in the number 8.5 or in "in-mapper". To do this we also had to separately trim the reference tags generated by Wikipedia, due to their peculiar form [number]; mechanisms to modify the importance of words inside a reference in the document could easily be implemented from this point.
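A minimal stand-alone sketch of this end-trimming idea (simplified from the trimPunctuation and trimTags methods in the source below; the class name TrimDemo and the regex-based tag removal are illustrative assumptions):

```java
public class TrimDemo {
    // '-' is deliberately absent, so "in-mapper" survives intact.
    static final String PUNCT = "!\"#$%&'*+,./:;\\<=>?@[]^_`{|}~()";

    // Strip punctuation from both ends only, so internal punctuation
    // ("8.5", "in-mapper") is preserved. Returns null if nothing is left.
    static String trimPunctuation(String str) {
        int i = 0;
        while (i < str.length() && PUNCT.indexOf(str.charAt(i)) >= 0) i++;
        int j = str.length() - 1;
        while (j >= i && PUNCT.indexOf(str.charAt(j)) >= 0) j--;
        return (j < i) ? null : str.substring(i, j + 1);
    }

    // Drop Wikipedia-style reference tags of the form [number].
    static String trimTags(String str) {
        return str.replaceAll("\\[\\d+\\]", "");
    }
}
```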

Stop Words and Stemming: After the aforementioned steps, stop words (words that do not add information to the text) are removed using the algorithm provided. After this, words are stemmed, also using a provided algorithm.
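The order of these two filters can be sketched as follows. This is a toy version: the stop-word list and the naive suffix stripper below are stand-ins for the provided StopAnalyser and Stemmer classes, which are not reproduced here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class FilterDemo {
    // Toy stop-word list standing in for the provided StopAnalyser.
    static final List<String> STOP = Arrays.asList("the", "a", "of", "and");

    static boolean isStopWord(String w) { return STOP.contains(w); }

    // Naive suffix stripper standing in for the provided Porter-style Stemmer.
    static String stem(String w) {
        if (w.endsWith("ing")) return w.substring(0, w.length() - 3);
        if (w.endsWith("s"))   return w.substring(0, w.length() - 1);
        return w;
    }

    static List<String> process(List<String> tokens) {
        return tokens.stream()
                     .filter(t -> !isStopWord(t))   // drop stop words first
                     .map(FilterDemo::stem)         // then stem the survivors
                     .collect(Collectors.toList());
    }
}
```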

In-Mapper Combining: The MapReduce pattern called in-mapper combining was implemented. This means that instead of being written directly into the context to be treated by the reducer, the key-value pairs of each mapper are pre-processed, so as to lessen the amount of information sent to the reducers and increase the overall speed of the MapReduce run. It was implemented in such a way that the pre-combined key-value pairs are written into the context as soon as the mapper finishes, or once the Map containing them has used too much memory (a constant value can be specified). It is similar to the last implementation found in [2], though no code was copied.
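Stripped of the Hadoop plumbing, the pattern reduces to buffering positions per token in an in-memory map and flushing in bulk. The sketch below shows the idea outside Hadoop (the class name and the emitted list standing in for context.write are illustrative assumptions; the flush threshold matches the constant used in the source):

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-mapper combining: instead of emitting one (token, position) pair per
// occurrence, positions are buffered per token and flushed in bulk,
// shrinking the data shuffled to the reducers.
public class InMapperCombinerDemo {
    static final int MAX_AGGREGATOR_SIZE = 300000; // flush threshold

    final Map<String, List<Integer>> aggregator = new HashMap<>();
    // Stands in for context.write in the real mapper.
    final List<Map.Entry<String, List<Integer>>> emitted = new ArrayList<>();

    void aggregate(String token, int position) {
        aggregator.computeIfAbsent(token, k -> new ArrayList<>()).add(position);
        if (aggregator.size() > MAX_AGGREGATOR_SIZE)
            dump(); // flush early if the buffer grows too large
    }

    // Called when the mapper finishes, and on overflow.
    void dump() {
        for (Map.Entry<String, List<Integer>> e : aggregator.entrySet())
            emitted.add(new AbstractMap.SimpleEntry<>(
                    e.getKey(), new ArrayList<>(e.getValue())));
        aggregator.clear();
    }
}
```

The size-bounded flush is what keeps the pattern from trading shuffle volume for unbounded mapper memory.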

Positional Indexing: The position of each occurrence of each token emitted by the mapper (a simplified version of a word resulting from the aforementioned operations) is kept and propagated to the output, so that queries can take the position of the word in the document into account.

Flagging of Important Items: The modifications needed to propagate the flagging of important items to the output were not made; therefore, even though this verification is performed at some points, the information is not sent to the output.

Performance
All the operations performed have a runtime complexity of O(n) in relation to the length of the input, which guarantees the speed and scalability of the algorithm implemented. The algorithm takes some time to run due to the overheads involved in the MapReduce architecture, though as the input grows the overhead becomes comparatively insignificant. The use of the in-mapper combining pattern helps avoid bottlenecks, such as the algorithm running slowly due to the excessive (and normally costly) memory operations that would be caused by the mappers sending an unnecessarily large amount of data to the reducers; this makes the algorithm more scalable. The pattern also does not overload the memory, and since the amount of memory used by the in-map combiner may be changed, the algorithm can be selectively tuned for different users or situations. The bottlenecks of the algorithm as implemented are the amount of memory on the machine (though it would need a really big input to have a real impact on performance) and the number of cores, since these limit the number of map and reduce operations that can be run in parallel.

Sample Output

Man(Bart_the_Fink.txt.gz,[101,1950])
Man(Bart_the_Mother.txt.gz,[178,2268])
Manhattan(Bart_the_Murderer.txt.gz,[492])
Marg(Bart_the_Murderer.txt.gz,[134,517,2199])
Marg(Bart_the_Genius.txt.gz,[372,402])
Marg(Bart_the_Fink.txt.gz,[130,460,639,1978])
Marg(Bart_the_General.txt.gz,[257,403])
Marg(Bart_the_Lover.txt.gz,[110,625,627,2480])
Mark(Bart_the_Murderer.txt.gz,[1760])
Marri(Bart_the_Murderer.txt.gz,[133,2198])
Marri(Bart_the_Lover.txt.gz,[109,1573,2479])
Marri(Bart_the_Fink.txt.gz,[1379])
Martin(Bart_the_Genius.txt.gz,[349,466,1034])
Martyn(Bart_the_Genius.txt.gz,[1257,1686])
Martyn(Bart_the_Fink.txt.gz,[1461,1619])
Martyn(Bart_the_Mother.txt.gz,[1681])
Martyn(Bart_the_Murderer.txt.gz,[1492,1850])
Martyn(Bart_the_Lover.txt.gz,[1864,2040])
Martyn(Bart_the_General.txt.gz,[860,1350])
Mason(Bart_the_Lover.txt.gz,[1632])
Massachusett(Bart_the_Mother.txt.gz,[1433])
Masterpiec(Bart_the_Genius.txt.gz,[1986])
Masterpiec(Bart_the_Fink.txt.gz,[1855])
Matt(Bart_the_Fink.txt.gz,[73,1704])
Matt(Bart_the_Genius.txt.gz,[27,71,874,1722,1789])
Matt(Bart_the_Lover.txt.gz,[53,2102])
Matt(Bart_the_Mother.txt.gz,[78,957,1966])
Matt(Bart_the_General.txt.gz,[27,39,974,1327])
Matt(Bart_the_Murderer.txt.gz,[78,1926])
Max(Bart_the_Mother.txt.gz,[153,2243])
Maximum(Bart_the_Mother.txt.gz,[167,2257])
Mayor(Bart_the_Mother.txt.gz,[135,858,2225])
McClure(Bart_the_Mother.txt.gz,[75,317,1529,1574])
McClure(Bart_the_Fink.txt.gz,[70,1195])
McClure(Bart_the_Murderer.txt.gz,[59])
Me(Bart_the_Mother.txt.gz,[186,2276])
Melissa(Bart_the_Genius.txt.gz,[1529])
Melros(Bart_the_Fink.txt.gz,[1371])

BasicInvertedIndex.java

/**
 * BasicInvertedIndex
 *
 * This MapReduce program should build an Inverted Index from a set of files.
 * Each token (the key) in a given file should reference the file it was found
 * in.
 *
 * The output of the program should look like this:
 * sometoken [file001, file002, ... ]
 *
 * @author Kristian Epps
 */
package uk.ac.man.cs.comp38120.exercise;

import java.io.*;
import java.util.*;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.CommandLineParser;
import org.apache.commons.cli.HelpFormatter;
import org.apache.commons.cli.OptionBuilder;
import org.apache.commons.cli.Options;
import org.apache.commons.cli.ParseException;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.log4j.Logger;

import uk.ac.man.cs.comp38120.io.array.ArrayListWritable;
import uk.ac.man.cs.comp38120.io.pair.PairOfStringFloat;
import uk.ac.man.cs.comp38120.io.pair.PairOfWritables;
import uk.ac.man.cs.comp38120.util.XParser;
import uk.ac.man.cs.comp38120.ir.StopAnalyser;
import uk.ac.man.cs.comp38120.ir.Stemmer;

import static java.lang.System.out;

public class BasicInvertedIndex extends Configured implements Tool
{
    private static final Logger LOG = Logger
            .getLogger(BasicInvertedIndex.class);

    public static class Map extends
            Mapper<Object, Text, Text, PairOfWritables<Text, ArrayListWritable<IntWritable>>>
    {
        // In-map aggregator map
        java.util.Map<String, ArrayListWritable<IntWritable>> aggregator;

        final int MAX_AGGREGATOR_SIZE = 300000;

        // Lazy initialization
        private java.util.Map<String, ArrayListWritable<IntWritable>> getAggregator()
        {
            if (aggregator == null)
                aggregator = new HashMap<String, ArrayListWritable<IntWritable>>();
            return aggregator;
        }

        // Writes into the context all the data in the aggregator map, then clears it
        private void dump(Context context) throws IOException, InterruptedException
        {
            Iterator<java.util.Map.Entry<String, ArrayListWritable<IntWritable>>> iter;
            iter = getAggregator().entrySet().iterator();

            while (iter.hasNext())
            {
                java.util.Map.Entry<String, ArrayListWritable<IntWritable>> aux =
                        iter.next();
                WORD.set(aux.getKey());
                context.write(WORD, new PairOfWritables<Text, ArrayListWritable<IntWritable>>(
                        INPUTFILE, aux.getValue()));
            }

            aggregator = null;
        }

        // Flushes the map should it use too much memory
        private void flush(Context context) throws IOException, InterruptedException
        {
            if (getAggregator().size() > MAX_AGGREGATOR_SIZE)
                dump(context);
        }

        // Adds the given information, to be written into the context, to the aggregator map
        private void aggregate(String token, int position, Context context)
                throws IOException, InterruptedException
        {
            if (getAggregator().containsKey(token))
            {
                ArrayListWritable<IntWritable> l = getAggregator().get(token);
                l.add(new IntWritable(position));
                getAggregator().put(token, l);
            }
            else
            {
                ArrayListWritable<IntWritable> l =
                        new ArrayListWritable<IntWritable>();
                l.add(new IntWritable(position));
                getAggregator().put(token, l);
            }

            flush(context);
        }

        // INPUTFILE holds the name of the current file
        private final static Text INPUTFILE = new Text();

        // TOKEN should be set to the current token rather than creating a
        // new Text object for each one
        @SuppressWarnings("unused")
        private final static Text TOKEN = new Text();

        // The StopAnalyser class helps remove stop words
        @SuppressWarnings("unused")
        private StopAnalyser stopAnalyser = new StopAnalyser();

        // The stem method wraps the functionality of the Stemmer
        // class, which trims extra characters from English words
        // Please refer to the Stemmer class for more comments
        @SuppressWarnings("unused")
        private String stem(String word)
        {
            Stemmer s = new Stemmer();

            // A char[] word is added to the stemmer with its length,
            // then stemmed
            s.add(word.toCharArray(), word.length());
            s.stem();

            // Return the stemmed char[] word as a string
            return s.toString();
        }

        // This method gets the name of the file the current Mapper is working
        // on
        @Override
        public void setup(Context context)
        {
            String inputFilePath = ((FileSplit) context.getInputSplit())
                    .getPath().toString();
            String[] pathComponents = inputFilePath.split("/");
            INPUTFILE.set(pathComponents[pathComponents.length - 1]);
        }

        // Lowercases words that are capitalized only because they start a sentence
        private String caseFolding(String text)
        {
            String result = new String(text);

            // For each sentence
            for (String sentence : text.split("\\."))
            {
                for (String word : sentence.split(" "))
                {
                    // Cleans the word of punctuation
                    String aux = trimPunctuation(word);

                    // Gets the first word that was not only punctuation
                    if (aux == null)
                        continue;
                    if (aux.length() <= 0)
                        continue;

                    // Makes it lower case
                    if (Character.isUpperCase(aux.codePointAt(0)))
                    {
                        // TODO
                        // IF NOT ACRONYM
                        result = result.replace(word, word.toLowerCase());
                    }

                    break;
                }
            }

            return result;
        }

        // Trims punctuation from the start and end of a string. Returns null if the
        // string is only punctuation, else returns the trimmed string
        private String trimPunctuation(String str)
        {
            if (str.length() == 0)
                return null;

            String punct = new String("!\"#$%&\'*+,./:'\\'<=>?@[]^_`{|}~()\t\n\f\r");

            // Removes punctuation and other symbols from the beginning and end of the string
            int i = 0;

            // Removes punctuation from the beginning
            while (i < str.length() && punct.contains(str.substring(i, i + 1)))
                i++;
            str = str.substring(i);

            if (str.length() == 0)
                return null;

            int j = str.length() - 1;

            // Removes punctuation from the end
            while (j > 0 && punct.contains(str.substring(j, j + 1)))
                j--;

            return str.substring(0, j + 1);
        }

        // Removes tags of the form [number], which occasionally remain after tokenization
        private String trimTags(String str)
        {
            if (str.length() == 0)
                return null;

            // Finds the opening bracket of a tag, if there is one
            int i;
            for (i = 0; i < str.length() && str.codePointAt(i) != '['; i++);

            if (i == str.length())
                return str;

            // Scans past the digits and closing bracket of the tag
            int j;
            for (j = i + 1; j < str.length() && (Character.isDigit(str.codePointAt(j))
                    || str.codePointAt(j) == ']'); j++);

            // Keeps the text before the tag and whatever follows it
            String tail = (j < str.length()) ? str.substring(j) : "";
            return str.substring(0, i) + tail;
        }

        private final static Text WORD = new Text();

        // This Mapper reads in a line, converts it to a set of tokens and
        // outputs each modified token with the position of its occurrence in
        // the document
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException
        {
            String line = value.toString();

            // Tokenizes the text after case folding
            StringTokenizer itr = new StringTokenizer(caseFolding(line));

            for (int position = 0; itr.hasMoreTokens(); position++)
            {
                String str = itr.nextToken();

                // Trims the tags of the form [number]
                str = trimTags(str);

                // Trims punctuation
                str = trimPunctuation(str);

                // Does not add words that became null after being stripped of
                // punctuation
                if (str == null)
                    continue;

                // Disregards stop words
                if (StopAnalyser.isStopWord(str))
                    continue;

                // Stems words
                str = stem(str);

                // Combines this output with the other output given by this mapper,
                // implementing the in-map combining (in-map aggregation) pattern
                aggregate(str, position, context);
            }

            // Guarantees that no information remains without being forwarded
            // to the reducer
            dump(context);
        }
    }

    public static class Reduce extends Reducer<Text,
            PairOfWritables<Text, ArrayListWritable<IntWritable>>, Text,
            PairOfWritables<Text, ArrayListWritable<IntWritable>>>
    {
        private final static Text WORD = new Text();

        // This reduce job takes in a key and an iterable of (filename,
        // positions) pairs, merges the position lists per file and outputs
        // each merged pair along with the key
        public void reduce(
                Text key,
                Iterable<PairOfWritables<Text, ArrayListWritable<IntWritable>>> values,
                Context context) throws IOException, InterruptedException
        {
            Iterator<PairOfWritables<Text, ArrayListWritable<IntWritable>>> iter =
                    values.iterator();

            java.util.Map<Text, ArrayListWritable<IntWritable>> combine =
                    new HashMap<Text, ArrayListWritable<IntWritable>>();

            // For each value given by the mappers
            while (iter.hasNext())
            {
                PairOfWritables<Text, ArrayListWritable<IntWritable>> pair =
                        iter.next();

                // Copies are taken because Hadoop reuses the objects handed
                // out by the values iterator
                Text file = new Text(pair.getLeftElement());
                ArrayListWritable<IntWritable> positions =
                        new ArrayListWritable<IntWritable>();
                for (IntWritable position : pair.getRightElement())
                    positions.add(new IntWritable(position.get()));

                // Concatenates the position arrays of each document in which
                // this token appears
                if (!combine.containsKey(file))
                    combine.put(file, positions);
                else
                    combine.get(file).addAll(positions);
            }

            Iterator<java.util.Map.Entry<Text, ArrayListWritable<IntWritable>>> iter2 =
                    combine.entrySet().iterator();

            // Writes the output
            while (iter2.hasNext())
            {
                java.util.Map.Entry<Text, ArrayListWritable<IntWritable>> entry =
                        iter2.next();
                WORD.set(key);
                context.write(WORD, new PairOfWritables<Text, ArrayListWritable<IntWritable>>(
                        entry.getKey(), entry.getValue()));
            }
        }
    }

    // Let's create an object! :)
    public BasicInvertedIndex()
    {
    }

    // Variables to hold cmd line args
    private static final String INPUT = "input";
    private static final String OUTPUT = "output";
    private static final String NUM_REDUCERS = "numReducers";

    @SuppressWarnings({ "static-access" })
    public int run(String[] args) throws Exception
    {
        // Handle command line args
        Options options = new Options();

        options.addOption(OptionBuilder.withArgName("path").hasArg()
                .withDescription("input path").create(INPUT));
        options.addOption(OptionBuilder.withArgName("path").hasArg()
                .withDescription("output path").create(OUTPUT));
        options.addOption(OptionBuilder.withArgName("num").hasArg()
                .withDescription("number of reducers").create(NUM_REDUCERS));

        CommandLine cmdline = null;
        CommandLineParser parser = new XParser(true);

        try
        {
            cmdline = parser.parse(options, args);
        }
        catch (ParseException exp)
        {
            System.err.println("Error parsing command line: "
                    + exp.getMessage());
            System.err.println(cmdline);
            return -1;
        }

        // If we are missing the input or output flag, let the user know
        if (!cmdline.hasOption(INPUT) || !cmdline.hasOption(OUTPUT))
        {
            System.out.println("args: " + Arrays.toString(args));
            HelpFormatter formatter = new HelpFormatter();
            formatter.setWidth(120);
            formatter.printHelp(this.getClass().getName(), options);
            ToolRunner.printGenericCommandUsage(System.out);
            return -1;
        }

        // Create a new Map Reduce Job
        Configuration conf = new Configuration();
        Job job = new Job(conf);
        String inputPath = cmdline.getOptionValue(INPUT);
        String outputPath = cmdline.getOptionValue(OUTPUT);
        int reduceTasks = cmdline.hasOption(NUM_REDUCERS) ? Integer
                .parseInt(cmdline.getOptionValue(NUM_REDUCERS)) : 1;

        // Set the name of the Job and the class it is in
        job.setJobName("BasicInvertedIndex");
        job.setJarByClass(BasicInvertedIndex.class);
        job.setNumReduceTasks(reduceTasks);

        // Set the Mapper and Reducer class (no need for combiner here)
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        // Set the Output Classes
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(PairOfWritables.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(PairOfWritables.class);

        // Set the input and output file paths
        FileInputFormat.setInputPaths(job, new Path(inputPath));
        FileOutputFormat.setOutputPath(job, new Path(outputPath));

        // Time the job whilst it is running
        long startTime = System.currentTimeMillis();
        job.waitForCompletion(true);
        LOG.info("Job Finished in " + (System.currentTimeMillis() - startTime)
                / 1000.0 + " seconds");

        // Returning 0 lets everyone know the job was successful
        return 0;
    }

    public static void main(String[] args) throws Exception
    {
        ToolRunner.run(new BasicInvertedIndex(), args);
    }
}
