Word Count - CLASS
Counting the number of occurrences of words in a text is one of the most
popular first exercises when learning Map-Reduce programming. It is the
equivalent of Hello World! in regular programming.
We will do it two ways: a simpler way, where sorting is done after the RDD
is collected, and a more "sparky" way, where the sorting is also done using
an RDD.
In [3]: %%time
text_file = sc.textFile(data_dir+'/'+filename)
type(text_file)
CPU times: user 1.41 ms, sys: 1.47 ms, total: 2.88 ms
Wall time: 422 ms
Counting the words
Split each line by spaces.
Map each word to (word,1).
Count the number of occurrences of each word.
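The three steps above can be sketched in plain Python, without Spark. This is only an illustration of the same flatMap / map / reduceByKey logic; the `lines` list here is a hypothetical stand-in for the lines of the text file.

```python
# Hypothetical input lines standing in for the text file's contents.
lines = ["Call me Ishmael", "me me Call"]

# flatMap: split each line into words, flattening the results into one list.
words = [w for line in lines for w in line.split(" ") if w != '']

# map: turn each word into a (word, 1) pair.
pairs = [(w, 1) for w in words]

# reduceByKey: sum the 1s for each distinct word.
counts = {}
for w, n in pairs:
    counts[w] = counts.get(w, 0) + n

print(sorted(counts.items()))
```

The difference in Spark is that each step is distributed across partitions, and `reduceByKey` combines partial sums per key across the cluster.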
In [4]: %%time
counts = text_file.flatMap(lambda line: line.split(" ")) \
.filter(lambda x: x!='')\
.map(lambda word: (word, 1)) \
.reduceByKey(lambda a, b: a + b)
type(counts)
CPU times: user 9.68 ms, sys: 3.99 ms, total: 13.7 ms
Wall time: 168 ms
Have a look at the execution plan
Note that the earliest node in the dependency graph is the file
../../Data/Moby-Dick.txt.
In [6]: %%time
Count=counts.count()
Sum=counts.map(lambda (w,i): i).reduce(lambda x,y:x+y)
print 'Count=%f, sum=%f, mean=%f'%(Count,Sum,float(Sum)/Count)
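The cell above computes the number of distinct words, the total number of words, and their ratio. A plain-Python analogue, using a small hypothetical `counts` dictionary in place of the RDD:

```python
# Hypothetical (word -> count) result, standing in for the counts RDD.
counts = {'Call': 2, 'me': 3, 'Ishmael': 1}

Count = len(counts)            # number of distinct words (counts.count())
Sum = sum(counts.values())     # total number of words (the reduce over counts)
mean = float(Sum) / Count      # average occurrences per distinct word

print('Count=%f, sum=%f, mean=%f' % (Count, Sum, mean))
```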
In [7]: %%time
C=counts.collect()
print type(C)
<type 'list'>
CPU times: user 43.9 ms, sys: 7.95 ms, total: 51.9 ms
Wall time: 129 ms
Sort
In [10]: %%time
RDD=text_file.flatMap(lambda x: x.split(' '))\
.filter(lambda x: x!='')\
.map(lambda word: (word,1))
In [11]: %%time
RDD1=RDD.reduceByKey(lambda x,y:x+y)
CPU times: user 8.67 ms, sys: 2.94 ms, total: 11.6 ms
Wall time: 20.5 ms
Step 3 Reverse (word,count) to (count,word) and sort by key
In [12]: %%time
RDD2=RDD1.map(lambda (c,v):(v,c))
RDD3=RDD2.sortByKey(False)
CPU times: user 18.1 ms, sys: 5.12 ms, total: 23.2 ms
Wall time: 430 ms
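The swap-and-sort step can also be sketched in plain Python. The word counts below are hypothetical; `sortByKey(False)` in Spark corresponds to a descending sort on the first element of each pair.

```python
# Hypothetical (word, count) pairs standing in for RDD1.
word_counts = [('the', 10), ('whale', 4), ('Ahab', 7)]

# Step 3a: reverse each (word, count) pair to (count, word).
count_word = [(c, w) for (w, c) in word_counts]

# Step 3b: sortByKey(False) -- sort descending by count.
count_word.sort(reverse=True)

print(count_word)
```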
Full execution plan
We now have a complete plan to compute the most common words in the
text. Nothing has been executed yet! Not even one byte has been
read from the file Moby-Dick.txt !
For more on execution plans and lineage see Jacek Laskowski's blog
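As an analogy (not Spark itself), Python generators behave the same lazy way: building the pipeline does no work, and nothing runs until a result is actually requested.

```python
def numbers():
    # This print marks the moment actual execution begins.
    print("reading...")
    for i in range(5):
        yield i

# Build the "plan": a generator expression over numbers().
# Nothing has been printed or computed yet.
pipeline = (x * x for x in numbers())

# Only now, when we ask for the results, does "reading..." appear.
result = list(pipeline)
print(result)
```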
RDD3:
(2) PythonRDD[19] at RDD at PythonRDD.scala:43 []
| MapPartitionsRDD[18] at mapPartitions at PythonRDD.scala:374 []
| ShuffledRDD[17] at partitionBy at NativeMethodAccessorImpl.java:-2 []
+-(2) PairwiseRDD[16] at sortByKey at <timed exec>:2 []
| PythonRDD[15] at sortByKey at <timed exec>:2 []
| MapPartitionsRDD[12] at mapPartitions at PythonRDD.scala:374 []
| ShuffledRDD[11] at partitionBy at NativeMethodAccessorImpl.java:-2 []
+-(2) PairwiseRDD[10] at reduceByKey at <timed exec>:1 []
| PythonRDD[9] at reduceByKey at <timed exec>:1 []
| ../../Data/Moby-Dick.txt MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:-2 []
| ../../Data/Moby-Dick.txt HadoopRDD[0] at textFile at NativeMethodAccessorImpl.java:-2 []
Step 4 Take the top 5 words. Only now does the computer execute the plan!
In [14]: %%time
C=RDD3.take(5)
print 'most common words\n','\n'.join(['%d:\t%s'%c for c in C])
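Since RDD3 is already sorted descending by count, `take(5)` just returns the first five pairs. A plain-Python analogue, with hypothetical numbers rather than the real Moby-Dick counts:

```python
# Hypothetical sorted (count, word) pairs, standing in for RDD3.
count_word = [(9, 'the'), (7, 'of'), (5, 'and'), (4, 'a'), (3, 'to'), (2, 'in')]

# take(5): just the first five elements of the sorted sequence.
top5 = count_word[:5]

print('most common words')
print('\n'.join('%d:\t%s' % c for c in top5))
```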