Word Count - CLASS

This document demonstrates counting word frequencies in a text file using MapReduce programming in two ways: (1) by collecting the results and sorting locally, and (2) by performing the entire computation with Spark operations. It reads a text file into an RDD, splits it into words, counts the occurrences of each word, and finds the most frequent words either by collecting and sorting or by a chain of Spark operations including reduceByKey, sorting, and taking the top results.

Uploaded by

beto24
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
62 views15 pages

Word Count - CLASS

This document demonstrates counting word frequencies in a text file using MapReduce programming in two ways: (1) by collecting the results and sorting locally, and (2) by performing the entire computation using Spark operations. It reads a text file into an RDD, splits into words, counts occurrences of each word, and finds the most frequent words either by collecting and sorting or by multiple Spark operations including reduceByKey, sorting, and taking the top results.

Uploaded by

beto24
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 15

Word Count
Counting the number of occurrences of words in a text is one of the most
popular first exercises when learning Map-Reduce programming. It is the
equivalent of Hello World! in regular programming.

We will do it two ways: a simpler way, where the sorting is done after the RDD
is collected, and a more Spark-like way, where the sorting is also done using
an RDD.

Read text into an RDD


Download data file from S3
In [2]: %%time
import urllib
data_dir='../../Data'
filename='Moby-Dick.txt'
f = urllib.urlretrieve("https://mas-dse-open.s3.amazonaws.com/"+filename, data_dir+'/'+filename)

# First, check that the text file is where we expect it to be


!ls -l $data_dir/$filename

-rw-r--r-- 1 yoavfreund staff 1257260 Apr 10 21:33 ../../Data/Moby-Dick.txt


CPU times: user 37.2 ms, sys: 35.2 ms, total: 72.4 ms
Wall time: 3.5 s
Define an RDD that will read the file
Note that, as execution is lazy, this does not necessarily mean that the actual
reading of the file content has occurred.

In [3]: %%time
text_file = sc.textFile(data_dir+'/'+filename)
type(text_file)

CPU times: user 1.41 ms, sys: 1.47 ms, total: 2.88 ms
Wall time: 422 ms
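To make the laziness concrete, here is a small hypothetical follow-up cell (not part of the original notebook): defining a transformation such as filter returns immediately, while calling an action such as count forces Spark to actually read the file.

# Transformations are lazy: this line returns at once, no file I/O happens.
whale_lines = text_file.filter(lambda line: 'whale' in line)
# Actions are eager: count() forces Spark to read and process the file.
print whale_lines.count()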
Counting the words
split each line by spaces.
map each word to (word,1).
count the number of occurrences of each word.

In [4]: %%time
counts = text_file.flatMap(lambda line: line.split(" ")) \
.filter(lambda x: x!='')\
.map(lambda word: (word, 1)) \
.reduceByKey(lambda a, b: a + b)
type(counts)

CPU times: user 9.68 ms, sys: 3.99 ms, total: 13.7 ms
Wall time: 168 ms
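As a sanity check, the same three-step pipeline can be run on a tiny in-memory example (a hypothetical cell, shown only for illustration):

toy = sc.parallelize(["to be or", "not to be"])
toy_counts = toy.flatMap(lambda line: line.split(" ")) \
                .filter(lambda x: x != '') \
                .map(lambda word: (word, 1)) \
                .reduceByKey(lambda a, b: a + b)
# Expected result (order may vary): [('to', 2), ('be', 2), ('or', 1), ('not', 1)]
print toy_counts.collect()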
Have a look at the execution plan
Note that the earliest node in the dependency graph is the file
../../Data/Moby-Dick.txt.

In [5]: print counts.toDebugString()

(2) PythonRDD[6] at RDD at PythonRDD.scala:43 []
 |  MapPartitionsRDD[5] at mapPartitions at PythonRDD.scala:374 []
 |  ShuffledRDD[4] at partitionBy at NativeMethodAccessorImpl.java:-2 []
 +-(2) PairwiseRDD[3] at reduceByKey at <timed exec>:1 []
 |  PythonRDD[2] at reduceByKey at <timed exec>:1 []
 |  ../../Data/Moby-Dick.txt MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:-2 []
 |  ../../Data/Moby-Dick.txt HadoopRDD[0] at textFile at NativeMethodAccessorImpl.java:-2 []
Count!
Finally we count the number of distinct words and the total number of word
occurrences. Only now does the lazy execution model perform some actual work,
which takes a significant amount of time.

In [6]: %%time
Count=counts.count()
Sum=counts.map(lambda (w,i): i).reduce(lambda x,y:x+y)
print 'Count=%f, sum=%f, mean=%f'%(Count,Sum,float(Sum)/Count)

Count=33782.000000, sum=215133.000000, mean=6.368273


CPU times: user 10.2 ms, sys: 4.53 ms, total: 14.7 ms
Wall time: 1.35 s
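The same quantities can also be computed with the RDD methods values() and sum(); this is an assumed alternative, not the notebook's original code:

Count = counts.count()           # number of distinct words
Sum = counts.values().sum()      # total number of word occurrences
print 'Count=%d, sum=%d, mean=%f'%(Count, Sum, float(Sum)/Count)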
Finding the most common words
counts: an RDD with 33782 pairs of the form (word,count).
Find the 5 most frequent words.
Method 1: collect and sort on the head node.
Method 2: pure Spark, collect only at the end.
Method 1: collect and sort on the head node

Collect the RDD into the driver node

Collect can take significant time.

In [7]: %%time
C=counts.collect()
print type(C)

<type 'list'>
CPU times: user 43.9 ms, sys: 7.95 ms, total: 51.9 ms
Wall time: 129 ms
Sort

The RDD has been collected into a list on the driver node.
We are no longer using Spark parallelism.
Sorting in Python will not scale to very large documents.
In [8]: C.sort(key=lambda x:x[1])


print 'most common words\n','\n'.join(['%s:\t%d'%c for c in C[-5:]])
print '\nLeast common words\n','\n'.join(['%s:\t%d'%c for c in C[:5]])

most common words


to: 4510
a: 4533
and: 5951
of: 6587
the: 13766

Least common words


funereal: 1
unscientific: 1
lime-stone,: 1
shouted,: 1
pitch-pot,: 1
Method 2: pure Spark, collect only at the end.
Collect into the head node only the most frequent words.
This requires multiple stages.
Step 1 split, clean and map to (word,1)
In [10]: %%time
RDD=text_file.flatMap(lambda x: x.split(' '))\
.filter(lambda x: x!='')\
.map(lambda word: (word,1))

CPU times: user 43 µs, sys: 13 µs, total: 56 µs


Wall time: 51 µs
Step 2 Count occurrences of each word.

In [11]: %%time
RDD1=RDD.reduceByKey(lambda x,y:x+y)

CPU times: user 8.67 ms, sys: 2.94 ms, total: 11.6 ms
Wall time: 20.5 ms
Step 3 Reverse (word,count) to (count,word) and sort by key in descending order (see the small illustration after this cell)

In [12]: %%time
RDD2=RDD1.map(lambda (c,v):(v,c))
RDD3=RDD2.sortByKey(False)

CPU times: user 18.1 ms, sys: 5.12 ms, total: 23.2 ms
Wall time: 430 ms
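The False argument asks sortByKey for descending order, so the largest counts come first. A minimal illustration on a toy RDD (hypothetical, not part of the original notebook):

pairs = sc.parallelize([(1, 'a'), (3, 'b'), (2, 'c')])
# sortByKey(False) sorts by key in descending order.
print pairs.sortByKey(False).collect()   # [(3, 'b'), (2, 'c'), (1, 'a')]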
Full execution plan

We now have a complete plan to compute the most common words in the
text. Nothing has been executed yet! Not even one byte has been
read from the file Moby-Dick.txt!

For more on execution plans and lineage, see Jacek Laskowski's blog.

In [13]: print 'RDD3:'


print RDD3.toDebugString()

RDD3:
(2) PythonRDD[19] at RDD at PythonRDD.scala:43 []
 |  MapPartitionsRDD[18] at mapPartitions at PythonRDD.scala:374 []
 |  ShuffledRDD[17] at partitionBy at NativeMethodAccessorImpl.java:-2 []
 +-(2) PairwiseRDD[16] at sortByKey at <timed exec>:2 []
 |  PythonRDD[15] at sortByKey at <timed exec>:2 []
 |  MapPartitionsRDD[12] at mapPartitions at PythonRDD.scala:374 []
 |  ShuffledRDD[11] at partitionBy at NativeMethodAccessorImpl.java:-2 []
 +-(2) PairwiseRDD[10] at reduceByKey at <timed exec>:1 []
 |  PythonRDD[9] at reduceByKey at <timed exec>:1 []
 |  ../../Data/Moby-Dick.txt MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:-2 []
 |  ../../Data/Moby-Dick.txt HadoopRDD[0] at textFile at NativeMethodAccessorImpl.java:-2 []
Step 4 Take the top 5 words. Only now does the computer execute the plan!

In [14]: %%time
C=RDD3.take(5)
print 'most common words\n','\n'.join(['%d:\t%s'%c for c in C])

most common words


13766: the
6587: of
5951: and
4533: a
4510: to
CPU times: user 11.7 ms, sys: 3.73 ms, total: 15.5 ms
Wall time: 171 ms
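For reference, the same top-5 list could be obtained without the explicit reverse-and-sort steps by using takeOrdered with a key function. This is a sketch of an alternative, not the notebook's original method:

# takeOrdered returns the smallest elements according to the key, so negating
# the count yields the most frequent words.
print RDD1.takeOrdered(5, key=lambda (w, c): -c)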
