\section{Dictionaries for counting words}
A common task in text processing is to produce a count of word frequencies.
While NumPy has a built-in histogram function for numerical histograms,
it won't work out of the box for counting discrete items, since it
is a binning histogram for a range of real values.
But the Python language provides very powerful string manipulation
capabilities, as well as a very flexible and efficiently implemented
built-in data type, the \emph{dictionary}, which together make this task
a simple one.
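
As a rough illustration of why a dictionary suits this kind of counting,
the sketch below tallies the items of a small, made-up list. The names
\texttt{words} and \texttt{counts} are chosen here for illustration only
and are not taken from the exercise skeleton.
\begin{lstlisting}
# Count how often each item occurs in a small, made-up list.
words = ['spam', 'eggs', 'spam', 'ham', 'spam', 'eggs']

counts = {}
for word in words:
    # dict.get returns the current count, or 0 if the word is new
    counts[word] = counts.get(word, 0) + 1

print(counts['spam'])   # prints 3
\end{lstlisting}
Each distinct word becomes a key and its count the associated value, so
no binning of a numerical range is needed.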

In this problem, you will need to count the frequencies of all the
words contained in a compressed text file supplied as input.
Listing~\ref{code:wordfreqs} contains a skeleton for this
problem, with \texttt{XXX} marking the places that are incomplete.
\lstinputlisting[label=code:wordfreqs,caption={IGNORED}]{examples/wordfreqs.py}
\subsection*{Hints}
\begin{itemize}
\item The \texttt{print\_vk} function is already provided for you as a simple
way to summarize your results.
\item You will need to read the compressed file \texttt{HISTORY.gz}. Python
has facilities to do this without having to manually uncompress it.
\item Consider `words' simply the result of splitting the input text into
a list, using any form of whitespace as a separator. This is obviously
a very na\"{\i}ve definition of `word', but it shall suffice for the purposes
of this exercise.
\item Python strings have a \texttt{.split()} method that allows for very
flexible splitting. You can easily get more details on it in IPython:
\end{itemize}
\begin{lstlisting}
In [2]: a = 'somestring'
In [3]: a.split?
Type: builtin_function_or_method
Base Class: <type 'builtin_function_or_method'>
Namespace: Interactive
Docstring:
S.split([sep [,maxsplit]]) -> list of strings
Return a list of the words in the string S, using sep as the
delimiter string. If maxsplit is given, at most maxsplit
splits are done. If sep is not specified or is None, any
whitespace string is a separator.
\end{lstlisting}
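
To make the docstring above concrete, here is a small sketch comparing the
default whitespace splitting with an explicit separator and with
\texttt{maxsplit}. The sample string and the variable names are made up for
illustration.
\begin{lstlisting}
s = 'a  b\tc\nd'   # two spaces, a tab and a newline as separators

# No separator: any run of whitespace separates words
default_split = s.split()          # ['a', 'b', 'c', 'd']

# Explicit single space: every space splits, so the double
# space yields an empty string, and tab/newline are kept
space_split = s.split(' ')         # ['a', '', 'b\tc\nd']

# sep=None with maxsplit=1: split only once, on whitespace
one_split = s.split(None, 1)       # ['a', 'b\tc\nd']

print(default_split)
\end{lstlisting}
For the definition of `word' used in this exercise, the first form
(no arguments) is the relevant one.
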
The complete set of methods of Python strings can be viewed by hitting
the TAB key in IPython after typing `\texttt{a.}', and each of them
can be similarly queried with the `\texttt{?}' operator as above.
For more details on Python strings and their companion sequence types,
see \url{http://docs.python.org/lib/typesseq.html}.
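
As one possible way to read compressed text without uncompressing it by
hand, the sketch below uses the standard library \texttt{gzip} module. It
assumes Python~3 (for the text modes \texttt{'wt'} and \texttt{'rt'}), and it
first writes a tiny file named \texttt{sample.gz}, a hypothetical stand-in
for \texttt{HISTORY.gz}, so that the example runs on its own.
\begin{lstlisting}
import gzip
import os
import tempfile

# Create a tiny compressed file so the example is self-contained;
# sample.gz is a made-up stand-in for the real input file.
path = os.path.join(tempfile.mkdtemp(), 'sample.gz')
with gzip.open(path, 'wt') as f:
    f.write('the quick  brown\tfox\njumps over the lazy dog\n')

# Mode 'rt' decompresses on the fly and returns text
with gzip.open(path, 'rt') as f:
    text = f.read()

# Split on any whitespace to obtain the list of words
words = text.split()
print(len(words))   # prints 9
\end{lstlisting}
The resulting list can then be fed to whatever counting scheme you choose
for the exercise.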