#LyX 1.4.3 created this file. For more info see http://www.lyx.org/
\lyxformat 245
\begin_document
\begin_header
\textclass amsbook
\begin_preamble
\input{preamble.tex}
\end_preamble
\language english
\inputencoding auto
\fontscheme default
\graphics default
\paperfontsize default
\spacing single
\papersize default
\use_geometry true
\use_amsmath 1
\cite_engine basic
\use_bibtopic false
\paperorientation portrait
\leftmargin 1.3in
\topmargin 1in
\rightmargin 1.3in
\bottommargin 1in
\secnumdepth 3
\tocdepth 3
\paragraph_separation indent
\defskip medskip
\quotes_language english
\papercolumns 1
\papersides 2
\paperpagestyle default
\tracking_changes false
\output_changes false
\end_header
\begin_body
\begin_layout Section
Dictionaries for counting words
\end_layout
\begin_layout Standard
A common task in text processing is to produce a count of word frequencies.
While NumPy has a built-in histogram function for numerical histograms,
it won't work out of the box for counting discrete items, since it is a
binning histogram for a range of real values.
\end_layout
\begin_layout Standard
Python, however, provides powerful string manipulation capabilities,
as well as a flexible and efficiently implemented built-in data type,
the
\emph on
dictionary
\emph default
, which together make this task a simple one.
\end_layout
\begin_layout Standard
In this problem, you will need to count the frequencies of all the words
contained in a compressed text file supplied as input.
\end_layout
\begin_layout Standard
The listing\InsetSpace ~
\begin_inset LatexCommand \ref{code:wordfreqs_skel}
\end_inset
contains a skeleton for this problem, with
\family typewriter
XXX
\family default
marking various places that are incomplete.
\end_layout
\begin_layout Standard
\begin_inset ERT
status open
\begin_layout Standard
\backslash
lstinputlisting[label=code:wordfreqs_skel,caption={IGNORED}]{skel/wordfreqs_skel.
py}
\end_layout
\end_inset
\end_layout
\begin_layout Subsection*
Hints
\end_layout
\begin_layout Itemize
The
\family typewriter
print_vk
\family default
function is already provided for you as a simple way to summarize your results.
\end_layout
\begin_layout Itemize
You will need to read the compressed file
\family typewriter
HISTORY.gz
\family default
.
Python has facilities to do this without having to manually uncompress
it.
\end_layout
\begin_layout Itemize
Treat `words' as simply the result of splitting the input text into a list,
using any form of whitespace as a separator.
This is obviously a very naive definition of `word', but it will suffice
for the purposes of this exercise.
\end_layout
\begin_layout Itemize
Python strings have a
\family typewriter
.split()
\family default
method that allows for very flexible splitting.
You can easily get more details on it in IPython:
\end_layout
\begin_layout Standard
\begin_inset ERT
status open
\begin_layout Standard
\backslash
begin{lstlisting}
\end_layout
\begin_layout Standard
In [2]: a = 'somestring'
\end_layout
\begin_layout Standard
\end_layout
\begin_layout Standard
In [3]: a.split?
\end_layout
\begin_layout Standard
Type: builtin_function_or_method
\end_layout
\begin_layout Standard
Base Class: <type 'builtin_function_or_method'>
\end_layout
\begin_layout Standard
Namespace: Interactive
\end_layout
\begin_layout Standard
Docstring:
\end_layout
\begin_layout Standard
S.split([sep [,maxsplit]]) -> list of strings
\end_layout
\begin_layout Standard
\end_layout
\begin_layout Standard
Return a list of the words in the string S, using sep as the
\end_layout
\begin_layout Standard
delimiter string.
If maxsplit is given, at most maxsplit
\end_layout
\begin_layout Standard
splits are done.
If sep is not specified or is None, any
\end_layout
\begin_layout Standard
whitespace string is a separator.
\end_layout
\begin_layout Standard
\backslash
end{lstlisting}
\end_layout
\end_inset
\end_layout
\begin_layout Standard
The complete set of methods of Python strings can be viewed by hitting the
TAB key in IPython after typing `
\family typewriter
a.
\family default
', and each of them can be similarly queried with the `
\family typewriter
?
\family default
' operator as above.
For more details on Python strings and their companion sequence types,
see
\begin_inset LatexCommand \htmlurl{http://docs.python.org/lib/typesseq.html}
\end_inset
.
\end_layout
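\begin_layout Standard
As a rough illustration of the hints above, the following sketch shows one
possible way to combine them; it is not the reference solution for the skeleton.
It assumes Python 3, and both the helper name
\family typewriter
count_words
\family default
and the small file
\family typewriter
sample.txt.gz
\family default
are hypothetical, introduced only so that the example is self-contained.
\end_layout
\begin_layout Standard
\begin_inset ERT
status open
\begin_layout Standard
\backslash
begin{lstlisting}
\end_layout
\begin_layout Standard
import gzip
\end_layout
\begin_layout Standard
import os
\end_layout
\begin_layout Standard
import tempfile
\end_layout
\begin_layout Standard
\end_layout
\begin_layout Standard
def count_words(text):
\end_layout
\begin_layout Standard
    """Return a dict mapping each whitespace-separated word to its count."""
\end_layout
\begin_layout Standard
    freqs = {}
\end_layout
\begin_layout Standard
    for word in text.split():
\end_layout
\begin_layout Standard
        # dict.get returns 0 for a word that has not been seen yet
\end_layout
\begin_layout Standard
        freqs[word] = freqs.get(word, 0) + 1
\end_layout
\begin_layout Standard
    return freqs
\end_layout
\begin_layout Standard
\end_layout
\begin_layout Standard
# Write a tiny compressed file (hypothetical name) to a temporary directory.
\end_layout
\begin_layout Standard
sample = 'the cat and the hat and the bat'
\end_layout
\begin_layout Standard
path = os.path.join(tempfile.mkdtemp(), 'sample.txt.gz')
\end_layout
\begin_layout Standard
with gzip.open(path, 'wt') as f:
\end_layout
\begin_layout Standard
    f.write(sample)
\end_layout
\begin_layout Standard
\end_layout
\begin_layout Standard
# gzip.open decompresses on the fly, so no manual uncompress step is needed.
\end_layout
\begin_layout Standard
with gzip.open(path, 'rt') as f:
\end_layout
\begin_layout Standard
    freqs = count_words(f.read())
\end_layout
\begin_layout Standard
\end_layout
\begin_layout Standard
print(freqs['the'], freqs['and'], freqs['cat'])
\end_layout
\begin_layout Standard
\backslash
end{lstlisting}
\end_layout
\end_inset
\end_layout
\begin_layout Standard
For the sample text this prints the counts 3, 2 and 1.
Using
\family typewriter
.get()
\family default
with a default of 0 avoids a separate test for whether a word is already
a key in the dictionary.
\end_layout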
\end_body
\end_document