#LyX 1.4.3 created this file. For more info see http://www.lyx.org/
\lyxformat 245
\begin_document
\begin_header
\textclass amsbook
\begin_preamble
\input{preamble.tex}
\end_preamble
\language english
\inputencoding auto
\fontscheme default
\graphics default
\paperfontsize default
\spacing single
\papersize default
\use_geometry true
\use_amsmath 1
\cite_engine basic
\use_bibtopic false
\paperorientation portrait
\leftmargin 1.3in
\topmargin 1in
\rightmargin 1.3in
\bottommargin 1in
\secnumdepth 3
\tocdepth 3
\paragraph_separation indent
\defskip medskip
\quotes_language english
\papercolumns 1
\papersides 2
\paperpagestyle default
\tracking_changes false
\output_changes false
\end_header

\begin_body

\begin_layout Section
Dictionaries for counting words
\end_layout

\begin_layout Standard
A common task in text processing is to produce a count of word frequencies.
 While NumPy has a built-in histogram function for numerical data,
 it won't work out of the box for counting discrete items, since it bins
 a range of real values rather than tallying distinct objects.
\end_layout

\begin_layout Standard
Python, however, provides powerful string manipulation capabilities,
 as well as a flexible and efficiently implemented built-in data type,
 the 
\emph on
dictionary
\emph default
, which together make this task a simple one.
\end_layout
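
\begin_layout Standard
As a minimal sketch of the idea, and not the solution to the skeleton,
 the following counts discrete items with a dictionary.
 The function name 
\family typewriter
count_items
\family default
 and the sample sentence are illustrative assumptions, not part of the
 exercise files.
\end_layout

```python
def count_items(items):
    """Return a dict mapping each distinct item to its number of occurrences."""
    counts = {}
    for item in items:
        # dict.get returns 0 the first time an item is seen
        counts[item] = counts.get(item, 0) + 1
    return counts


# Illustrative input only; the exercise itself reads words from a file.
sample = "the cat and the hat and the bat".split()
freqs = count_items(sample)
print(freqs["the"], freqs["and"], freqs["cat"])
```

\begin_layout Standard
Unlike a binning histogram, the dictionary keys are the items themselves,
 so no range or bin width has to be chosen.
\end_layout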

\begin_layout Standard
In this problem, you will need to count the frequencies of all the words
 contained in a compressed text file supplied as input.
 
\end_layout

\begin_layout Standard
The listing\InsetSpace ~

\begin_inset LatexCommand \ref{code:wordfreqs_skel}

\end_inset

 contains a skeleton for this problem, with 
\family typewriter
XXX
\family default
 marking various places that are incomplete.
 
\end_layout

\begin_layout Standard
\begin_inset ERT
status open

\begin_layout Standard


\backslash
lstinputlisting[label=code:wordfreqs_skel,caption={IGNORED}]{skel/wordfreqs_skel.
py}
\end_layout

\end_inset


\end_layout

\begin_layout Subsection*
Hints
\end_layout

\begin_layout Itemize
The 
\family typewriter
print_vk 
\family default
function is already provided for you as a simple way to summarize your results.
\end_layout

\begin_layout Itemize
You will need to read the compressed file 
\family typewriter
HISTORY.gz
\family default
.
 Python's standard library can read it directly, so there is no need to
 uncompress it by hand.
\end_layout

\begin_layout Itemize
Consider `words' simply the result of splitting the input text into a list,
 using any form of whitespace as a separator.
 This is obviously a very naïve definition of `word', but it shall suffice
 for the purposes of this exercise.
\end_layout

\begin_layout Itemize
Python strings have a 
\family typewriter
.split()
\family default
 method that allows for very flexible splitting.
 You can easily get more details on it in IPython:
\end_layout

\begin_layout Standard
\begin_inset ERT
status open

\begin_layout Standard


\backslash
begin{lstlisting}
\end_layout

\begin_layout Standard

In [2]: a = 'somestring'
\end_layout

\begin_layout Standard

\end_layout

\begin_layout Standard

In [3]: a.split?
\end_layout

\begin_layout Standard

Type:           builtin_function_or_method
\end_layout

\begin_layout Standard

Base Class:     <type 'builtin_function_or_method'>
\end_layout

\begin_layout Standard

Namespace:      Interactive
\end_layout

\begin_layout Standard

Docstring:
\end_layout

\begin_layout Standard

    S.split([sep [,maxsplit]]) -> list of strings
\end_layout

\begin_layout Standard

\end_layout

\begin_layout Standard

    Return a list of the words in the string S, using sep as the
\end_layout

\begin_layout Standard

    delimiter string.
  If maxsplit is given, at most maxsplit
\end_layout

\begin_layout Standard

    splits are done.
 If sep is not specified or is None, any
\end_layout

\begin_layout Standard

    whitespace string is a separator.
\end_layout

\begin_layout Standard


\backslash
end{lstlisting}
\end_layout

\end_inset


\end_layout
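
\begin_layout Standard
The behaviour described in the docstring can be checked on a small string.
 This is only a sketch; the variable names and the sample text are made
 up for illustration.
\end_layout

```python
# Sample text with mixed whitespace: double space, tab, newline, trailing space.
text = "one  two\tthree\nfour "

# No separator: any run of whitespace separates words, and no empty
# strings are produced.
words = text.split()

# Explicit single-space separator: consecutive spaces yield empty strings,
# and tabs/newlines are not treated as separators.
by_space = text.split(" ")

# maxsplit=1: only the first split is performed.
limited = text.split(None, 1)

print(words)
print(by_space)
print(limited)
```

\begin_layout Standard
For the word-counting task, calling 
\family typewriter
.split()
\family default
 with no arguments matches the definition of `word' given in the hints.
\end_layout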

\begin_layout Standard
The complete set of methods of Python strings can be viewed by hitting the
 TAB key in IPython after typing `
\family typewriter
a.
\family default
', and each of them can be similarly queried with the `
\family typewriter
?
\family default
' operator as above.
 For more details on Python strings and their companion sequence types,
 see 
\begin_inset LatexCommand \htmlurl{http://docs.python.org/lib/typesseq.html}

\end_inset

.
\end_layout
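
\begin_layout Standard
One possible way to read a compressed file without uncompressing it first
 is the standard 
\family typewriter
gzip
\family default
 module.
 The sketch below is hedged: the helper name 
\family typewriter
read_words
\family default
, the temporary demo file, and the UTF-8 decoding are assumptions made
 for illustration, and the demo file merely stands in for 
\family typewriter
HISTORY.gz
\family default
.
\end_layout

```python
import gzip
import os
import tempfile


def read_words(path):
    """Return the whitespace-separated words of a gzip-compressed text file."""
    with gzip.open(path, "rb") as f:
        data = f.read()
    # Assumption: the text is UTF-8; undecodable bytes are replaced.
    return data.decode("utf-8", "replace").split()


# Build a tiny compressed file so the example is self-contained.
tmpdir = tempfile.mkdtemp()
demo_path = os.path.join(tmpdir, "demo.gz")
with gzip.open(demo_path, "wb") as f:
    f.write(b"to be or not to be\n")

words = read_words(demo_path)
print(words)
```

\begin_layout Standard
The resulting list of words can then be tallied with a dictionary, as the
 section title suggests.
\end_layout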

\end_body
\end_document