Unix Text Analysis
Selecting Rows
Next, we will select some data using grep. Try the following command
grep '^R' xresults.csv
This will display only three rows of the file. The expression in quotes is the search string. If we wish we can direct the output of this process to a new file, like this
grep '^R' xresults.csv > outputfile.txt
This command line uses the redirect output symbol. In Unix the default output destination is the screen,
and it’s known as stdout (when it needs naming). The default input source is the keyboard, known as
stdin. So when data is coming from or going to anywhere else, we use redirection. We use redirection
with > to pass the results of a process to a new output or with < to get data from a new input. If we want
to append the data to the end of an existing file (as new rows) we use >> instead of >.
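For example, to collect the rows beginning with R in a new file and then append the rows beginning with B to it (selected.txt is just an illustrative name):
grep '^R' xresults.csv > selected.txt
grep '^B' xresults.csv >> selected.txt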
We can use a similar command line to count the rows selected, but this time let's change the grep command slightly.
grep '^[RB]' xresults.csv | wc -l
This command line uses the pipe symbol. We use piping with | to pass the results of one process to another process. If we only wanted a count of the lines that match then instead of piping the result to wc we could use the -c parameter on grep, like this
grep -c '^[RB]' xresults.csv
Also notice that in the cases above we have used the anchor ^ to limit the match by grep to the start of a line. The anchor $ limits the search to matches at the end of the line. We have used a character class, indicated by [ and ], containing R and B; grep will succeed if it finds any of the characters in the class. We enclose the regular expression in single quotes.
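For example
echo hello | grep 'o$'
matches, but
echo hello | grep 'h$'
does not, because the h occurs at the start of the line, not the end.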
We use grep in this way to select row data.
The next thing to learn is how to match repetition of a character rather than a single, specific character. Consider the regular expression a*, which matches

a
aa
aaa
and so on. Notice the blank line there? Probably not, but it's there: this regular expression matches zero or more instances of the preceding character, so it matches the empty string as well as any run of the letter a.
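You can see this with echo:
echo bbb | grep 'a*b'
The line still matches even though it contains no a at all, because a* is satisfied by zero instances.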
Suppose that I wish to find a string that contains any sequence of characters followed by, for example, m. The grep command would be
grep '.*m' xresults.csv
This is a greedy search: it is not satisfied with the very first successful match, but continues past the first match it finds to match the longest string it can. For now we will just accept this greedy searching, but if you investigate regular expressions further you will discover that some versions have non-greedy matching strategies available.
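If your grep supports the -o option (GNU grep does), which prints only the matched text, you can see the greediness directly:
echo 'team form' | grep -o '.*m'
prints team form rather than stopping at the first m in team.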
Selecting Columns
We can also select columns. Because this is a delimited file we can split it into columns at each
delimiter - in this case a comma. This is equivalent to selecting fields from records.
Suppose that we want to extract column two from our data. We do this with the cut command. Here’s
an example
cut -d, -f2 xresults.csv | head
The first ten lines of the resulting display are
xx
xx
xx
xx
xx
xx
xx
xx
xx
xx
We can display several columns like this
cut -d, -f1-3 xresults.csv
which displays a contiguous range of columns, or
cut -d, -f1,3 xresults.csv
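We can of course combine row and column selection with a pipe; for example
grep '^R' xresults.csv | cut -d, -f1-3
shows the first three columns of just those rows beginning with R.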
Transforming Data
There is another comma delimited file called results.csv; its first field is a surname and its third field is a grade. Currently the grade is expressed as an alphabetic character. You should check this by viewing the surnames and grades from this file. The command is
cut -d, -f1,3 results.csv
We can translate the alphabetic grade into a numeric grade (1=A, 2=B etc) with the command tr. Try this
tr ',A' ',1' < results.csv
Notice that I included a leading comma in the search and replace strings, intending to catch just the field containing A. Be careful, though: tr translates character by character, not string by string, so the command above simply maps comma to comma and A to 1, and will replace every capital A in the file, including any in the surnames. tr has no anchors or notion of context; to replace only a grade at the end of a line you need a regular expression tool such as sed, which we meet below.
In the example tr gets its input from the file by redirection. You can perform a multiple translation by including more than one pair on the command line. For example
tr ',A ,B ,C' ',1 ,2 ,3' < results.csv | less
Since tr maps character for character, this is equivalent to the simpler tr 'ABC' '123' < results.csv | less.
You can use special characters in a tr command. For example to search for or replace a tab there are
two methods:
1. use the escape string \t to represent the tab
2. at the position in the command line where you want to insert a tab, first type control-v (^v) and
then press the tab key.
There are a number of different escape sequences (1 above) and control sequences (2 above) to represent special characters: for example \n represents a new line (the ^M you sometimes see at line ends is the carriage return, \r, left by files created on Windows), and the class [:space:] stands for the white space characters. In general the escape sequence is easier to use.
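For example, to convert a tab separated file to a comma separated one (tabbed.txt and commas.csv are just illustrative names):
tr '\t' ',' < tabbed.txt > commas.csv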
Sorting
Alphabetically
Unix sorts alphabetically by default. This means that 100 comes before 11.
On Rows
You can sort with the command sort. For example
sort results.csv | less
This sorts the file in Unix order on each character of the entire line. The default alphanumeric sort order means that the numbers one to ten would be sorted like this: 1, 10, 2, 3, 4, 5, 6, 7, 8, 9.
Descending
You can sort in reverse order with the option -r. Like this
sort -r results.csv | less
Numerically
To force a numeric sort, use the option -n.
sort -n results.csv
You can use a sort on numeric data to get maximum and minimum values for a variable. Sort, then pipe to head -1 or tail -1, which produce the first and last records in the file respectively.
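For example, taking the mark to be the second field of results.csv:
sort -t, -k2,2n results.csv | head -1
sort -t, -k2,2n results.csv | tail -1
display the rows with the lowest and the highest mark respectively.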
On Columns
To sort on columns you must specify a delimiter, with -t and a field number with -k. To sort on the third
column of the results data, try this
sort -n -t ',' -k3 results.csv | less
(I've used a slightly more verbose method of specifying the delimiter here). You can select rows after sorting, like this
sort -n -t ',' -k3 results.csv | grep '^A' | less
which shows those pupils with surnames beginning with A, sorted on the third field of the data file.
To sort on multiple columns we use more than one -k parameter. For example, to sort first on Maths score and then on surname we use
sort -t ',' -k2,2n -k1,1 xresults.csv | less
The n attached to -k2,2 makes just that key numeric; giving each key a start and end field stops it running on to the end of the line.
Paste
Paste has two modes of operation depending on the option selected. The first is the simplest: paste takes two files, treats each as a column of data, and joins them line by line, so each line of the second file is appended to the corresponding line of the first, separated by a tab. The command is
paste first_file second_file
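For example, if names.txt holds one surname per line and marks.txt holds the corresponding marks (both illustrative names), then
paste names.txt marks.txt
produces a two column, tab separated listing.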
The other use of paste is to linearize a file. Suppose I have a file in the format
Jim
Tyson
UCL
Information Services
You can create this in a text editor. I can use paste to merge the four lines of data into one line
Jim Tyson UCL Information Services
The command is
paste -s file
As well as the -s option, I can add a delimiter character with -d. Try this
paste -d: -s file
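which produces
Jim:Tyson:UCL:Information Services
on a single line.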
Join
We have seen how to split a data file into different columns and we can also join two data files together.
To do this there must be a column of values that match in each file and the files must be sorted on the
field you are going to use to join them.
We start with files where for every row in file one there is a row in file two and vice versa.
Consider our two files, results.csv and xresults.csv. Each begins its rows with the pupil's surname, followed by that file's own columns of data. We can see then that these could be joined on the surname column with ease, since surname is unique.
After sorting both files we can do this with the command line
join -t, -j1 results.csv xresults.csv | less
The option -t specifies the delimiter and -j allows us to specify a single field number when the shared field is in the same position in both files.
If the columns on which to match for joining don't appear in the same position in each file, you can give a field number for each file separately: -1 m names the join field of the first file on the command line and -2 m that of the second. In fact, we could write
join -t, -1 1 -2 1 results.csv xresults.csv | less
for the same result as our previous join command.
Essentially, join matches lines on the chosen fields and adds column data. We could send the resulting
output to a new file with > if we wished.
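Putting the whole process together (sorted1.csv, sorted2.csv and combined.csv are just illustrative names):
sort -t, -k1,1 results.csv > sorted1.csv
sort -t, -k1,1 xresults.csv > sorted2.csv
join -t, -j1 sorted1.csv sorted2.csv > combined.csv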
sed
Sed is a powerful Unix tool and there are books devoted to explaining it. The name stands for stream editor, a reminder that it reads and processes files line by line. One of the basic uses of sed is to search a file - much like grep does - and replace the search expression with some other text specified by the user. An example may make this clearer
sed 's/abc/def/g' input
After the command name we have s for substitute, followed by the search string and then the replacement string, surrounded and separated by /, and finally g indicating that the operation is global - we want to process every occurrence of abc in the file, not just the first on each line. The filename follows, in this case a file called input.
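For example, to replace the comma delimiter in our results file with a semicolon throughout:
sed 's/,/;/g' results.csv | less
sed can do much more than substitution; here are a few well known one-liners.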
sed 'n;d'
This removes double line spacing - and does it in a rather crafty way. Assuming that the first line read is not blank, every even-numbered line should be blank, so alternately printing a line and deleting a line results in a single spaced file.
sed '/regex/{x;p;x;}'
This command puts a blank line before every occurrence of the search string regex.
sed -n '1~2p'
This command prints only the odd-numbered lines of a file, so in effect it deletes the even lines. (The first~step form of address is a GNU sed extension.)
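Its counterpart, again a GNU sed extension,
sed -n '0~2p'
prints only the even-numbered lines.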
I leave the investigation of more sed wizardry to you.
awk
We can construct more complex processes quite easily with awk, a small text processing language. The following code won't be difficult to understand if you know any mainstream programming language
cut -d, -f2- xresults.csv | awk -F, '{sum = 0; for (i = 1; i <= NF; i++) sum += $i; print sum}'
This code takes the numeric fields (everything from the second column on), sums them across each line and prints the result, one total per pupil. Note the -F, option: awk splits lines at white space by default, so we must tell it to split at commas instead.
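Like Perl below, awk can also keep a running total down a column; for example
cut -d, -f2 results.csv | awk '{s += $1} END {print s}'
sums the column of marks.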
As with sed there is a website for useful awk one-liners by Eric Pement at
http://www.pement.org/awk/awk1line.txt
Perl
Perl programs can be run straight from the command line with the -e option, and Perl's compact conditional expression suits one-liners well. For example
$number >= 4 ? print $number : print "less than four"
reads as: if $number is greater than or equal to four, print $number, else print the string "less than four".
The real value of in-line programming comes when we learn that we can loop through the output of other command line operations and execute Perl code on each line of input. We do this with the option -n. Here is an example
cut -d, -f2 results.csv | perl -ne '$_ >= 55 ? print "well done\n" : print "what a shame\n"'
Or we could do some mathematics
cut -d, -f2 results.csv | perl -ne '$n += $_; END { print "$n\n" }'
which will sum the column of numbers.
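A small extension of the same idea gives an average, where $c simply counts the lines:
cut -d, -f2 results.csv | perl -ne '$n += $_; $c++; END { print $n/$c, "\n" }'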
Another very useful Perl function for command line use is split. In Perl, split takes a string and divides it
into separate data items at a delimiter character and then it puts the results into an array.
To illustrate this try the following
perl -ne 'chomp; @fields = split(/,/, $_); print $fields[0], "\t", $fields[1], "\t", $fields[2], "\n"' results.csv
In this example each input line ($_ in Perl) is first stripped of its trailing new line by chomp and then split at the commas (/,/ - the slashes are delimiters that distinguish the comma we pass as a pattern from the commas separating split's arguments). The print statement then outputs the first three fields of the resulting array separated by tabs and ends with a new line. This example uses the escape sequence for tab again: \t.
Final Exercise
To practise and consolidate, try the following
1. Take the original results.csv data and find the average mark for each column of examination
marks. Can you see a way to write this set of values to the end of the file on a row labelled
averages?
2. Take the original results.csv and find the average examination mark for each pupil. Can you
add this new column of data to the original file? (Hint: > outputs the data from a process to a
new file but >> appends it to the end of an existing file)
3. Take the original results.csv and find the average examination mark for each pupil and on the
basis of the following rule, assign them to a stream
Create a new file that includes these two new data items for each pupil.
THE END.