Awk
Diane Barlow Close Arnold D. Robbins Paul H. Rubin Richard Stallman Piet van Oostrum
This is Edition 1.0 of The AWK Manual, for the new implementation of AWK (sometimes called nawk).
Notice: This work is derived from the original gawk manual. Adaptions for NAWK made by Piet van Oostrum, Dec. 1995, July 1998.
Permission is granted to make and distribute verbatim copies of this manual provided the copyright notice and this permission notice are preserved on all copies. Permission is granted to copy and distribute modified versions of this manual under the conditions for verbatim copying, provided that the entire resulting derived work is distributed under the terms of a permission notice identical to this one. Permission is granted to copy and distribute translations of this manual into another language, under the above conditions for modified versions, except that this permission notice may be stated in a translation approved by the Foundation.
Preface
If you are like many computer users, you would frequently like to make changes in various text files wherever certain patterns appear, or extract data from parts of certain lines while discarding the rest. To write a program to do this in a language such as C or Pascal is a time-consuming inconvenience that may take many lines of code. The job may be easier with awk. The awk utility interprets a special-purpose programming language that makes it possible to handle simple data-reformatting jobs easily with just a few lines of code.

This manual teaches you what awk does and how you can use awk effectively. You should already be familiar with basic system commands such as ls. Using awk you can: manage small, personal databases; generate reports; validate data; produce indexes, and perform other document preparation tasks; and even experiment with algorithms that can be adapted later to other computer languages.

This manual has the difficult task of being both tutorial and reference. If you are a novice, feel free to skip over details that seem too complex. You should also ignore the many cross references; they are for the expert user, and for the on-line Info version of the manual.
History of awk
The name awk comes from the initials of its designers: Alfred V. Aho, Peter J. Weinberger, and Brian W. Kernighan. The original version of awk was written in 1977. In 1985 a new version made the programming language more powerful, introducing user-defined functions, multiple input streams, and computed regular expressions. This new version became generally available with System V Release 3.1. The version in System V Release 4 added some new features and also cleaned up the behavior in some of the dark corners of the language. The specification for awk in the POSIX Command Language and Utilities standard further clarified the language.

We need to thank many people for their assistance in producing this manual. Jay Fenlason contributed many ideas and sample programs. Richard Mlynarik and Robert J. Chassell gave helpful comments on early drafts of this manual. The paper A Supplemental Document for awk by John W. Pierce of the Chemistry Department at UC San Diego pinpointed several issues relevant both to awk implementation and to this manual that would otherwise have escaped us. David Trueman, Pat Rankin, and Michal Jaegermann also contributed sections of the manual.

The following people provided many helpful comments on this edition of the manual: Rick Adams, Michael Brennan, Rich Burridge, Diane Close, Christopher (Topher) Eliot, Michael Lijewski, Pat Rankin, Miriam Robbins, and Michal Jaegermann. Robert J. Chassell provided much valuable advice on the use of Texinfo.
Preamble
The licenses for most software are designed to take away your freedom to share and change it. By contrast, the GNU General Public License is intended to guarantee your freedom to share and change free software--to make sure the software is free for all its users. This General Public License applies to most of the Free Software Foundation's software and to any other program whose authors commit to using it. (Some other Free Software Foundation software is covered by the GNU Library General Public License instead.) You can apply it to your programs, too. When we speak of free software, we are referring to freedom, not price. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for this service if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs; and that you know you can do these things. To protect your rights, we need to make restrictions that forbid anyone to deny you these rights or to ask you to surrender the rights. These restrictions translate to certain responsibilities for you if you distribute copies of the software, or if you modify it. For example, if you distribute copies of such a program, whether gratis or for a fee, you must give the recipients all the rights that you have. You must make sure that they, too, receive or can get the source code. And you must show them these terms so they know their rights. We protect your rights with two steps: (1) copyright the software, and (2) offer you this license which gives you legal permission to copy, distribute and/or modify the software. Also, for each author's protection and ours, we want to make certain that everyone understands that there is no warranty for this free software.
If the software is modified by someone else and passed on, we want its recipients to know that what they have is not the original, so that any problems introduced by others will not reflect on the original authors' reputations. Finally, any free program is threatened constantly by software patents. We wish to avoid the danger that redistributors of a free program will individually obtain patent licenses, in effect making the program proprietary. To prevent this, we have made it clear that any patent must be licensed for everyone's free use or not licensed at all. The precise terms and conditions for copying, distribution and modification follow.
4. You may copy and distribute the Program (or a work based on it, under Section 2) in object code or executable form under the terms of Sections 1 and 2 above provided that you also do one of the following: a. Accompany it with the complete corresponding machine-readable source code, which must be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange; or, b. Accompany it with a written offer, valid for at least three years, to give any third party, for a charge no more than your cost of physically performing source distribution, a complete machine-readable copy of the corresponding source code, to be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange; or, c. Accompany it with the information you received as to the offer to distribute corresponding source code. (This alternative is allowed only for noncommercial distribution and only if you received the program in object code or executable form with such an offer, in accord with Subsection b above.) The source code for a work means the preferred form of the work for making modifications to it. For an executable work, complete source code means all the source code for all modules it contains, plus any associated interface definition files, plus the scripts used to control compilation and installation of the executable. However, as a special exception, the source code distributed need not include anything that is normally distributed (in either source or binary form) with the major components (compiler, kernel, and so on) of the operating system on which the executable runs, unless that component itself accompanies the executable.
If distribution of executable or object code is made by offering access to copy from a designated place, then offering equivalent access to copy the source code from the same place counts as distribution of the source code, even though third parties are not compelled to copy the source along with the object code. 5. You may not copy, modify, sublicense, or distribute the Program except as expressly provided under this License. Any attempt otherwise to copy, modify, sublicense or distribute the Program is void, and will automatically terminate your rights under this License. However, parties who have received copies, or rights, from you under this License will not have their licenses terminated so long as such parties remain in full compliance. 6. You are not required to accept this License, since you have not signed it. However, nothing else grants you permission to modify or distribute the Program or its derivative works. These actions are prohibited by law if you do not accept this License. Therefore, by modifying or distributing the Program (or any work based on the Program), you indicate your acceptance of this License to do so, and all its terms and conditions for copying, distributing or modifying the Program or works based on it. 7. Each time you redistribute the Program (or any work based on the Program), the recipient automatically receives a license from the original licensor to copy, distribute or modify the Program subject to these terms and conditions. You may not impose any further restrictions on the recipients' exercise of the rights granted herein. You are not responsible for enforcing compliance by third parties to this License. 8. If, as a consequence of a court judgment or allegation of patent infringement or for any other reason (not limited to patent issues), conditions are imposed on you (whether by court order, agreement or otherwise) that contradict the conditions of this License, they do not excuse you from the conditions of this License.
If you cannot distribute so as to satisfy simultaneously your obligations under this License and any other pertinent obligations, then as a consequence you may not distribute the Program at all. For example, if a patent license would not permit royalty-free redistribution of the Program by all those who receive copies directly or indirectly through you, then the only way you could satisfy both it and this License would be to refrain entirely from distribution of the Program.
If any portion of this section is held invalid or unenforceable under any particular circumstance, the balance of the section is intended to apply and the section as a whole is intended to apply in other circumstances. It is not the purpose of this section to induce you to infringe any patents or other property right claims or to contest validity of any such claims; this section has the sole purpose of protecting the integrity of the free software distribution system, which is implemented by public license practices. Many people have made generous contributions to the wide range of software distributed through that system in reliance on consistent application of that system; it is up to the author/donor to decide if he or she is willing to distribute software through any other system and a licensee cannot impose that choice. This section is intended to make thoroughly clear what is believed to be a consequence of the rest of this License. 9. If the distribution and/or use of the Program is restricted in certain countries either by patents or by copyrighted interfaces, the original copyright holder who places the Program under this License may add an explicit geographical distribution limitation excluding those countries, so that distribution is permitted only in or among countries not thus excluded. In such case, this License incorporates the limitation as if written in the body of this License. 10. The Free Software Foundation may publish revised and/or new versions of the General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. Each version is given a distinguishing version number. If the Program specifies a version number of this License which applies to it and any later version, you have the option of following the terms and conditions either of that version or of any later version published by the Free Software Foundation.
If the Program does not specify a version number of this License, you may choose any version ever published by the Free Software Foundation. 11. If you wish to incorporate parts of the Program into other free programs whose distribution conditions are different, write to the author to ask for permission. For software which is copyrighted by the Free Software Foundation, write to the Free Software Foundation; we sometimes make exceptions for this. Our decision will be guided by the two goals of preserving the free status of all derivatives of our free software and of promoting the sharing and reuse of software generally.
NO WARRANTY
12. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION. 13. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS),
EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
This General Public License does not permit incorporating your program into proprietary programs. If your program is a subroutine library, you may consider it more useful to permit linking proprietary applications with the library. If this is what you want to do, use the GNU Library General Public License instead of this License.
    sabafoo      555-2127     1200/300
The second data file, called inventory-shipped, represents information about shipments during the year. Each record contains the month of the year, the number of green crates shipped, the number of red boxes shipped, the number of orange bags shipped, and the number of blue packages shipped, respectively. There are 16 entries, covering the 12 months of one year and 4 months of the next year.

    Jan  13  25  15 115
    Feb  15  32  24 226
    Mar  15  24  34 228
    Apr  31  52  63 420
    May  16  34  29 208
    Jun  31  42  75 492
    Jul  24  34  67 436
    Aug  15  34  47 316
    Sep  13  55  37 277
    Oct  29  54  68 525
    Nov  20  87  82 577
    Dec  17  35  61 401
    Jan  21  36  64 620
    Feb  26  58  80 652
    Mar  24  75  70 495
    Apr  21  70  74 514
In an awk rule, either the pattern or the action can be omitted, but not both. If the pattern is omitted, then the action is performed for every input line. If the action is omitted, the default action is to print all lines that match the pattern.
Thus, we could leave out the action (the print statement and the curly braces) in the above example, and the result would be the same: all lines matching the pattern foo would be printed. By comparison, omitting the print statement but retaining the curly braces makes an empty action that does nothing; then no lines would be printed.
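The difference between an omitted action and an empty action can be checked with a short sketch. The three-line sample input is made up for illustration; it is not the BBS-list file.

```shell
# Hypothetical three-line input; only the first two lines contain "foo".
sample='fooey 555-1234
macfoo 555-6480
barfly 555-7685'

# Action omitted: every matching line is printed.
printf '%s\n' "$sample" | awk '/foo/'

# Explicit print action: same result as the rule above.
printf '%s\n' "$sample" | awk '/foo/ { print $0 }'

# Empty action: the rule still matches, but nothing is printed.
printf '%s\n' "$sample" | awk '/foo/ { }'
```

The first two commands each print the fooey and macfoo lines; the third prints nothing.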
contains two rules. The first rule has the string 12 as the pattern and print $0 as the action. The second rule has the string 21 as the pattern and also has print $0 as the action. Each rule's action is enclosed in its own pair of braces.

This awk program prints every line that contains the string 12 or the string 21. If a line contains both strings, it is printed twice, once by each rule. If we run this program on our two sample data files, BBS-list and inventory-shipped, as shown here:

    awk '/12/ { print $0 }
         /21/ { print $0 }' BBS-list inventory-shipped

we get the following output:

    aardvark     555-5553     1200/300          B
    alpo-net     555-3412     2400/1200/300     A
    barfly       555-7685     1200/300          A
    bites        555-1675     2400/1200/300     A
    core         555-2912     1200/300          C
    fooey        555-1234     2400/1200/300     B
    foot         555-6699     1200/300          B
    macfoo       555-6480     1200/300          A
    sdace        555-3430     2400/1200/300     A
    sabafoo      555-2127     1200/300          C
    sabafoo      555-2127     1200/300          C
    Jan  21  36  64 620
    Apr  21  70  74 514
Note how the line in BBS-list beginning with sabafoo was printed twice, once for each rule.
The first field contains read-write permissions, the second field contains the number of links to the file, and the third field identifies the owner of the file. The fourth field contains the size of the file in bytes. The fifth, sixth, and seventh fields contain the month, day, and time, respectively, that the file was last modified. Finally, the eighth field contains the name of the file.

The $5 == "Nov" in our awk program is an expression that tests whether the fifth field of the output from ls -l matches the string Nov. Each time a line has the string Nov in its fifth field, the action { sum += $4 } is performed. This adds the fourth field (the file size) to the variable sum. As a result, when awk has finished reading all the input lines, sum is the sum of the sizes of files whose lines matched the pattern. (This works because awk variables are automatically initialized to zero.)

After the last line of output from ls has been processed, the END rule is executed, and the value of sum is printed. In this example, the value of sum would be 80600.

These more advanced awk techniques are covered in later sections (see Chapter 7 [Overview of Actions], page 55). Before you can move on to more advanced awk programming, you have to know how awk interprets your input and displays your output. By manipulating fields and using print statements, you can produce some very useful and spectacular looking reports.
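The summing rule described above (the pattern $5 == "Nov", the action { sum += $4 }, and an END rule) can be tried without depending on a real directory. This is a minimal sketch: the three listing lines, including the owner, sizes, and file names, are invented, and they only follow the eight-field layout the text describes.

```shell
# Invented listing in the eight-field layout described above:
# permissions, links, owner, size, month, day, time, name.
listing='-rw-r--r-- 1 pat 1200 Nov 7 13:05 notes.txt
-rw-r--r-- 1 pat 300 Nov 9 09:12 todo.txt
-rw-r--r-- 1 pat 5000 Apr 13 12:14 draft.txt'

# Add the size field ($4) of each line whose month field ($5) is "Nov";
# sum starts out as zero, and the END rule prints it after all input is read.
printf '%s\n' "$listing" | awk '$5 == "Nov" { sum += $4 } END { print sum }'
```

With this input the command prints 1500, the total of the two November sizes.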
    Kathy
    Ben
    Tom
    Beth
    Seth
    Karen
    Thomas
    Control-d

then awk prints this output:

    Kathy
    Beth
    Seth

as matching the pattern th. Notice that it did not recognize Thomas as matching the pattern. The awk language is case sensitive, and matches patterns exactly.
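The same experiment can be run non-interactively by piping the names into awk. The second command is an addition for illustration: a bracket expression is one way to accept either case of the first letter.

```shell
# The seven names from the text, supplied through a pipe instead of the keyboard.
printf '%s\n' Kathy Ben Tom Beth Seth Karen Thomas | awk '/th/'

# Accept "Th" as well as "th" by listing both letters in a bracket expression.
printf '%s\n' Kathy Ben Tom Beth Seth Karen Thomas | awk '/[Tt]h/'
```

The first command prints Kathy, Beth, and Seth; the second also prints Thomas.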
The #! mechanism works on Unix systems derived from Berkeley Unix, System V Release 4, and some System V Release 3 systems. The line beginning with #! lists the full pathname of an interpreter to be run, and an optional initial command line argument to pass to that interpreter. The operating system then runs the interpreter with the given argument and the full argument list of the executed program. The first argument in the list is the full pathname of the awk program. The rest of the argument list will either be options to awk, or data files, or both.
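As a rough sketch of the mechanism just described, the commands below build a small executable awk script and run it. Several details are assumptions made for the example: the scratch directory from mktemp, the use of command -v to find where awk is installed (the path differs between systems), and the sample names fed to the script.

```shell
# Work in a scratch directory so no existing file is overwritten.
dir=$(mktemp -d)

# Write the script. The interpreter path is looked up rather than
# hard-coded, because awk does not live in the same place everywhere.
{
  printf '#!%s -f\n' "$(command -v awk)"
  printf '%s\n' '# print every record that contains "th"' '/th/'
} > "$dir/th-prog"
chmod +x "$dir/th-prog"

# Run it like any other command. The system starts awk with "-f" and
# the pathname of th-prog, followed by any further arguments (none here).
printf '%s\n' Kathy Ben Beth | "$dir/th-prog"

rm -r "$dir"
```

On a system that supports #!, this prints Kathy and Beth.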
Nearly all programming languages have provisions for comments, because programs are typically hard to understand without them. In the awk language, a comment starts with the sharp sign character, #, and continues to the end of the line. The awk language ignores the rest of a line following a sharp sign. For example, we could have put the following into th-prog:

    # This program finds records containing the pattern th.  This is how
    # you continue comments on additional lines.
    /th/

You can put comment lines into keyboard-composed throw-away awk programs also, but this usually isn't very useful; the purpose of a comment is to help you or another person understand the program at a later time.
But sometimes statements can be more than one line, and lines can contain several statements. You can split a statement into multiple lines by inserting a newline after any of the following: , { ? : || && do else
A newline at any other point is considered the end of the statement. (Splitting lines after ? and : is a minor gawk extension. The ? and : referred to here are the parts of the three-operand conditional expression described in Section 8.11 [Conditional Expressions], page 69.)

If you would like to split a single statement into two lines at a point where a newline would terminate it, you can continue it by ending the first line with a backslash character, \. This is allowed absolutely anywhere in the statement, even in the middle of a string or regular expression. For example:

    awk '/This program is too long, so continue it\
     on the next line/ { print $1 }'

We have generally not used backslash continuation in the sample programs in this manual. Since in awk there is no limit on the length of a line, it is never strictly necessary; it just makes programs prettier. We have preferred to make them even prettier by keeping the statements short. Backslash continuation is most useful when your awk program is in a separate source file, instead of typed in on the command line. You should also note that many awk implementations are more picky about where you may use backslash continuation. For maximal portability of your awk programs, it is best not to split your lines in the middle of a regular expression or a string.

Warning: backslash continuation does not work as described above with the C shell. Continuation with backslash works for awk programs in files, and also for one-shot programs provided you are using a POSIX-compliant shell, such as the Bourne shell or the Bourne-again shell. But the C shell used on Berkeley Unix behaves differently! There, you must use two backslashes in a row, followed by a newline.

When awk statements within one rule are short, you might want to put more than one of them on a line. You do this by separating the statements with a semicolon, ;. This also applies to the rules themselves. Thus, the previous program could have been written:

    /12/ { print $0 } ; /21/ { print $0 }

Note: the requirement that rules on the same line must be separated with a semicolon is a recent change in the awk language; it was done for consistency with the treatment of statements within an action.
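The layout rules above (a newline permitted after && and after a comma, semicolons between short statements, and a semicolon between rules on one line) can be seen together in a small sketch. The input numbers and the variable names product and count are invented for the example.

```shell
# Two rules on one line, separated by a semicolon.
printf '%s\n' 12 21 33 |
awk '/12/ { print "first:", $0 } ; /21/ { print "second:", $0 }'

# One condition split after "&&", one print statement split after ",",
# and two short statements sharing a line with ";" between them.
printf '%s\n' '3 4' '5 0' | awk '{
    if ($1 > 0 &&
        $2 > 0) {
        product = $1 * $2; count++
        print $1, $2,
              product
    }
}
END { print count }'
```

The first command prints "first: 12" and "second: 21". The second prints "3 4 12" for the only line whose two fields are both positive, then the count 1.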
This sets RS to / before processing BBS-list. Reaching the end of an input file terminates the current input record, even if the last character in the file is not the character in RS.

The empty string, "" (a string of no characters), has a special meaning as the value of RS: it means that records are separated only by blank lines. See Section 3.6 [Multiple-Line Records], page 29, for more details.

The awk utility keeps track of the number of records that have been read so far from the current input file. This value is stored in a built-in variable called FNR. It is reset to zero when a new file is started. Another built-in variable, NR, is the total number of input records read so far from all files. It starts at zero but is never automatically reset to zero.

If you change the value of RS in the middle of an awk run, the new value is used to delimit subsequent records, but the record currently being processed (and records already processed) are not affected.
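A short sketch of the record-related variables described above. The input strings and the two scratch files (named one and two here) are made up for the example.

```shell
# Records separated by "/" instead of newline. The input has no trailing
# separator: reaching the end of input still terminates the last record.
printf 'aardvark/alpo-net/barfly' |
awk 'BEGIN { RS = "/" } { print NR ": " $0 }'

# FNR starts over for each input file, while NR keeps counting.
dir=$(mktemp -d)
printf '%s\n' x y > "$dir/one"
printf '%s\n' z   > "$dir/two"
awk '{ print FNR, NR, $0 }' "$dir/one" "$dir/two"
rm -r "$dir"
```

The first command numbers three records. The second prints "1 1 x", "2 2 y", and "1 3 z", showing FNR returning to 1 for the second file while NR continues to 3.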
$0, which looks like an attempt to refer to the zeroth field, is a special case: it represents the whole input record. This is what you would use if you weren't interested in fields. Here are some more examples:

    awk '$1 ~ /foo/ { print $0 }' BBS-list

This example prints each record in the file BBS-list whose first field contains the string foo. The operator ~ is called a matching operator (see Section 8.5 [Comparison Expressions], page 62); it tests whether a string (here, the field $1) matches a given regular expression. By contrast, the following example:

    awk '/foo/ { print $1, $NF }' BBS-list

looks for foo in the entire record and prints the first field and the last field for each input record containing a match.
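To see the two kinds of match side by side without the BBS-list file, here is a small sketch on two invented records: foo occurs in the first field of one record and only in the last field of the other.

```shell
# Two made-up records.
data='fooey 555-1234 B
barfly 555-7685 foo'

# Match against the first field only: one record qualifies.
printf '%s\n' "$data" | awk '$1 ~ /foo/ { print $0 }'

# Match against the whole record, then print the first and last fields.
printf '%s\n' "$data" | awk '/foo/ { print $1, $NF }'
```

The first command prints only the fooey record; the second prints "fooey B" and "barfly foo".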
        print "everything is normal"

should print everything is normal, because NF+1 is certain to be out of range. (See Section 9.1 [The if Statement], page 73, for more information about awk's if-else statements.)

It is important to note that assigning to a field will change the value of $0, but will not change the value of NF, even when you assign the null string to a field. For example:

    echo a b c d | awk '{ OFS = ":"; $2 = "" ; print ; print NF }'

prints

    a::c:d
    4

The field is still there; it just has an empty value. You can tell because there are two colons in a row.
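Both points above can be checked in one short sketch: reading a field beyond NF yields the empty string without changing NF, and assigning to an existing field rebuilds $0 with OFS while NF stays the same. The message text "unexpected" and the replacement value X are chosen only for this example.

```shell
# A field beyond NF reads as the empty string; reading it leaves NF alone.
echo a b c d | awk '{
    if ($(NF+1) != "")
        print "unexpected"
    else
        print "everything is normal"
    print NF
}'

# Assigning a non-empty value to an existing field rebuilds $0 using OFS.
echo a b c d | awk '{ OFS = ":"; $2 = "X"; print; print NF }'
```

The first command prints "everything is normal" and 4; the second prints "a:X:c:d" and 4.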
this awk program extracts the string "29 Oak St.". Sometimes your input data will contain separator characters that don't separate fields the way you thought they would. For instance, the person's name in the example we've been using might have a title or suffix attached, such as John Q. Smith, LXIX. From input containing such a name:

    John Q. Smith, LXIX, 29 Oak St., Walamazoo, MI 42139

the previous sample program would extract "LXIX", instead of "29 Oak St.". If you were expecting the program to print the address, you would be surprised. So choose your data layout and separator characters carefully to prevent such problems.

As you know, by default, fields are separated by whitespace sequences (spaces and tabs), not by single spaces: two spaces in a row do not delimit an empty field. The default value of the field separator is a string " " containing a single space. If this value were interpreted in the usual way, each space character would separate fields, so two spaces in a row would make an empty field between them. The reason this does not happen is that a single space as the value of FS is a special case: it is taken to specify the default manner of delimiting fields.

If FS is any other single character, such as ",", then each occurrence of that character separates two fields. Two consecutive occurrences delimit an empty field. If the character occurs at the beginning or the end of the line, that too delimits an empty field. The space character is the only single character which does not follow these rules.

More generally, the value of FS may be a string containing any regular expression. Then each match in the record for the regular expression separates fields. For example, the assignment:

    FS = ", \t"

makes every area of an input line that consists of a comma followed by a space and a tab into a field separator. (\t stands for a tab.)

For a less trivial example of a regular expression, suppose you want single spaces to separate fields the way single commas were used above. You can set FS to "[ ]". This regular expression matches a single space and nothing else.

FS can be set on the command line. You use the -F argument to do so. For example:

    awk -F, 'program' input-files

sets FS to be the , character. Notice that the argument uses a capital F. Contrast this with -f, which specifies a file containing an awk program. Case is significant in command options: the -F and -f options have nothing to do with each other. You can use both options at the same time to set the FS argument and get an awk program from a file.

The value used for the argument to -F is processed in exactly the same way as assignments to the built-in variable FS. This means that if the field separator contains special characters, they must be escaped appropriately. For example, to use a \ as the field separator, you would have to type:
    # same as FS = "\\"
    awk -F\\\\ '...' files ...

Since \ is used for quoting in the shell, awk will see -F\\. Then awk processes the \\ for escape characters (see Section 8.1 [Constant Expressions], page 57), finally yielding a single \ to be used for the field separator.

As a special case, in compatibility mode (see Chapter 14 [Invoking awk], page 105), if the argument to -F is t, then FS is set to the tab character. (This is because if you type -F\t, without the quotes, at the shell, the \ gets deleted, so awk figures that you really want your fields to be separated with tabs, and not t's. Use -v FS="t" on the command line if you really do want to separate your fields with t's.)

For example, let's use an awk program file called baud.awk that contains the pattern /300/, and the action print $1. Here is the program:

    /300/ { print $1 }
Let's also set FS to be the - character, and run the program on the file BBS-list. The following command prints a list of the names of the bulletin boards that operate at 300 baud and the first three digits of their phone numbers:

    awk -F- -f baud.awk BBS-list

It produces this output:

    aardvark     555
    alpo
    barfly       555
    bites        555
    camelot      555
    core         555
    fooey        555
    foot         555
    macfoo       555
    sdace        555
    sabafoo      555

Note the second line of output. If you check the original file, you will see that the second line looked like this:

    alpo-net     555-3412     2400/1200/300     A

The - as part of the system's name was used as the field separator, instead of the - in the phone number that was originally intended. This demonstrates why you have to be careful in choosing your field and record separators. The following program searches the system password file, and prints the entries for users who have no password:
    awk -F: '$2 == ""' /etc/passwd

Here we use the -F option on the command line to set the field separator. Note that fields in /etc/passwd are separated by colons. The second field represents a user's encrypted password, but if the field is empty, that user has no password.

According to the POSIX standard, awk is supposed to behave as if each record is split into fields at the time that it is read. In particular, this means that you can change the value of FS after a record is read, but before any of the fields are referenced. The value of the fields (i.e. how they were split) should reflect the old value of FS, not the new one. However, many implementations of awk do not do this. Instead, they defer splitting the fields until a field reference actually happens, using the current value of FS! This behavior can be difficult to diagnose. The following example illustrates the results of the two methods. (The sed command prints just the first line of /etc/passwd.)

    sed 1q /etc/passwd | awk '{ FS = ":" ; print $1 }'

will usually print

    root

on an incorrect implementation of awk, while gawk will print something like

    root:nSijPlPhZZwgE:0:0:Root:/:

There is an important difference between the two cases of FS = " " (a single blank) and FS = "[ \t]+" (which is a regular expression matching one or more blanks or tabs). For both values of FS, fields are separated by runs of blanks and/or tabs. However, when the value of FS is " ", awk will strip leading and trailing whitespace from the record, and then decide where the fields are. For example, the following expression, whose input begins with a blank, prints b:

    echo ' a b c d' | awk '{ print $2 }'

However, the following prints a:

    echo ' a b c d' | awk 'BEGIN { FS = "[ \t]+" } ; { print $2 }'

In this case, the first field is null. The stripping of leading and trailing whitespace also comes into play whenever $0 is recomputed. For instance, this pipeline

    echo '   a b c d' | awk '{ print; $2 = $2; print }'

produces this output:

       a b c d
    a b c d

The first print statement prints the record as it was read, with leading whitespace intact. The assignment to $2 rebuilds $0 by concatenating $1 through $NF together, separated by the value of OFS. Since the leading whitespace was ignored when finding $1, it is not part of the new $0. Finally, the last print statement prints the new $0.

The following table summarizes how fields are split, based on the value of FS.

FS == " "
    Fields are separated by runs of whitespace. Leading and trailing whitespace are ignored. This is the default.

FS == any single character
    Fields are separated by each occurrence of the character. Multiple successive occurrences delimit empty fields, as do leading and trailing occurrences.

FS == regexp
    Fields are separated by occurrences of characters that match regexp. Leading and trailing matches of regexp delimit empty fields.
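The three cases in the table can be compared by counting fields with NF. This is a minimal sketch on made-up one-line inputs; the -F option is used for the single-character case and a BEGIN rule for the regular expression.

```shell
# FS == " " (the default): runs of blanks separate fields and leading
# blanks are ignored, so this record has 3 fields.
printf ' a  b c\n' | awk '{ print NF }'

# FS == any single character: every comma separates two fields, so the
# leading comma and the doubled comma delimit empty fields (5 in all).
printf ',a,,b,c\n' | awk -F, '{ print NF }'

# FS == regexp: "[ ]" matches exactly one space, so the leading space and
# the doubled space each delimit an empty field (5 in all).
printf ' a  b c\n' | awk 'BEGIN { FS = "[ ]" } { print NF }'
```

The three commands print 3, 5, and 5. The first and third read the same input; only the value of FS differs.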
newline character to separate fields, since there is no way to prevent it. However, you can work around this by using the split function to break up the record manually (see Section 11.3 [Built-in Functions for String Manipulation], page 90).
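One way the split workaround might look is sketched below. It assumes records separated by blank lines (RS set to the empty string) and splits $0 on the newline character by hand, so each line of a record becomes one array element with its embedded spaces untouched. The two address records and the array name line are invented for the example.

```shell
# With RS = "" each block of lines up to a blank line is one record.
# split() on "\n" yields one array element per line of the record,
# regardless of how awk divided the record into fields.
printf 'Jane Doe\n123 Main Street\n\nJohn Smith\n29 Oak St.\n' |
awk 'BEGIN { RS = "" }
     { split($0, line, "\n"); print line[1] " | " line[2] }'
```

Each record is printed on one line, with the name and the street joined by " | ".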
This form of the getline command sets NF (the number of fields; see Section 3.2 [Examining Fields], page 22), NR (the number of records read so far; see Section 3.1 [How Input is Split into Records], page 21), FNR (the number of records read from this input file), and the value of $0.

Note: the new value of $0 is used in testing the patterns of any subsequent rules. The original value of $0 that triggered the rule which executed getline is lost. By contrast, the next statement reads a new record but immediately begins processing it normally, starting with the first rule in the program. See Section 9.7 [The next Statement], page 78.

getline var

This form of getline reads a record into the variable var. This is useful when you want your program to read the next record from the current input file, but you don't want to subject the record to the normal input processing. For example, suppose the next line is a comment, or a special string, and you want to read it, but you must make certain that it won't trigger any rules. This version of getline allows you to read that line and store it in a variable so that the main read-a-line-and-check-each-rule loop of awk never sees it.

The following example swaps every two lines of input. For example, given:

    wan
    tew
    free
    phore

it outputs:

    tew
    wan
    phore
    free

Here's the program:

    awk '{
         if ((getline tmp) > 0) {
              print tmp
              print $0
         } else
              print $0
    }'

The getline function used in this way sets only the variables NR and FNR (and of course, var). The record is not split into fields, so the values of the fields (including $0) and the value of NF do not change.

getline < file

This form of the getline function takes its input from the file file. Here file is a string-valued expression that specifies the file name. < file is called a redirection since it directs input to come from a different place. This form is useful if you want to read your input from a particular file, instead of from the main input stream.
For example, the following program reads its input record from the file foo.input when it encounters a first field with a value equal to 10 in the current input file.

    awk '{
      if ($1 == 10) {
        getline < "foo.input"
        print
      } else
        print
    }'

Since the main input stream is not used, the values of NR and FNR are not changed. But the record read is split into fields in the normal manner, so the values of $0 and other fields are changed. So is the value of NF.

This does not cause the record to be tested against all the patterns in the awk program, in the way that would happen if the record were read normally by the main processing loop of awk. However, the new record is tested against any subsequent rules, just as when getline is used without a redirection.

getline var < file
This form of the getline function takes its input from the file file and puts it in the variable var. As above, file is a string-valued expression that specifies the file from which to read. In this version of getline, none of the built-in variables are changed, and the record is not split into fields. The only variable changed is var.

For example, the following program copies all the input files to the output, except for records that say @include filename. Such a record is replaced by the contents of the file filename.

    awk '{
      if (NF == 2 && $1 == "@include") {
        while ((getline line < $2) > 0)
          print line
        close($2)
      } else
        print
    }'

Note here how the name of the extra input file is not built into the program; it is taken from the data, from the second field on the @include line.

The close function is called to ensure that if two identical @include lines appear in the input, the entire specified file is included twice. See Section 3.8 [Closing Input Files and Pipes], page 33.

One deficiency of this program is that it does not process nested @include statements the way a true macro preprocessor would.

command | getline
You can pipe the output of a command into getline. A pipe is simply a way to link the output of one program to the input of another. In this case, the string command is run as a shell command and its output is piped into awk to be used as input. This form of getline reads one record from the pipe.
For example, the following program copies input to output, except for lines that begin with @execute, which are replaced by the output produced by running the rest of the line as a shell command:

    awk '{
      if ($1 == "@execute") {
        tmp = substr($0, 10)
        while ((tmp | getline) > 0)
          print
        close(tmp)
      } else
        print
    }'

The close function is called to ensure that if two identical @execute lines appear in the input, the command is run for each one. See Section 3.8 [Closing Input Files and Pipes], page 33.

Given the input:

    foo
    bar
    baz
    @execute who
    bletch

the program might produce:

    foo
    bar
    baz
    hack ttyv0 Jul 13 14:22
    hack ttyp0 Jul 13 14:23 (gnu:0)
    hack ttyp1 Jul 13 14:23 (gnu:0)
    hack ttyp2 Jul 13 14:23 (gnu:0)
    hack ttyp3 Jul 13 14:23 (gnu:0)
    bletch

Notice that this program ran the command who and printed the result. (If you try this program yourself, you will get different results, showing you who is logged in on your system.)

This variation of getline splits the record into fields, sets the value of NF and recomputes the value of $0. The values of NR and FNR are not changed.

command | getline var
The output of the command command is sent through a pipe to getline and into the variable var. For example, the following program reads the current date and time into the variable current_time, using the date utility, and then prints it.

    awk 'BEGIN {
      "date" | getline current_time
      close("date")
      print "Report printed on " current_time
    }'

In this version of getline, none of the built-in variables are changed, and the record is not split into fields.
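The examples above all compare the value of getline with zero. As a small sketch of why: getline returns 1 when it reads a record, 0 at end of file, and -1 on an error such as an unreadable file. The function name first_line and the variable names are made up for this illustration.

```shell
# Hypothetical helper: report the first line of the file named by $1.
first_line() {
    awk -v file="$1" 'BEGIN {
        status = (getline line < file)   # 1 = record read, 0 = end of file, -1 = error
        if (status > 0)
            print "first line: " line
        else if (status == 0)
            print "empty file"
        else
            print "cannot read file"
        close(file)
    }'
}

first_line /etc/passwd
```

Testing for "> 0" rather than for a non-zero value matters in loops: a loop written as while (getline line < file) would never end if the file could not be opened, because -1 counts as true.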
close(filename) or close(command)

The argument filename or command can be any expression. Its value must exactly equal the string that was used to open the file or start the command; for example, if you open a pipe with this:

    "sort -r names" | getline foo

then you must close it with this:

    close("sort -r names")

Once this function call is executed, the next getline from that file or command will reopen the file or rerun the command. close returns a value of zero if the close succeeded. Otherwise, the value will be non-zero.
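Here is a minimal sketch of the reopening behavior just described. The function name reread_demo and the variables a, b and c are invented for the example; the file to read is passed as the first argument.

```shell
# Hypothetical helper: show that close makes the next getline start over.
reread_demo() {
    awk -v file="$1" 'BEGIN {
        getline a < file      # reads record 1
        getline b < file      # file is still open: reads record 2
        close(file)           # the next getline reopens the file
        getline c < file      # reads record 1 again
        print a, b, c
    }'
}

tmp=$(mktemp)
printf 'one\ntwo\nthree\n' > "$tmp"
reread_demo "$tmp"            # prints: one two one
rm -f "$tmp"
```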
4 Printing Output
One of the most common things that actions do is to output or print some or all of the input. For simple output, use the print statement. For fancier formatting use the printf statement. Both are described in this chapter.
    line two
    line three

Here is an example that prints the first two fields of each input record, with a space between them:

    awk '{ print $1, $2 }' inventory-shipped

Its output looks like this:

    Jan 13
    Feb 15
    Mar 15
    ...

A common mistake in using the print statement is to omit the comma between two items. This often has the effect of making the items run together in the output, with no space. The reason for this is that juxtaposing two string expressions in awk means to concatenate them. For example, without the comma:

    awk '{ print $1 $2 }' inventory-shipped

prints:

    Jan13
    Feb15
    Mar15
    ...

Neither example's output makes much sense to someone unfamiliar with the file inventory-shipped. A heading line at the beginning would make it clearer. Let's add some headings to our table of months ($1) and green crates shipped ($2). We do this using the BEGIN pattern (see Section 6.7 [BEGIN and END Special Patterns], page 53) to force the headings to be printed only once:

    awk 'BEGIN { print "Month Crates"
                 print "----- ------" }
               { print $1, $2 }' inventory-shipped

Did you already guess what happens? This program prints the following:

    Month Crates
    ----- ------
    Jan 13
    Feb 15
    Mar 15
    ...

The headings and the table data don't line up! We can fix this by printing some spaces between the two fields:

    awk 'BEGIN { print "Month Crates"
                 print "----- ------" }
               { print $1, "     ", $2 }' inventory-shipped

You can imagine that this way of lining up columns can get pretty complicated when you have many columns to fix. Counting spaces for two or three columns can be simple, but more than this and you can get lost quite easily. This is why the printf statement was created (see Section 4.5 [Using printf Statements for Fancier Printing], page 38); one of its specialties is lining up columns of data.
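ications as the value of OFMT, you can change how print will print your numbers. As a brief example:

    awk 'BEGIN { OFMT = "%d"  # print numbers as integers
                 print 17.23 }'

will print 17. (POSIX leaves the result unspecified when OFMT is not a floating-point format specification, so "%.0f" is the more portable way to get the same effect.)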
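As a further sketch of how OFMT affects print, the following uses a floating-point format, which is the portable kind of value for OFMT. The function name ofmt_demo is made up for this example.

```shell
# Hypothetical helper: print two numbers under a two-decimal OFMT.
ofmt_demo() {
    awk 'BEGIN {
        OFMT = "%.2f"     # two digits after the decimal point
        print 17.23456    # converted with OFMT: prints 17.23
        print 17          # integral values are printed as integers: prints 17
    }'
}

ofmt_demo
```

OFMT is only consulted when print has to convert a number to a string; numbers with integral values are printed as integers whatever OFMT says.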
Here is a list of the format-control letters:

c    This prints a number as an ASCII character. Thus, printf "%c", 65 outputs the letter A. The output for a string value is the first character of the string.

d    This prints a decimal integer.

i    This also prints a decimal integer.

e    This prints a number in scientific (exponential) notation. For example, printf "%4.3e", 1950 prints 1.950e+03, with a total of four significant figures of which three follow the decimal point. The 4.3 are modifiers, discussed below.

f    This prints a number in floating point notation.

g    This prints a number in either scientific notation or floating point notation, whichever uses fewer characters.

o    This prints an unsigned octal integer.

s    This prints a string.

x    This prints an unsigned hexadecimal integer.

X    This prints an unsigned hexadecimal integer. However, for the values 10 through 15, it uses the letters A through F instead of a through f.

%    This isn't really a format-control letter, but it does have a meaning when used after a %: the sequence %% outputs one %. It does not consume an argument.
width
.prec    This is a number that specifies the precision to use when printing. This specifies the number of digits you want printed to the right of the decimal point. For a string, it specifies the maximum number of characters from the string that should be printed.
The C library printf's dynamic width and prec capability (for example, "%*.*s") is supported. Instead of supplying explicit width and/or prec values in the format string, you pass them in the argument list. For example:

    w = 5
    p = 3
    s = "abcdefg"
    printf "<%*.*s>\n", w, p, s

is exactly equivalent to

    s = "abcdefg"
    printf "<%5.3s>\n", s

Both programs output <••abc>. (We have used the bullet symbol • to represent a space, to clearly show you that there are two spaces in the output.)

Earlier versions of awk did not support this capability. You may simulate it by using concatenation to build up the format string, like so:

    w = 5
    p = 3
    s = "abcdefg"
    printf "<%" w "." p "s>\n", s

This is not particularly easy to read, however.
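The concatenation technique can be tried directly from the shell. This is only a sketch; the function name pad_demo is invented, and the second printf adds the - modifier (left-justification, as in the "%-10s" formats used elsewhere in this chapter) to show that other modifiers can be spliced in the same way.

```shell
# Hypothetical helper: build printf formats from variables by concatenation.
pad_demo() {
    awk 'BEGIN {
        w = 5; p = 3; s = "abcdefg"
        printf "<%" w "." p "s>\n", s     # format becomes "<%5.3s>\n"
        printf "<%-" w "." p "s>\n", s    # format becomes "<%-5.3s>\n"
    }'
}

pad_demo
```

The first line of output is <, two spaces, abc and >; the second has the two spaces after abc instead.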
    aardvark   555-5553
    alpo-net   555-3412
    barfly     555-7685
    bites      555-1675
    camelot    555-0542
    core       555-2912
    fooey      555-1234
    foot       555-6699
    macfoo     555-6480
    sdace      555-3430
    sabafoo    555-2127
Did you notice that we did not specify that the phone numbers be printed as numbers? They had to be printed as strings because the numbers are separated by a dash. This dash would be interpreted as a minus sign if we had tried to print the phone numbers as numbers. This would have led to some pretty confusing results.

We did not specify a width for the phone numbers because they are the last things on their lines. We don't need to put spaces after them.

We could make our table look even nicer by adding headings to the tops of the columns. To do this, use the BEGIN pattern (see Section 6.7 [BEGIN and END Special Patterns], page 53) to force the header to be printed only once, at the beginning of the awk program:

    awk 'BEGIN { print "Name       Number"
                 print "----       ------" }
         { printf "%-10s %s\n", $1, $2 }' BBS-list

Did you notice that we mixed print and printf statements in the above example? We could have used just printf statements to get the same results:

    awk 'BEGIN { printf "%-10s %s\n", "Name", "Number"
                 printf "%-10s %s\n", "----", "------" }
         { printf "%-10s %s\n", $1, $2 }' BBS-list

By outputting each column heading with the same format specification used for the elements of the column, we have made sure that the headings are aligned just like the columns.

The fact that the same format specification is used three times can be emphasized by storing it in a variable, like this:

    awk 'BEGIN { format = "%-10s %s\n"
                 printf format, "Name", "Number"
                 printf format, "----", "------" }
         { printf format, $1, $2 }' BBS-list

See if you can use the printf statement to line up the headings and table data for our inventory-shipped example covered earlier in the section on the print statement (see Section 4.1 [The print Statement], page 35).
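One possible answer to that exercise is sketched below. The function name inventory_table and the column widths (5 and 6) are choices made for this illustration, not something the manual prescribes; the sample data is piped in so the example runs without the inventory-shipped file.

```shell
# Hypothetical helper: print months and crate counts in aligned columns.
inventory_table() {
    awk 'BEGIN {
        format = "%-5s %6s\n"            # month left-justified, count right-justified
        printf format, "Month", "Crates"
        printf format, "-----", "------"
    }
    { printf format, $1, $2 }'
}

printf 'Jan 13\nFeb 15\nMar 15\n' | inventory_table
```

Right-justifying the second column with a plain width (no - modifier) keeps the digits of the counts lined up under the end of the heading.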
    close(report)

We call the close function here because it's a good idea to close the pipe as soon as all the intended output has been sent to it. See Section 4.6.2 [Closing Output Files and Pipes], page 43, for more information on this. This example also illustrates the use of a variable to represent a file or command: it is not necessary to always use a string constant. Using a variable is generally a good idea, since awk requires you to spell the string value identically every time.

Redirecting output using >, >>, or | asks the system to open a file or pipe only if the particular file or command you've specified has not already been written to by your program, or if it has been closed since it was last written to.
To run the same program a second time, with the same arguments. This is not the same thing as giving more input to the first run! For example, suppose you pipe output to the mail program. If you output several lines redirected to this pipe without closing it, they make a single message of several lines. By contrast, if you close the pipe after each line of output, then each line makes a separate message.

close returns a value of zero if the close succeeded. Otherwise, the value will be non-zero.
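The effect of closing an output pipe can be seen with sort in place of mail. In this sketch (the function name batch_sort is invented), the four letters are sorted as two separate batches because the pipe is closed in between.

```shell
# Hypothetical helper: each close ends one run of sort and the next print starts another.
batch_sort() {
    awk 'BEGIN {
        cmd = "sort"
        print "c" | cmd
        print "a" | cmd
        close(cmd)          # first sort finishes here and prints: a, c
        print "d" | cmd
        print "b" | cmd
        close(cmd)          # a second, separate sort prints: b, d
    }'
}

batch_sort
```

Without the first close, all four lines would go to a single sort and come out as a, b, c, d.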
5 Useful One-liners
Useful awk programs are often short, just a line or two. Here is a collection of useful, short programs to get you started. Some of these programs contain constructs that haven't been covered yet. The description of the program will give you a good idea of what is going on, but please read the rest of the manual to become an awk expert!

    awk '{ if (NF > max) max = NF }
         END { print max }'

This program prints the maximum number of fields on any input line.

    awk 'length($0) > 80'

This program prints every line longer than 80 characters. The sole rule has a relational expression as its pattern, and has no action (so the default action, printing the record, is used).

    awk 'NF > 0'

This program prints every line that has at least one field. This is an easy way to delete blank lines from a file (or rather, to create a new file similar to the old file but from which the blank lines have been deleted).

    awk '{ if (NF > 0) print }'

This program also prints every line that has at least one field. Here we allow the rule to match every line, then decide in the action whether to print.

    awk 'BEGIN { for (i = 1; i <= 7; i++)
                   print int(101 * rand()) }'

This program prints 7 random numbers from 0 to 100, inclusive.

    ls -l files | awk '{ x += $5 } ; END { print "total bytes: " x }'

This program prints the total number of bytes used by files. (The size is the fifth field of POSIX ls -l output; on older systems whose ls -l omits the group, use $4 instead.)

    expand file | awk '{ if (x < length()) x = length() }
                       END { print "maximum line length is " x }'

This program prints the maximum line length of file. The input is piped through the expand program to change tabs into spaces, so the widths compared are actually the right-margin columns.

    awk 'BEGIN { FS = ":" }
         { print $1 | "sort" }' /etc/passwd

This program prints a sorted list of the login names of all users.

    awk '{ nlines++ }
         END { print nlines }'

This program counts lines in a file.

    awk 'END { print NR }'

This program also counts lines in a file, but lets awk do the work.

    awk '{ print NR, $0 }'

This program adds line numbers to all its input files, similar to cat -n.
Chapter 6: Patterns
6 Patterns
Patterns in awk control the execution of rules: a rule is executed when its pattern matches the current input record. This chapter tells all about how to write patterns.
Regular expressions can also be used in comparison expressions. Then you can specify the string to match against; it need not be the entire current input record. These comparison expressions can be used as patterns or in if, while, for, and do statements.

exp ~ /regexp/
This is true if the expression exp (taken as a character string) is matched by regexp. The following example matches, or selects, all input records with the upper-case letter J somewhere in the first field:

    awk '$1 ~ /J/' inventory-shipped

So does this:

    awk '{ if ($1 ~ /J/) print }' inventory-shipped

exp !~ /regexp/
This is true if the expression exp (taken as a character string) is not matched by regexp. The following example matches, or selects, all input records whose first field does not contain the upper-case letter J:

    awk '$1 !~ /J/' inventory-shipped

The right hand side of a ~ or !~ operator need not be a constant regexp (i.e., a string of characters between slashes). It may be any expression. The expression is evaluated, and converted if necessary to a string; the contents of the string are used as the regexp. A regexp that is computed in this way is called a dynamic regexp. For example:

    identifier_regexp = "[A-Za-z_][A-Za-z_0-9]*"
    $0 ~ identifier_regexp

sets identifier_regexp to a regexp that describes awk variable names, and tests if the input record matches this regexp.
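The two kinds of regexp operand can be compared side by side in a short sketch. The function names j_months and identifiers are made up, and the dynamic regexp here is anchored with ^ and $ (an addition for this example) so that the whole record, not just part of it, must look like a variable name.

```shell
# Hypothetical helpers: a constant regexp and a dynamic regexp on the right of ~.
j_months() {
    awk '$1 ~ /J/'                      # constant regexp between slashes
}

identifiers() {
    awk 'BEGIN { identifier_regexp = "^[A-Za-z_][A-Za-z_0-9]*$" }
         $0 ~ identifier_regexp'        # string value used as a dynamic regexp
}

printf 'Jan 13\nFeb 15\nJun 31\n' | j_months       # prints the Jan and Jun records
printf 'foo_1\n9lives\nx\n' | identifiers          # prints foo_1 and x
```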
.P matches any single character followed by a P in a string. Using concatenation we can make regular expressions like U.A, which matches any three-character sequence that begins with U and ends with A.

[...]
This is called a character set. It matches any one of the characters that are enclosed in the square brackets. For example:

    [MVX]

matches any one of the characters M, V, or X in a string.

Ranges of characters are indicated by using a hyphen between the beginning and ending characters, and enclosing the whole thing in brackets. For example:

    [0-9]

matches any digit.

To include the character \, ], - or ^ in a character set, put a \ in front of it. For example:

    [d\]]

matches either d, or ].

This treatment of \ is compatible with other awk implementations, and is also mandated by the POSIX Command Language and Utilities standard. The regular expressions in awk are a superset of the POSIX specification for Extended Regular Expressions (EREs). POSIX EREs are based on the regular expressions accepted by the traditional egrep utility.

In egrep syntax, backslash is not syntactically special within square brackets. This means that special tricks have to be used to represent the characters ], - and ^ as members of a character set.

In egrep syntax, to match -, write it as ---, which is a range containing only -. You may also give - as the first or last character in the set. To match ^, put it anywhere except as the first character of a set. To match a ], make it the first character in the set. For example:

    []d^]

matches either ], d or ^.

[^ ...]
This is a complemented character set. The first character after the [ must be a ^. It matches any characters except those in the square brackets (or newline). For example:

    [^0-9]

matches any character that is not a digit.

|
This is the alternation operator and it is used to specify alternatives. For example:

    ^P|[0-9]

matches any string that matches either ^P or [0-9]. This means it matches any string that contains a digit or starts with P.

The alternation applies to the largest possible regexps on either side.

(...)
Parentheses are used for grouping in regular expressions as in arithmetic. They can be used to concatenate regular expressions containing the alternation operator, |.

*
This symbol means that the preceding regular expression is to be repeated as many times as possible to find a match. For example:

    ph*

applies the * symbol to the preceding h and looks for matches to one p followed by any number of h's. This will also match just p if no h's are present.
The * repeats the smallest possible preceding expression. (Use parentheses if you wish to repeat a larger expression.) It finds as many repetitions as possible. For example:

    awk '/\(c[ad][ad]*r x\)/ { print }' sample

prints every record in the input containing a string of the form (car x), (cdr x), (cadr x), and so on.

+
This symbol is similar to *, but the preceding expression must be matched at least once. This means that:

    wh+y

would match why and whhy but not wy, whereas wh*y would match all three of these strings. This is a simpler way of writing the last * example:

    awk '/\(c[ad]+r x\)/ { print }' sample

?
This symbol is similar to *, but the preceding expression can be matched once or not at all. For example:

    fe?d

will match fed and fd, but nothing else.

\
This is used to suppress the special meaning of a character when matching. For example:

    \$

matches the character $.

The escape sequences used for string constants (see Section 8.1 [Constant Expressions], page 57) are valid in regular expressions as well; they are also introduced by a \.

In regular expressions, the *, +, and ? operators have the highest precedence, followed by concatenation, and finally by |. As in arithmetic, parentheses can change how operators are grouped.
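The repetition operators and their precedence are easy to try out. In this sketch the helper match_lines (an invented name) prints the input lines matched by the regexp given as its argument; the patterns are anchored with ^ and $ so that a line is printed only when the whole line matches.

```shell
# Hypothetical helper: print the input lines that match the regexp in $1.
match_lines() {
    awk -v re="$1" '$0 ~ re'
}

printf 'wy\nwhy\nwhhy\n'  | match_lines '^wh+y$'    # why, whhy
printf 'fd\nfed\nfeed\n'  | match_lines '^fe?d$'    # fd, fed
printf 'ab\nabab\nabb\n'  | match_lines '^ab+$'     # ab, abb: + applies to b only
printf 'ab\nabab\nabb\n'  | match_lines '^(ab)+$'   # ab, abab: + applies to the group
```

The last two commands differ only in the parentheses, which shows that repetition binds more tightly than concatenation.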
converts the first field to lower case before matching against it.

    x = "aB"
    if (x ~ /ab/) . . .
The operands of a relational operator are compared as numbers if they are both numbers. Otherwise they are converted to, and compared as, strings (see Section 8.9 [Conversion of Strings and Numbers], page 67, for the detailed rules). Strings are compared by comparing the first character of each, then the second character of each, and so on, until there is a difference. If the two strings are equal until the shorter one runs out, the shorter one is considered to be less than the longer one. Thus, "10" is less than "9", and "abc" is less than "abcd".

The left operand of the ~ and !~ operators is a string. The right operand is either a constant regular expression enclosed in slashes (/regexp/), or any expression, whose string value is used as a dynamic regular expression (see Section 6.2.1 [How to Use Regular Expressions], page 47).

The following example prints the second field of each input record whose first field is precisely foo.

    awk '$1 == "foo" { print $2 }' BBS-list

Contrast this with the following regular expression match, which would accept any record with a first field that contains foo:

    awk '$1 ~ "foo" { print $2 }' BBS-list

or, equivalently, this one:

    awk '$1 ~ /foo/ { print $2 }' BBS-list
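The difference between the two kinds of test shows up on a small made-up input. The function names exact_foo and contains_foo and the sample records are assumptions for this sketch; the manual's own examples read BBS-list instead.

```shell
# Hypothetical helpers: exact string comparison versus regexp match on the first field.
exact_foo() {
    awk '$1 == "foo" { print $2 }'
}

contains_foo() {
    awk '$1 ~ /foo/ { print $2 }'
}

printf 'foo 1\nfoobar 2\nbar 3\n' | exact_foo       # prints 1
printf 'foo 1\nfoobar 2\nbar 3\n' | contains_foo    # prints 1 and 2
```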
For example, the following command prints all records in the input file BBS-list that contain both 2400 and foo.

    awk '/2400/ && /foo/' BBS-list

The following command prints all records in the input file BBS-list that contain either 2400 or foo, or both.

    awk '/2400/ || /foo/' BBS-list

The following command prints all records in the input file BBS-list that do not contain the string foo.

    awk '! /foo/' BBS-list

Note that boolean patterns are a special case of expression patterns (see Section 6.5 [Expressions as Patterns], page 52); they are expressions that use the boolean operators. See Section 8.6 [Boolean Expressions], page 64, for complete information on the boolean operators.

The subpatterns of a boolean pattern can be constant regular expressions, comparisons, or any other awk expressions. Range patterns are not expressions, so they cannot appear inside boolean patterns. Likewise, the special patterns BEGIN and END, which never match any input record, are not expressions and cannot appear inside boolean patterns.
Multiple BEGIN and END sections are useful for writing library functions, since each library can have its own BEGIN or END rule to do its own initialization and/or cleanup. Note that the order in which library functions are named on the command line controls the order in which their BEGIN and END rules are executed. Therefore you have to be careful to write such rules in library files so that the order in which they are executed doesn't matter. See Chapter 14 [Invoking awk], page 105, for more information on using library functions.

If an awk program only has a BEGIN rule, and no other rules, then the program exits after the BEGIN rule has been run. (Older versions of awk used to keep reading and ignoring input until end of file was seen.) However, if an END rule exists as well, then the input will be read, even if there are no other rules in the program. This is necessary in case the END rule checks the NR variable.

BEGIN and END rules must have actions; there is no default action for these rules since there is no current record when they run.
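A short sketch of these points: two BEGIN rules run in the order they appear, and the END rule forces the input to be read even though no rule mentions the records. The function name count_records is invented for the illustration.

```shell
# Hypothetical helper: two BEGIN rules and an END rule that checks NR.
count_records() {
    awk 'BEGIN { print "start" }
         BEGIN { print "second BEGIN rule" }
         END   { print "records read: " NR }'
}

printf 'x\ny\nz\n' | count_records
```

With the three-line input shown, the last line of output is "records read: 3".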
7 Overview of Actions
An awk program or script consists of a series of rules and function definitions, interspersed. (Functions are described later. See Chapter 12 [User-defined Functions], page 95.)

A rule contains a pattern and an action, either of which may be omitted. The purpose of the action is to tell awk what to do once a match for the pattern is found. Thus, the entire program looks somewhat like this:

    [pattern] [{ action }]
    [pattern] [{ action }]
    ...
    function name (args) { . . . }
    ...

An action consists of one or more awk statements, enclosed in curly braces ({ and }). Each statement specifies one thing to be done. The statements are separated by newlines or semicolons.

The curly braces around an action must be used even if the action contains only one statement, or even if it contains no statements at all. However, if you omit the action entirely, omit the curly braces as well. (An omitted action is equivalent to { print $0 }.)

Here are the kinds of statements supported in awk:

• Expressions, which can call functions or assign values to variables (see Chapter 8 [Expressions as Action Statements], page 57). Executing this kind of statement simply computes the value of the expression and then ignores it. This is useful when the expression has side effects (see Section 8.7 [Assignment Expressions], page 64).

• Control statements, which specify the control flow of awk programs. The awk language gives you C-like constructs (if, for, while, and so on) as well as a few special ones (see Chapter 9 [Control Statements in Actions], page 73).

• Compound statements, which consist of one or more statements enclosed in curly braces. A compound statement is used in order to put several statements together in the body of an if, while, do or for statement.

• Input control, using the getline command (see Section 3.7 [Explicit Input with getline], page 30), and the next statement (see Section 9.7 [The next Statement], page 78).

• Output statements, print and printf. See Chapter 4 [Printing Output], page 35.

• Deletion statements, for deleting array elements. See Section 10.6 [The delete Statement], page 85.

The next two chapters cover in detail expressions and control statements, respectively. We go on to treat arrays and built-in functions, both of which are used in expressions. Then we proceed to discuss how to define your own functions.
\\       Represents a literal backslash, \.

\a       Represents the alert character, control-g, ASCII code 7.

\b       Represents a backspace, control-h, ASCII code 8.

\f       Represents a formfeed, control-l, ASCII code 12.

\n       Represents a newline, control-j, ASCII code 10.

\r       Represents a carriage return, control-m, ASCII code 13.

\t       Represents a horizontal tab, control-i, ASCII code 9.

\v       Represents a vertical tab, control-k, ASCII code 11.

\nnn     Represents the octal value nnn, where nnn are one to three digits between 0 and 7. For example, the code for the ASCII ESC (escape) character is \033.

\xhh...  Represents the hexadecimal value hh, where hh are hexadecimal digits (0 through 9 and either A through F or a through f). Like the same construct in ANSI C, the escape sequence continues until the first non-hexadecimal digit is seen. However, using more than two hexadecimal digits produces undefined results. (The \x escape sequence is not allowed in POSIX awk.)
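A few of these escape sequences in action, as a minimal sketch (the function name escape_demo is made up). The \x form is left out because, as noted above, it is not part of POSIX awk.

```shell
# Hypothetical helper: string constants containing escape sequences.
escape_demo() {
    awk 'BEGIN {
        print "a\tb"          # a horizontal tab between a and b
        print "\101\102"      # octal 101 and 102 are the ASCII codes of A and B
        print "back\\slash"   # one literal backslash
    }'
}

escape_demo
```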
A constant regexp is a regular expression description enclosed in slashes, such as /^beginning and end$/. Most regexps used in awk programs are constant, but the ~ and !~ operators can also match computed or dynamic regexps (see Section 6.2.1 [How to Use Regular Expressions], page 47).

Constant regexps may be used like simple expressions. When a constant regexp is not on the right hand side of the ~ or !~ operators, it has the same meaning as if it appeared in a pattern, i.e. ($0 ~ /foo/) (see Section 6.5 [Expressions as Patterns], page 52). This means that the two code segments,

    if ($0 ~ /barfly/ || $0 ~ /camelot/)
        print "found"

and

    if (/barfly/ || /camelot/)
        print "found"

are exactly equivalent.

One rather bizarre consequence of this rule is that the following boolean expression is legal, but does not do what the user intended:

    if (/foo/ ~ $1) print "found foo"

This code is obviously testing $1 for a match against the regexp /foo/. But in fact, the expression (/foo/ ~ $1) actually means (($0 ~ /foo/) ~ $1). In other words, first match the input record against the regexp /foo/. The result will be either a 0 or a 1, depending upon the success or failure of the match. Then match that result against the first field in the record.

Another consequence of this rule is that the assignment statement
    matches = /foo/

will assign either 0 or 1 to the variable matches, depending upon the contents of the current input record.

Constant regular expressions are also used as the first argument for the sub and gsub functions (see Section 11.3 [Built-in Functions for String Manipulation], page 90).

This feature of the language was never well documented until the POSIX specification.

You may be wondering, when is

    $1 ~ /foo/ { . . . }

preferable to

    $1 ~ "foo" { . . . }

Since the right-hand sides of both ~ operators are constants, it is more efficient to use the /foo/ form: awk can note that you have supplied a regexp and store it internally in a form that makes pattern matching more efficient. In the second form, awk must first convert the string into this internal form, and then perform the pattern matching. The first form is also better style; it shows clearly that you intend a regexp match.
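The value of a constant regexp used as an expression can be printed directly. In this sketch the function name flag_foo and the sample input are invented; the variable matches is the one from the assignment discussed above.

```shell
# Hypothetical helper: store the result of matching $0 against /foo/ and print it.
flag_foo() {
    awk '{ matches = /foo/             # same as: matches = ($0 ~ /foo/)
           print matches ": " $0 }'
}

printf 'foo\nbar\nfood\n' | flag_foo
```

Each output line starts with 1 if the record contains foo and with 0 otherwise.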
8.2 Variables
Variables let you give names to values and refer to them later. You have already seen variables in many of the examples. The name of a variable must be a sequence of letters, digits and underscores, but it may not begin with a digit. Case is significant in variable names; a and A are distinct variables.

A variable name is a valid expression by itself; it represents the variable's current value. Variables are given new values with assignment operators and increment operators. See Section 8.7 [Assignment Expressions], page 64.

A few variables have special built-in meanings, such as FS, the field separator, and NF, the number of fields in the current input record. See Chapter 13 [Built-in Variables], page 101, for a list of them. These built-in variables can be used and assigned just like all other variables, but their values are also used or changed automatically by awk. Each built-in variable's name is made entirely of upper case letters.

Variables in awk can be assigned either numeric or string values. By default, variables are initialized to the null string, which is effectively zero if converted to a number. There is no need to initialize each variable explicitly in awk, the way you would in C or most other traditional languages.
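The default initialization and the case sensitivity of names can be checked with a few lines. This is just a sketch; variable_demo, unset_var, count and Count are names chosen for the example.

```shell
# Hypothetical helper: uninitialized variables and case-sensitive names.
variable_demo() {
    awk 'BEGIN {
        print "[" unset_var "]"        # null string: prints []
        print unset_var + 0            # converted to a number: prints 0
        count++                        # no initialization needed before incrementing
        Count = "a different variable" # Count and count are distinct
        print count, Count
    }'
}

variable_demo
```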
+x       Unary plus. No real effect on the expression.

x * y    Multiplication.

x / y    Division. Since all numbers in awk are double-precision floating point, the result is not rounded to an integer: 3 / 4 has the value 0.75.

x % y    Remainder. The quotient is rounded toward zero to an integer, multiplied by y and this result is subtracted from x. This operation is sometimes known as "trunc-mod." The following relation always holds:

             b * int(a / b) + (a % b) == a

         One possibly undesirable effect of this definition of remainder is that x % y is negative if x is negative. Thus,

             -17 % 8 = -1

         In other awk implementations, the signedness of the remainder may be machine dependent.

x ^ y
x ** y   Exponentiation: x raised to the y power. 2 ^ 3 has the value 8. The character sequence ** is equivalent to ^. (The POSIX standard only specifies the use of ^ for exponentiation.)
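The arithmetic facts quoted above can be checked directly. This is a small sketch with an invented function name, arith_demo; it sticks to ^ for exponentiation since that is the only form POSIX specifies.

```shell
# Hypothetical helper: division, remainder and exponentiation.
arith_demo() {
    awk 'BEGIN {
        print 3 / 4                      # 0.75: division is not rounded
        print -17 % 8                    # -1: remainder takes the sign of x
        print 2 ^ 3                      # 8
        a = -17; b = 8
        print b * int(a / b) + (a % b)   # -17: the relation from the text
    }'
}

arith_demo
```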
    print "something meaningful" > (file name)

We recommend you use parentheses around concatenation in all but the most common contexts (such as in the right-hand operand of =).
subscript in array
True if array array has an element with the subscript subscript.

Comparison expressions have the value 1 if true and 0 if false.

The rules gawk uses for performing comparisons are based on those in draft 11.2 of the POSIX standard. The POSIX standard introduced the concept of a numeric string, which is simply a string that looks like a number, for example, " +2".

When performing a relational operation, gawk considers the type of an operand to be the type it received on its last assignment, rather than the type of its last use (see Section 8.10 [Numeric and String Values], page 68). This type is unknown when the operand is from an "external" source: field variables, command line arguments, array elements resulting from a split operation, and the value of an ENVIRON element. In this case only, if the operand is a numeric string, then it is considered to be of both string type and numeric type. If at least one operand of a comparison is of string type only, then a string comparison is performed. Any numeric operand will be converted to a string using the value of CONVFMT (see Section 8.9 [Conversion of Strings and Numbers], page 67). If one operand of a comparison is numeric, and the other operand is either numeric or both numeric and string, then awk does a numeric comparison. If both operands have both types, then the comparison is numeric.

Strings are compared by comparing the first character of each, then the second character of each, and so on. Thus "10" is less than "9". If there are two strings where one is a prefix of the other, the shorter string is less than the longer one. Thus "abc" is less than "abcd".

Here are some sample expressions, how awk compares them, and what the result of the comparison is.

1.5 <= 2.0
        numeric comparison (true)

"abc" >= "xyz"
        string comparison (false)

1.5 != " +2"
        string comparison (true)

"1e2" < "3"
        string comparison (true)

a = 2; b = "2"
a == b
        string comparison (true)

    echo 1e2 3 | awk '{ print ($1 < $2) ? "true" : "false" }'

prints false since both $1 and $2 are numeric strings and thus have both string and numeric types, thus dictating a numeric comparison.

The purpose of the comparison rules and the use of numeric strings is to attempt to produce the behavior that is "least surprising," while still "doing the right thing."

String comparisons and regular expression comparisons are very different. For example,

    $1 == "foo"

has the value of 1, or is true, if the first field of the current input record is precisely foo. By contrast,

    $1 ~ /foo/

has the value 1 if the first field contains foo, such as foobar.

The right hand operand of the ~ and !~ operators may be either a constant regexp (/.../), or it may be an ordinary expression, in which case the value of the expression as a string is a dynamic regexp (see Section 6.2.1 [How to Use Regular Expressions], page 47).

In very recent implementations of awk, a constant regular expression in slashes by itself is also an expression. The regexp /regexp/ is an abbreviation for this comparison expression:

    $0 ~ /regexp/

In some contexts it may be necessary to write parentheses around the regexp to avoid confusing the awk parser. For example, (/x/ - /y/) > threshold is not allowed, but ((/x/) - (/y/)) > threshold parses properly.

One special place where /foo/ is not an abbreviation for $0 ~ /foo/ is when it is the right-hand operand of ~ or !~! See Section 8.1 [Constant Expressions], page 57, where this is discussed in more detail.
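As a rough illustration of the comparison rules, the following shell sketch runs three small awk programs. The input values and the shell variable names are made up for the example, and the results assume an awk that follows the POSIX rules for numeric strings.

```shell
# Fields that look like numbers are numeric strings: compared as numbers.
fields=$(echo "10 9" | awk '{ r = ($1 < $2); print r }')
# String constants have string type only: compared character by character.
consts=$(awk 'BEGIN { r = ("10" < "9"); print r }')
# == tests for an exact match; ~ tests whether the regexp matches anywhere.
kinds=$(echo foobar | awk '{ eq = ($1 == "foo"); re = ($1 ~ /foo/); print eq, re }')
echo "$fields $consts / $kinds"
```

The same two characters, 10 and 9, compare differently depending on whether they arrive as input fields or as string constants.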
Assignments can store string values also. For example, this would store the value "this food is good" in the variable message:

    thing = "food"
    predicate = "good"
    message = "this " thing " is " predicate

(This also illustrates concatenation of strings.)

The = sign is called an assignment operator. It is the simplest assignment operator because the value of the right-hand operand is stored unchanged.

Most operators (addition, concatenation, and so on) have no effect except to compute a value. If you ignore the value, you might as well not use the operator. An assignment operator is different; it does produce a value, but even if you ignore the value, the assignment still makes itself felt through the alteration of the variable. We call this a side effect.

The left-hand operand of an assignment need not be a variable (see Section 8.2 [Variables], page 59); it can also be a field (see Section 3.4 [Changing the Contents of a Field], page 24) or an array element (see Chapter 10 [Arrays in awk], page 81). These are all called lvalues, which means they can appear on the left-hand side of an assignment operator. The right-hand operand may be any expression; it produces the new value which the assignment stores in the specified variable, field or array element.

It is important to note that variables do not have permanent types. The type of a variable is simply the type of whatever value it happens to hold at the moment. In the following program fragment, the variable foo has a numeric value at first, and a string value later on:

    foo = 1
    print foo
    foo = "bar"
    print foo

When the second assignment gives foo a string value, the fact that it previously had a numeric value is forgotten.

An assignment is an expression, so it has a value: the same value that is assigned. Thus, z = 1 as an expression has the value 1. One consequence of this is that you can write multiple assignments together:

    x = y = z = 0

stores the value 0 in all three variables. It does this because the value of z = 0, which is 0, is stored into y, and then the value of y = z = 0, which is 0, is stored into x.

You can use an assignment anywhere an expression is called for. For example, it is valid to write x != (y = 1) to set y to 1 and then test whether x equals 1. But this style tends to make programs hard to read; except in a one-shot program, you should rewrite it to get rid of such nesting of assignments. This is never very hard.
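The behavior of assignment as an expression can be checked with a short sketch such as the one below. The variable names w and n are illustrative additions, not part of the examples in this section.

```shell
result=$(awk 'BEGIN {
    x = y = z = 0          # chained assignment, grouped right to left
    foo = 1                # foo holds a number
    foo = foo "bar"        # now foo holds the string "1bar"
    n = (w = 3) + 1        # an assignment used as a subexpression
    print x, y, z, foo, w, n
}')
echo "$result"
```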
Aside from =, there are several other assignment operators that do arithmetic with the old value of the variable. For example, the operator += computes a new value by adding the right-hand value to the old value of the variable. Thus, the following assignment adds 5 to the value of foo:

    foo += 5

This is precisely equivalent to the following:

    foo = foo + 5

Use whichever one makes the meaning of your program clearer.

Here is a table of the arithmetic assignment operators. In each case, the right-hand operand is an expression whose value is converted to a number.

lvalue += increment
    Adds increment to the value of lvalue to make the new value of lvalue.

lvalue -= decrement
    Subtracts decrement from the value of lvalue.

lvalue *= coefficient
    Multiplies the value of lvalue by coefficient.

lvalue /= quotient
    Divides the value of lvalue by quotient.

lvalue %= modulus
    Sets lvalue to its remainder by modulus.

lvalue ^= power
lvalue **= power
    Raises lvalue to the power power. (Only the ^= operator is specified by POSIX.)
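A minimal sketch that applies each POSIX arithmetic assignment operator in turn is shown here; the starting value and the variable names a through f are chosen only for illustration.

```shell
result=$(awk 'BEGIN {
    foo = 10
    foo += 5; a = foo      # 15
    foo -= 3; b = foo      # 12
    foo *= 2; c = foo      # 24
    foo /= 4; d = foo      # 6
    foo %= 4; e = foo      # 2
    foo ^= 3; f = foo      # 8
    print a, b, c, d, e, f
}')
echo "$result"
```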
The post-increment foo++ is nearly equivalent to writing (foo += 1) - 1. It is not perfectly equivalent because all numbers in awk are floating point: in floating point, foo + 1 - 1 does not necessarily equal foo. But the difference is minute as long as you stick to numbers that are fairly small (less than a trillion).

Any lvalue can be incremented. Fields and array elements are incremented just like variables. (Use $(i++) when you wish to do a field reference and a variable increment at the same time. The parentheses are necessary because of the precedence of the field reference operator, $.)

The decrement operator -- works just like ++ except that it subtracts 1 instead of adding. Like ++, it can be used before the lvalue to pre-decrement or after it to post-decrement.

Here is a summary of increment and decrement expressions.

++lvalue
    This expression increments lvalue and the new value becomes the value of this expression.

lvalue++
    This expression causes the contents of lvalue to be incremented. The value of the expression is the old value of lvalue.

--lvalue
    Like ++lvalue, but instead of adding, it subtracts. It decrements lvalue and delivers the value that results.

lvalue--
    Like lvalue++, but instead of adding, it subtracts. It decrements lvalue. The value of the expression is the old value of lvalue.
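The four forms can be compared side by side with a sketch like the following. The variable names and the sample input line are illustrative.

```shell
values=$(awk 'BEGIN {
    foo = 5
    a = foo++      # a gets the old value 5; foo becomes 6
    b = ++foo      # foo becomes 7; b gets the new value 7
    c = foo--      # c gets the old value 7; foo becomes 6
    d = --foo      # foo becomes 5; d gets the new value 5
    print a, b, c, d, foo
}')
# A field reference combined with a post-increment of the index.
field=$(echo "a b c" | awk '{ i = 1; first = $(i++); print first, i }')
echo "$values / $field"
```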
CONVFMT's default value is "%.6g", which prints a value with at most six significant digits. For some applications you will want to change it to specify more precision. Double precision on most modern machines gives you 16 or 17 decimal digits of precision.

Strange results can happen if you set CONVFMT to a string that doesn't tell sprintf how to format floating point numbers in a useful way. For example, if you forget the % in the format, all numbers will be converted to the same constant string.

As a special case, if a number is an integer, then the result of converting it to a string is always an integer, no matter what the value of CONVFMT may be. Given the following code fragment:

    CONVFMT = "%2.2f"
    a = 12
    b = a ""

b has the value "12", not "12.00".

Prior to the POSIX standard, awk specified that the value of OFMT was used for converting numbers to strings. OFMT specifies the output format to use when printing numbers with print. CONVFMT was introduced in order to separate the semantics of conversions from the semantics of printing. Both CONVFMT and OFMT have the same default value: "%.6g". In the vast majority of cases, old awk programs will not change their behavior. However, this use of OFMT is something to keep in mind if you must port your program to other implementations of awk; we recommend that instead of changing your programs, you just port gawk itself!
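The integer special case and the ordinary case can be seen together in a sketch such as this one. It assumes an awk that implements CONVFMT as described by POSIX; the value 3.14159 and the names x and y are illustrative.

```shell
result=$(awk 'BEGIN {
    CONVFMT = "%.2f"
    a = 12;      b = a ""     # integer value: CONVFMT is not used
    x = 3.14159; y = x ""     # non-integer value: converted with CONVFMT
    print b, y
}')
echo "$result"
```

Because b and y already hold strings when they reach print, OFMT plays no part in the output.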
The variable a receives a string value in the concatenation and assignment to b. The string value of a is "123.3". If the numeric value was lost when it was converted to a string, then the numeric use of a in the last statement would lose information. c would be assigned the value 124.954 instead of 124.975. Such errors accumulate rapidly, and very adversely affect numeric computations.

Once a numeric value acquires a corresponding string value, it stays valid until a new assignment is made. If CONVFMT (see Section 8.9 [Conversion of Strings and Numbers], page 67) changes in the meantime, the old string value will still be used. For example:

    BEGIN {
        CONVFMT = "%2.2f"
        a = 123.456
        b = a ""
        printf "a = %s\n", a
        CONVFMT = "%.6g"
        printf "a = %s\n", a
        a += 0
        printf "a = %s\n", a
    }
This program prints a = 123.46 twice, and then prints a = 123.456. See Section 8.9 [Conversion of Strings and Numbers], page 67, for the rules that specify how string values are made from numeric values.
This is guaranteed to increment i exactly once, because each time one or the other of the two increment expressions is executed, and the other is not.
Do not put any space between the function name and the open-parenthesis! A user-defined function name looks just like the name of a variable, and space would make the expression look like concatenation of a variable with an expression inside parentheses. Space before the parenthesis is harmless with built-in functions, but it is best not to get into the habit of using space to avoid mistakes with user-defined functions.

Each function expects a particular number of arguments. For example, the sqrt function must be called with a single argument, the number to take the square root of:

    sqrt(argument)

Some of the built-in functions allow you to omit the final argument. If you do so, they use a reasonable default. See Chapter 11 [Built-in Functions], page 89, for full details. If arguments are omitted in calls to user-defined functions, then those arguments are treated as local variables, initialized to the null string (see Chapter 12 [User-defined Functions], page 95).

Like every other expression, the function call has a value, which is computed by the function based on the arguments you give it. In this example, the value of sqrt(argument) is the square root of the argument. A function can also have side effects, such as assigning the values of certain variables or doing I/O.

Here is a command to read numbers, one number per line, and print the square root of each one:
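A minimal sketch of such a command follows. The wording of the printed message, the sample input values, and the shell variable name are illustrative.

```shell
result=$(printf '4\n9\n16\n' |
    awk '{ print "The square root of", $1, "is", sqrt($1) }')
echo "$result"
```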
which could be the operand of another operator. As a result, it does not make sense to use a redirection operator near another operator of lower precedence, without parentheses. Such combinations, for example print foo > a ? b : c, result in syntax errors.

concatenation
    No special token is used to indicate concatenation. The operands are simply written side by side.

add, subtract
    +, -.

multiply, divide, mod
    *, /, %.

unary plus, minus, not
    +, -, !.

exponentiation
    ^, **. These operators group right-to-left. (The ** operator is not specified by POSIX.)

increment, decrement
    ++, --.

field
    $.
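A few of these precedence relationships can be checked directly, as in the sketch below. The expressions are chosen only for illustration.

```shell
result=$(awk 'BEGIN {
    a = 2 ^ 3 ^ 2        # exponentiation groups right to left: 2 ^ (3 ^ 2)
    b = 2 + 3 * 4        # multiplication binds tighter than addition
    c = 1 " " 2 + 3      # addition binds tighter than concatenation
    print a; print b; print c
}')
echo "$result"
```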
Even if condition is false at the start, body is executed at least once (and only once, unless executing body makes condition true). Contrast this with the corresponding while statement:

    while (condition)
        body

This statement does not execute body even once if condition is false to begin with.

Here is an example of a do statement:

    awk '{ i = 1
           do {
               print $0
               i++
           } while (i <= 10)
    }'

prints each input record ten times. It isn't a very realistic example, since in this case an ordinary while would do just as well. But this reflects actual experience; there is only occasionally a real use for a do statement.
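The difference between the two statements shows up when the condition is false from the start. In the following sketch the counters ndo and nwhile are illustrative names.

```shell
result=$(awk 'BEGIN {
    i = 10
    do { ndo++ } while (i < 5)     # body runs once before the test
    while (i < 5) { nwhile++ }     # body never runs
    print ndo, nwhile + 0
}')
echo "$result"
```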
The same is true of the increment part; to increment additional variables, you must write separate statements at the end of the loop. The C compound expression, using C's comma operator, would be useful in this context, but it is not supported in awk.

Most often, increment is an increment expression, as in the example above. But this is not required; it can be any expression whatever. For example, this statement prints all the powers of 2 between 1 and 100:

    for (i = 1; i <= 100; i *= 2)
        print i

Any of the three expressions in the parentheses following the for may be omitted if there is nothing to be done there. Thus, for (;x > 0;) is equivalent to while (x > 0). If the condition is omitted, it is treated as true, effectively yielding an infinite loop (i.e., a loop that will never terminate).

In most cases, a for loop is an abbreviation for a while loop, as shown here:

    initialization
    while (condition) {
        body
        increment
    }

The only exception is when the continue statement (see Section 9.6 [The continue Statement], page 77) is used inside the loop; changing a for statement to a while statement in this way can change the effect of the continue statement inside the loop.

There is an alternate version of the for loop, for iterating over all the indices of an array:

    for (i in array)
        do something with array[i]

See Chapter 10 [Arrays in awk], page 81, for more information on this version of the for loop.

The awk language has a for statement in addition to a while statement because often a for loop is both less work to type and more natural to think of. Counting the number of iterations is very common in loops. It can be easier to think of this counting as part of looping rather than as something to do inside the loop.

The next section has more complicated examples of for loops.
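The correspondence between for and while can be tried out with a sketch like this one, which builds the powers of 2 both ways and compares the results. The accumulator variables a and b are illustrative.

```shell
result=$(awk 'BEGIN {
    for (i = 1; i <= 100; i *= 2)
        a = a " " i
    j = 1                        # initialization
    while (j <= 100) {           # condition
        b = b " " j              # body
        j *= 2                   # increment
    }
    same = (a == b)
    print substr(a, 2)           # drop the leading space
    print same
}')
echo "$result"
```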
    { num = $1
      for (div = 2; div*div <= num; div++)
          if (num % div == 0)
              break
      if (num % div == 0)
          printf "Smallest divisor of %d is %d\n", num, div
      else
          printf "%d is prime\n", num
    }

When the remainder is zero in the first if statement, awk immediately breaks out of the containing for loop. This means that awk proceeds immediately to the statement following the loop and continues processing. (This is very different from the exit statement which stops the entire awk program. See Section 9.8 [The exit Statement], page 79.)

Here is another program equivalent to the previous one. It illustrates how the condition of a for or while could just as well be replaced with a break inside an if:

    awk '# find smallest divisor of num
         { num = $1
           for (div = 2; ; div++) {
               if (num % div == 0) {
                   printf "Smallest divisor of %d is %d\n", num, div
                   break
               }
               if (div*div > num) {
                   printf "%d is prime\n", num
                   break
               }
           }
    }'
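To see break in action, the first version of the program can be run on a few sample numbers, as in this sketch. The input values 15, 7 and 49 are illustrative.

```shell
result=$(printf '15\n7\n49\n' | awk '{
    num = $1
    for (div = 2; div * div <= num; div++)
        if (num % div == 0)
            break                # leave the loop with div set to the divisor
    if (num % div == 0)
        printf "Smallest divisor of %d is %d\n", num, div
    else
        printf "%d is prime\n", num
}')
echo "$result"
```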
If one of the input records contains the string ignore, this example skips the print statement for that record, and continues back to the first statement in the loop.

This is not a practical example of continue, since it would be just as easy to write the loop like this:

    for (x in names)
        if (names[x] !~ /ignore/)
            print names[x]

The continue statement in a for loop directs awk to skip the rest of the body of the loop, and resume execution with the increment-expression of the for statement. The following program illustrates this fact:

    awk 'BEGIN {
         for (x = 0; x <= 20; x++) {
             if (x == 5)
                 continue
             printf ("%d ", x)
         }
         print ""
    }'

This program prints all the numbers from 0 to 20, except for 5, for which the printf is skipped. Since the increment x++ is not skipped, x does not remain stuck at 5.

Contrast the for loop above with the while loop:

    awk 'BEGIN {
         x = 0
         while (x <= 20) {
             if (x == 5)
                 continue
             printf ("%d ", x)
             x++
         }
         print ""
    }'

This program loops forever once x gets to 5.

As described above, the continue statement has no meaning when used outside the body of a loop.
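A shortened variant of the for loop, sketched here with a smaller range and a different skipped value, makes the effect easy to verify. The variables out and sep are illustrative helpers used to collect the output on one line.

```shell
result=$(awk 'BEGIN {
    for (x = 0; x <= 6; x++) {
        if (x == 3)
            continue             # skip the rest of the body; x++ still runs
        out = out sep x
        sep = " "
    }
    print out
}')
echo "$result"
```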
Contrast this with the effect of the getline function (see Section 3.7 [Explicit Input with getline], page 30). That too causes awk to read the next record immediately, but it does not alter the flow of control in any way. So the rest of the current action executes with a new input record.

At the highest level, awk program execution is a loop that reads an input record and then tests each rule's pattern against it. If you think of this loop as a for statement whose body contains the rules, then the next statement is analogous to a continue statement: it skips to the end of the body of this implicit loop, and executes the increment (which reads another record).

For example, if your awk program works only on records with four fields, and you don't want it to fail when given bad input, you might use this rule near the beginning of the program:

    NF != 4 {
        printf("line %d skipped: doesn't have 4 fields\n", FNR) > "/dev/stderr"
        next
    }

so that the following rules will not see the bad record. The error message is redirected to the standard error output stream, as error messages should be. See Section 4.7 [Standard I/O Streams], page 44.

According to the POSIX standard, the behavior is undefined if the next statement is used in a BEGIN or END rule. gawk will treat it as a syntax error.

If the next statement causes the end of the input to be reached, then the code in the END rules, if any, will be executed. See Section 6.7 [BEGIN and END Special Patterns], page 53.
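The way next hides a record from later rules can be observed with a sketch such as the following. The sample input and the counter skipped are illustrative, and the error message is left out to keep the output simple.

```shell
result=$(printf 'a b c d\nx y\n1 2 3 4\n' | awk '
NF != 4 { skipped++; next }     # later rules never see this record
        { print $1 }
END     { print skipped, "skipped" }')
echo "$result"
```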
    BEGIN {
        if (("date" | getline date_now) < 0) {
            print "Can't get system date" > "/dev/stderr"
            exit 4
        }
    }
10 Arrays in awk
An array is a table of values, called elements. The elements of an array are distinguished by their indices. Indices may be either numbers or strings. Each array has a name, which looks like a variable name, but must not be in use as a variable name in the same awk program.
[Figure: a conventional array drawn as a row of values, each labeled with its index.]
Only the values are stored; the indices are implicit from the order of the values. 8 is the value at index 0, because 8 appears in the position with 0 elements before it.

Arrays in awk are different: they are associative. This means that each array is a collection of pairs: an index, and its corresponding array element value:

    Element 4     Value 30
    Element 2     Value "foo"
    Element 1     Value 8
    Element 3     Value ""
We have shown the pairs in jumbled order because their order is irrelevant.
One advantage of an associative array is that new pairs can be added at any time. For example, suppose we add to the above array a tenth element whose value is "number ten". The result is this:

    Element 10    Value "number ten"
    Element 4     Value 30
    Element 2     Value "foo"
    Element 1     Value 8
    Element 3     Value ""

Now the array is sparse (i.e., some indices are missing): it has elements 1 through 4 and 10, but doesn't have elements 5, 6, 7, 8, or 9.

Another consequence of associative arrays is that the indices don't have to be positive integers. Any number, or even a string, can be an index. For example, here is an array which translates words from English into French:

    Element "dog"     Value "chien"
    Element "cat"     Value "chat"
    Element "one"     Value "un"
    Element 1         Value "un"

Here we decided to translate the number 1 in both spelled-out and numeric form, thus illustrating that a single array can have both numbers and strings as indices.

When awk creates an array for you, e.g., with the split built-in function, that array's indices are consecutive integers starting at 1. (See Section 11.3 [Built-in Functions for String Manipulation], page 90.)
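The translation table and the behavior of split can be tried out with a short sketch. The array name fr, the array parts, and the string passed to split are illustrative.

```shell
result=$(awk 'BEGIN {
    fr["dog"] = "chien"; fr["cat"] = "chat"
    fr["one"] = "un";    fr[1] = "un"       # string and numeric indices mixed
    print fr["dog"], fr["one"], fr[1]
    n = split("red green blue", parts)      # creates parts[1] .. parts[n]
    print n, parts[1], parts[n]
}')
echo "$result"
```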
    index in array

This expression tests whether or not the particular index exists, without the side effect of creating that element if it is not present. The expression has the value 1 (true) if array[index] exists, and 0 (false) if it does not exist.

For example, to test whether the array frequencies contains the index "2", you could write this statement:

    if ("2" in frequencies)
        print "Subscript \"2\" is present."

Note that this is not a test of whether or not the array frequencies contains an element whose value is "2". (There is no way to do that except to scan all the elements.) Also, this does not create frequencies["2"], while the following (incorrect) alternative would do so:

    if (frequencies["2"] != "")
        print "Subscript \"2\" is present."
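The side effect of the incorrect alternative can be made visible with a sketch like the one below. The index "3" and the variables before and after are illustrative.

```shell
result=$(awk 'BEGIN {
    frequencies["2"] = 7
    before = ("3" in frequencies)      # 0: no such element yet
    if (frequencies["3"] != "")        # the reference itself creates it
        print "not reached"
    after = ("3" in frequencies)       # 1: the element now exists, empty
    print before, after
}')
echo "$result"
```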
The first rule keeps track of the largest line number seen so far; it also stores each line into the array arr, at an index that is the line's number.

The second rule runs after all the input has been read, to print out all the lines.

When this program is run with the following input:

    5  I am the Five man
    2  Who are you?  The new number two!
    4  . . . And four on the floor
    1  Who is number one?
    3  I three you.

its output is this:

    1  Who is number one?
    2  Who are you?  The new number two!
    3  I three you.
    4  . . . And four on the floor
    5  I am the Five man

If a line number is repeated, the last line with a given number overrides the others.

Gaps in the line numbers can be handled with an easy improvement to the program's END rule:

    END {
        for (x = 1; x <= max; x++)
            if (x in arr)
                print arr[x]
    }
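Putting the description together, a program of this kind might look like the following sketch. The first rule is written from the description above rather than copied from a listing, so treat it as an assumption; the sample input, with gaps in the numbering, is illustrative.

```shell
result=$(printf '3 three\n1 one\n5 five\n' | awk '
{
    if ($1 > max)            # remember the largest line number seen
        max = $1
    arr[$1] = $0             # store the line under its own number
}
END {
    for (x = 1; x <= max; x++)
        if (x in arr)        # skip numbers that never appeared
            print arr[x]
}')
echo "$result"
```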
the word as index. The second rule scans the elements of used to find all the distinct words that appear in the input. It prints each word that is more than 10 characters long, and also prints the number of such words. See Chapter 11 [Built-in Functions], page 89, for more information on the built-in function length.

    # Record a 1 for each word that is used at least once.
    {
        for (i = 1; i <= NF; i++)
            used[$i] = 1
    }

    # Find number of distinct words more than 10 characters long.
    END {
        for (x in used)
            if (length(x) > 10) {
                ++num_long_words
                print x
            }
        print num_long_words, "words longer than 10 characters"
    }

See Appendix B [Sample Program], page 119, for a more detailed example of this type.

The order in which elements of the array are accessed by this statement is determined by the internal arrangement of the array elements within awk and cannot be controlled or changed. This can lead to problems if new elements are added to array by statements in body; you cannot predict whether or not the for loop will reach them. Similarly, changing var inside the loop can produce strange results. It is best to avoid such things.
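Because the traversal order is unspecified, a program should rely only on results that do not depend on it. The sketch below counts distinct words, which gives the same answer in any order; the sample input and the counter n are illustrative.

```shell
result=$(printf 'the quick the lazy\nquick fox\n' | awk '
{
    for (i = 1; i <= NF; i++)
        used[$i] = 1             # one element per distinct word
}
END {
    for (w in used)              # visiting order is up to awk
        n++
    print n
}')
echo "$result"
```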
    if (4 in foo)
        print "This will never be printed"

It is not an error to delete an element which does not exist.
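Both points can be checked with a small sketch. The index 9 and the variable present are illustrative.

```shell
result=$(awk 'BEGIN {
    foo[4] = "x"
    delete foo[4]
    delete foo[9]            # no such element: silently ignored
    present = (4 in foo)
    print present
}')
echo "$result"
```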
Multi-dimensional arrays are supported in awk through concatenation of indices into one string. What happens is that awk converts the indices into strings (see Section 8.9 [Conversion of Strings and Numbers], page 67) and concatenates them together, with a separator between them. This creates a single string that describes the values of the separate indices. The combined string is used as a single index into an ordinary, one-dimensional array. The separator used is the value of the built-in variable SUBSEP.

For example, suppose we evaluate the expression foo[5,12]="value" when the value of SUBSEP is "@". The numbers 5 and 12 are converted to strings and concatenated with an @ between them, yielding "5@12"; thus, the array element foo["5@12"] is set to "value".

Once the element's value is stored, awk has no record of whether it was stored with a single index or a sequence of indices. The two expressions foo[5,12] and foo[5 SUBSEP 12] always have the same value.

The default value of SUBSEP is the string "\034", which contains a nonprinting character that is unlikely to appear in an awk program or in the input data.

The usefulness of choosing an unlikely character comes from the fact that index values that contain a string matching SUBSEP lead to combined strings that are ambiguous. Suppose that SUBSEP were "@"; then foo["a@b", "c"] and foo["a", "b@c"] would be indistinguishable because both would actually be stored as foo["a@b@c"]. Because SUBSEP is "\034", such confusion can arise only when an index contains the character with ASCII code 034 (octal), which is a rare event.

You can test whether a particular index-sequence exists in a multi-dimensional array with the same operator in used for single dimensional arrays. Instead of a single index as the left-hand operand, write the whole sequence of indices, separated by commas, in parentheses:

    (subscript1, subscript2, ...) in array

The following example treats its input as a two-dimensional array of fields; it rotates this array 90 degrees clockwise and prints the result. It assumes that all lines have the same number of elements.

    awk '{
         if (max_nf < NF)
              max_nf = NF
         max_nr = NR
         for (x = 1; x <= NF; x++)
              vector[x, NR] = $x
    }

    END {
         for (x = 1; x <= max_nf; x++) {
              for (y = max_nr; y >= 1; --y)
                   printf("%s ", vector[x, y])
              printf("\n")
         }
    }'

When given the input:

    1 2 3 4 5 6
    2 3 4 5 6 1
    3 4 5 6 1 2
    4 5 6 1 2 3

it produces:

    4 3 2 1
    5 4 3 2
    6 5 4 3
    1 6 5 4
    2 1 6 5
    3 2 1 6
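The way a sequence of indices collapses into one string can be inspected directly, as in this sketch. It sets SUBSEP to "@" as in the discussion above; the array idx and the variable found are illustrative.

```shell
result=$(awk 'BEGIN {
    SUBSEP = "@"
    foo[5, 12] = "value"
    print foo["5@12"]            # same element, addressed by the combined string
    found = ((5, 12) in foo)
    print found
    for (k in foo) {
        split(k, idx, SUBSEP)    # recover the separate indices
        print idx[1], idx[2]
    }
}')
echo "$result"
```

Splitting the combined index on SUBSEP is a common way to get the individual subscripts back inside a for (k in array) loop.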
11 Built-in Functions
Built-in functions are functions that are always available for your awk program to call. This chapter defines all the built-in functions in awk; some of them are mentioned in other sections, but they are summarized here for your convenience. (You can also define new functions yourself. See Chapter 12 [User-defined Functions], page 95.)
rand()
    This gives you a random number. The values of rand are uniformly distributed between 0 and 1. The value can be 0 but is never 1.

    Often you want random integers instead. Here is a user-defined function you can use to obtain a random nonnegative integer less than n:

        function randint(n) {
            return int(n * rand())
        }

    The multiplication produces a random real number that is at least 0 and less than n. We then make it an integer (using int) between 0 and n - 1.

    Here is an example where a similar function is used to produce random integers between 1 and n. Note that this program will print a new random number for each input record.

        awk '
        # Function to roll a simulated die.
        function roll(n) { return 1 + int(rand() * n) }

        # Roll 3 six-sided dice and print total number of points.
        {
            printf("%d points\n", roll(6)+roll(6)+roll(6))
        }'

    Note: rand starts generating numbers from the same point, or seed, each time you run awk. This means that a program will produce the same results each time you run it. The numbers are random within one awk run, but predictable from run to run. This is convenient for debugging, but if you want a program to do different things each time it is used, you must change the seed to a value that will be different in each run. To do this, use srand.
srand(x)
    The function srand sets the starting point, or seed, for generating random numbers to the value x.

    Each seed value leads to a particular sequence of random numbers. Thus, if you set the seed to the same value a second time, you will get the same sequence of random numbers again.

    If you omit the argument x, as in srand(), then the current date and time of day are used for a seed. This is the usual way to get a sequence of random numbers that differs from one run to the next.

    The return value of srand is the previous seed. This makes it easy to keep track of the seeds for use in consistently reproducing sequences of random numbers.
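The relationship between the seed and the sequence can be checked with a sketch like the following. The seed value 42, the number of trial rolls, and the variables same and ok are illustrative.

```shell
result=$(awk '
function roll(n) { return 1 + int(rand() * n) }
BEGIN {
    srand(42); a = rand()
    srand(42); b = rand()        # same seed, so the same first value
    same = (a == b)
    ok = 1
    for (i = 0; i < 1000; i++) {
        r = roll(6)
        if (r < 1 || r > 6)      # a die roll must stay within 1..6
            ok = 0
    }
    print same, ok
}')
echo "$result"
```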
    is 5. By contrast, length(15 * 35) works out to 3. How? Well, 15 * 35 = 525, and 525 is then converted to the string "525", which has three characters.

    If no argument is supplied, length returns the length of $0.

    In older versions of awk, you could call the length function without any parentheses. Doing so is marked as deprecated in the POSIX standard. This means that while you can do this in your programs, it is a feature that can eventually be removed from a future version of the standard. Therefore, for maximal portability of your awk programs you should always supply the parentheses.

match(string, regexp)
    The match function searches the string, string, for the longest, leftmost substring matched by the regular expression, regexp. It returns the character position, or index, of where that substring begins (1, if it starts at the beginning of string). If no match is found, it returns 0.

    The match function sets the built-in variable RSTART to the index. It also sets the built-in variable RLENGTH to the length in characters of the matched substring. If no match is found, RSTART is set to 0, and RLENGTH to -1.

    For example:

        awk '{
               if ($1 == "FIND")
                 regex = $2
               else {
                 where = match($0, regex)
                 if (where)
                   print "Match of", regex, "found at", where, "in", $0
               }
        }'

    This program looks for lines that match the regular expression stored in the variable regex. This regular expression can be changed. If the first word on a line is FIND, regex is changed to be the second word on that line. Therefore, given:

        FIND fo*bar
        My program was a foobar
        But none of it would doobar
        FIND Melvin
        JF+KM
        This line is property of The Reality Engineering Co.
        This file created by Melvin.

    awk prints:

        Match of fo*bar found at 18 in My program was a foobar
        Match of Melvin found at 22 in This file created by Melvin.

split(string, array, fieldsep)
    This divides string into pieces separated by fieldsep, and stores the pieces in array. The first piece is stored in array[1], the second piece in array[2], and so forth. The string value of the third argument, fieldsep, is a regexp describing where to split string (much as FS can be a regexp describing where to split input records). If the fieldsep is omitted, the value of FS is used. split returns the number of elements created.

    The split function, then, splits strings into pieces in a manner similar to the way input lines are split into fields. For example:

        split("auto-da-fe", a, "-")

    splits the string auto-da-fe into three fields using - as the separator. It sets the contents of the array a as follows:
        a[1] = "auto"
        a[2] = "da"
        a[3] = "fe"

    The value returned by this call to split is 3.

    As with input field-splitting, when the value of fieldsep is " ", leading and trailing whitespace is ignored, and the elements are separated by runs of whitespace.

sprintf(format, expression1, ...)
    This returns (without printing) the string that printf would have printed out with the same arguments (see Section 4.5 [Using printf Statements for Fancier Printing], page 38). For example:

        sprintf("pi = %.2f (approx.)", 22/7)

    returns the string "pi = 3.14 (approx.)".

sub(regexp, replacement, target)
    The sub function alters the value of target. It searches this value, which should be a string, for the leftmost substring matched by the regular expression, regexp, extending this match as far as possible. Then the entire string is changed by replacing the matched text with replacement. The modified string becomes the new value of target.

    This function is peculiar because target is not simply used to compute a value, and not just any expression will do: it must be a variable, field or array reference, so that sub can store a modified value there. If this argument is omitted, then the default is to use and alter $0.

    For example:

        str = "water, water, everywhere"
        sub(/at/, "ith", str)

    sets str to "wither, water, everywhere", by replacing the leftmost, longest occurrence of at with ith.

    The sub function returns the number of substitutions made (either one or zero).

    If the special character & appears in replacement, it stands for the precise substring that was matched by regexp. (If the regexp can match more than one string, then this precise substring may vary.) For example:

        awk '{ sub(/candidate/, "& and his wife"); print }'

    changes the first occurrence of candidate to candidate and his wife on each input line.

    Here is another example:

        awk 'BEGIN {
                str = "daabaaa"
                sub(/a+/, "c&c", str)
                print str
        }'

    prints dcaacbaaa. (The regexp is written a+ rather than a*, because a* would match the empty string at the very beginning of str.)
    This shows how & can represent a non-constant string, and also illustrates the leftmost, longest rule.

    The effect of this special character (&) can be turned off by putting a backslash before it in the string. As usual, to insert one backslash in the string, you must write two backslashes. Therefore, write \\& in a string constant to include a literal & in the replacement. For example, here is how to replace the first | on each line with an &:

        awk '{ sub(/\|/, "\\&"); print }'

    Note: as mentioned above, the third argument to sub must be an lvalue. Some versions of awk allow the third argument to be an expression which is not an lvalue. In such a case, sub would still search for the pattern and return 0 or 1, but the result of the substitution (if any) would be thrown away because there is no place to put it. Such versions of awk accept expressions like this:

        sub(/USA/, "United States", "the USA and Canada")

    But that is considered erroneous in gawk.

gsub(regexp, replacement, target)
    This is similar to the sub function, except gsub replaces all of the longest, leftmost, nonoverlapping matching substrings it can find. The g in gsub stands for global, which means replace everywhere. For example:

        awk '{ gsub(/Britain/, "United Kingdom"); print }'

    replaces all occurrences of the string Britain with United Kingdom for all input records.

    The gsub function returns the number of substitutions made. If the variable to be searched and altered, target, is omitted, then the entire input record, $0, is used.

    As in sub, the characters & and \ are special, and the third argument must be an lvalue.

substr(string, start, length)
    This returns a length-character-long substring of string, starting at character number start. The first character of a string is character number one. For example, substr("washington", 5, 3) returns "ing".

    If length is not present, this function returns the whole suffix of string that begins at character number start. For example, substr("washington", 5) returns "ington". This is also the case if length is greater than the number of characters remaining in the string, counting from character number start.

tolower(string)
    This returns a copy of string, with each upper-case character in the string replaced with its corresponding lower-case character. Nonalphabetic characters are left unchanged. For example, tolower("MiXeD cAsE 123") returns "mixed case 123".

toupper(string)
    This returns a copy of string, with each lower-case character in the string replaced with its corresponding upper-case character. Nonalphabetic characters are left unchanged.
For example, toupper("MiXeD cAsE 123") returns "MIXED CASE 123".
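Several of the string functions described above can be exercised together in one sketch. The strings mostly come from the examples in this section; the regexp a+, the text passed to gsub, and the variable names are illustrative choices.

```shell
result=$(awk 'BEGIN {
    where = match("My program was a foobar", /fo*bar/)
    print where, RSTART, RLENGTH         # position and length of the match
    match("abc", /z/)
    print RSTART, RLENGTH                # values set when nothing matches
    n = split("auto-da-fe", a, "-")
    print n, a[1], a[2], a[3]
    str = "daabaaa"
    sub(/a+/, "c&c", str)                # & stands for the matched text
    print str
    text = "Britain and Britain"
    count = gsub(/Britain/, "UK", text)  # gsub returns the number of changes
    print count, text
    print substr("washington", 5, 3), substr("washington", 5)
    print toupper("MiXeD cAsE 123")
}')
echo "$result"
```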
    END {
        system("mail -s 'awk run done' operator < /dev/null")
    }

the system operator will be sent mail when the awk program finishes processing input and begins its end-of-input processing.

Note that much the same result can be obtained by redirecting print or printf into a pipe. However, if your awk program is interactive, system is useful for cranking up large self-contained programs, such as a shell or an editor.

Some operating systems cannot implement the system function. system causes a fatal error if it is not supported.
12 User-defined Functions
Complicated awk programs can often be simplified by defining your own functions. User-defined functions can be called just like built-in ones (see Section 8.12 [Function Calls], page 70), but it is up to you to define them, that is, to tell awk what they should do.
The arguments and local variables last only as long as the function body is executing. Once the body finishes, the shadowed variables come back. The function body can contain expressions which call functions. They can even call this function, either directly or by way of another function. When this happens, we say the function is recursive. There is no need in awk to put the definition of a function before all uses of the function. This is because awk reads the entire program before starting to execute any of it.
This program prints, in our special format, all the third fields that contain a positive number in our input. Therefore, when given:

     1.2   3.4    5.6   7.8
     9.10 11.12 -13.14 15.16
    17.18 19.20  21.22 23.24

this program, using our function to format the results, prints:

    5.6
    21.2

Here is a rather contrived example of a recursive function. It prints a string backwards:

    function rev (str, len) {
        if (len == 0) {
            printf "\n"
            return
        }
        printf "%c", substr(str, len, 1)
        rev(str, len - 1)
    }
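To see the recursion at work, the rev function can be called from a BEGIN rule. This is only a sketch: the sample string is our own, and the caller is assumed to pass the string's length as the second argument.

```shell
out=$(awk '
function rev(str, len) {
    if (len == 0) {          # nothing left: finish the line and unwind
        printf "\n"
        return
    }
    printf "%c", substr(str, len, 1)   # emit the last remaining character
    rev(str, len - 1)                  # recurse on the shorter prefix
}
BEGIN { rev("awk manual", length("awk manual")) }')
echo "$out"
```

Each call prints one character and shortens len by one, so the recursion depth equals the length of the string.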
    }

    BEGIN {
        a[1] = 1 ; a[2] = 2 ; a[3] = 3
        changeit(a, 2, "two")
        printf "a[1] = %s, a[2] = %s, a[3] = %s\n", a[1], a[2], a[3]
    }

prints a[1] = 1, a[2] = two, a[3] = 3, because calling changeit stores "two" in the second element of a.
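The difference between array and scalar arguments can be seen side by side in a small sketch. The function name set_both and its parameter names are hypothetical, introduced here only for illustration:

```shell
out=$(awk '
# set_both is a hypothetical helper, not part of the manual.
function set_both(arr, scalar) {
    arr[1] = "changed"     # arrays are passed by reference: the caller sees this
    scalar = "changed"     # scalars are passed by value: only the local copy changes
}
BEGIN {
    a[1] = "original"
    s = "original"
    set_both(a, s)
    print a[1], s
}')
echo "$out"
```

Only the array element is altered in the caller; the scalar s keeps its original value.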
    awk '
    function maxelt (vec,   i, ret) {
        for (i in vec) {
            if (ret == "" || vec[i] > ret)
                ret = vec[i]
        }
        return ret
    }

    # Load all fields of each record into nums.
    {
        for (i = 1; i <= NF; i++)
            nums[NR, i] = $i
    }

    END {
        print maxelt(nums)
    }'

Given the following input:

     1 5 23 8 16
    44 3 5 2 8 26
    256 291 1396 2962 100
    -6 467 998 1101
    99385 11 0 225

our program tells us (predictably) that:

    99385

is the largest number in our array.
13 Built-in Variables
Most awk variables are available for you to use for your own purposes; they never change except when your program assigns values to them, and never affect anything except when your program examines them. A few variables have special built-in meanings. Some of them awk examines automatically, so that they enable you to tell awk how to do certain things. Others are set automatically by awk, so that they carry information from the internal workings of awk to your program. This chapter documents all the built-in variables of awk. Most of them are also documented in the chapters where their areas of activity are described.
FS
OFMT
OFS
ORS
RS
SUBSEP
foo[12,3], it really accesses foo["12\0343"] (see Section 10.8 [Multi-dimensional Arrays], page 86).
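A short sketch can make the role of SUBSEP concrete. The array name foo follows the text; the variable names key, part, and n are our own. Splitting the stored index on SUBSEP recovers the two original subscripts:

```shell
out=$(awk 'BEGIN {
    foo[12, 3] = "x"
    for (key in foo) {
        n = split(key, part, SUBSEP)   # the stored index is "12" SUBSEP "3"
        print n, part[1], part[2]
    }
    if ((12 SUBSEP 3) in foo)          # the same element, named by explicit concatenation
        print "same element"
}')
echo "$out"
```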
The command-line arguments available to awk programs are stored in an array called ARGV. ARGC is the number of command-line arguments present. See Chapter 14 [Invoking awk], page 105. ARGV is indexed from zero to ARGC - 1. For example:

    awk 'BEGIN { for (i = 0; i < ARGC; i++) print ARGV[i] }' inventory-shipped BBS-list

In this example, ARGV[0] contains "awk", ARGV[1] contains "inventory-shipped", and ARGV[2] contains "BBS-list". The value of ARGC is 3, one more than the index of the last element in ARGV since the elements are numbered from zero.

The names ARGC and ARGV, as well as the convention of indexing the array from 0 to ARGC - 1, are derived from the C language's method of accessing command line arguments. Notice that the awk program is not entered in ARGV. The other special command line options, with their arguments, are also not entered. But variable assignments on the command line are treated as arguments, and do show up in the ARGV array.

Your program can alter ARGC and the elements of ARGV. Each time awk reaches the end of an input file, it uses the next element of ARGV as the name of the next input file. By storing a different string there, your program can change which files are read. You can use "-" to represent the standard input. By storing additional elements and incrementing ARGC you can cause additional files to be read.

If you decrease the value of ARGC, that eliminates input files from the end of the list. By recording the old value of ARGC elsewhere, your program can treat the eliminated arguments as something other than file names. To eliminate a file from the middle of the list, store the null string ("") into ARGV in place of the file's name. As a special feature, awk ignores file names that have been replaced with the null string.
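As an illustration of altering ARGV, the following sketch treats one argument as a flag rather than a file name. The word debug is a hypothetical convention of our own, not an awk feature; the BEGIN rule replaces it with the null string so that awk skips it, and "-" then names the standard input:

```shell
out=$(printf 'from stdin\n' | awk '
BEGIN {
    for (i = 1; i < ARGC; i++)
        if (ARGV[i] == "debug") {   # hypothetical flag word
            debug = 1
            ARGV[i] = ""            # awk skips null-string file names
        }
}
{
    prefix = debug ? "[debug] " : ""
    print prefix $0
}' debug -)
echo "$out"
```

The loop starts at index 1 because ARGV[0] holds the name of the awk command itself.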
ENVIRON
    This is an array that contains the values of the environment. The array indices are the environment variable names; the values are the values of the particular environment variables. For example, ENVIRON["HOME"] might be /u/close. Changing this array does not affect the environment passed on to any programs that awk may spawn via redirection or the system function. Some operating systems may not have environment variables. On such systems, the array ENVIRON is empty.

FILENAME
    This is the name of the file that awk is currently reading. If awk is reading from the standard input (in other words, there are no files listed on the command line), FILENAME is set to "-". FILENAME is changed each time a new file is read (see Chapter 3 [Reading Input Files], page 21).
FNR
    FNR is the current record number in the current file. FNR is incremented each time a new record is read (see Section 3.7 [Explicit Input with getline], page 30). It is reinitialized to 0 each time a new input file is started.

NF
    NF is the number of fields in the current input record. NF is set each time a new record is read, when a new field is created, or when $0 changes (see Section 3.2 [Examining Fields], page 22).

NR
    This is the number of input records awk has processed since the beginning of the program's execution (see Section 3.1 [How Input is Split into Records], page 21). NR is set each time a new record is read.

RLENGTH
    RLENGTH is the length of the substring matched by the match function (see Section 11.3 [Built-in Functions for String Manipulation], page 90). RLENGTH is set by invoking the match function. Its value is the length of the matched string, or -1 if no match was found.

RSTART
    RSTART is the start-index in characters of the substring matched by the match function (see Section 11.3 [Built-in Functions for String Manipulation], page 90). RSTART is set by invoking the match function. Its value is the position of the string where the matched substring starts, or 0 if no match was found.
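The way match sets RSTART and RLENGTH can be sketched as follows. The sample string and the two regexps are our own, chosen so that one match succeeds and one fails:

```shell
out=$(awk 'BEGIN {
    s = "the quick brown fox"
    if (match(s, /qu[a-z]+/))                         # succeeds at character 5
        print RSTART, RLENGTH, substr(s, RSTART, RLENGTH)
    match(s, /zebra/)                                 # fails: RSTART is 0, RLENGTH is -1
    print RSTART, RLENGTH
}')
echo "$out"
```

Passing RSTART and RLENGTH straight to substr is the usual way to extract the matched text.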
14 Invoking awk
There are two ways to run awk: with an explicit program, or with one or more program files. Here are templates for both of them; items enclosed in [...] in these templates are optional.

-f source-file
    Indicates that the awk program is to be found in source-file instead of in the first non-option argument.

-v var=val
    Sets the variable var to the value val before execution of the program begins. Such variable values are available inside the BEGIN rule (see below for a fuller explanation). The -v option can only set one variable, but you can use it more than once, setting another variable each time, like this: -v foo=1 -v bar=2.

Any other options are flagged as invalid with a warning message, but are otherwise ignored.

If the -f option is not used, then the first non-option command line argument is expected to be the program text.

The -f option may be used more than once on the command line. If it is, awk reads its program source from all of the named files, as if they had been concatenated together into one big file. This is useful for creating libraries of awk functions. Useful functions can be written once, and then retrieved from a standard place, instead of having to be included into each individual program.

You can still type in a program at the terminal and use library functions, by specifying -f /dev/tty. awk will read a file from the terminal to use as part of the awk program. After typing your program, type Control-d (the end-of-file character) to terminate it. (You may also use -f - to read program source from the standard input, but then you will not be able to also use the standard input as a source of data.)
The distinction between file name arguments and variable-assignment arguments is made when awk is about to open the next input file. At that point in execution, it checks the file name to see whether it is really a variable assignment; if so, awk sets the variable instead of reading a file. Therefore, the variables actually receive the specified values after all previously specified files have been read. In particular, the values of variables assigned in this fashion are not available inside a BEGIN rule (see Section 6.7 [BEGIN and END Special Patterns], page 53), since such rules are run before awk begins scanning the argument list. The values given on the command line are processed for escape sequences (see Section 8.1 [Constant Expressions], page 57).

In some earlier implementations of awk, when a variable assignment occurred before any file names, the assignment would happen before the BEGIN rule was executed. Some applications came to depend upon this feature. When awk was changed to be more consistent, the -v option was added to accommodate applications that depended upon this old behavior.

The variable assignment feature is most useful for assigning to variables such as RS, OFS, and ORS, which control input and output formats, before scanning the data files. It is also useful for controlling state if multiple passes are needed over a data file. For example:

    awk 'pass == 1  { pass 1 stuff }
         pass == 2  { pass 2 stuff }' pass=1 datafile pass=2 datafile
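The timing difference between -v and a command-line assignment can be observed directly. In this sketch the variable names early and late are hypothetical; early is set with -v, while late is set by an assignment argument that awk processes only when it reaches that point in the argument list:

```shell
out=$(printf 'x\n' | awk -v early=1 '
BEGIN { print "BEGIN:", early, "[" late "]" }    # late has no value yet
      { print "main:",  early, "[" late "]" }    # late was assigned before reading "-"
' late=2 -)
echo "$out"
```

This assumes an awk that follows the consistent behavior described above; the earlier implementations mentioned in the text would also show late inside BEGIN.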
Given the variable assignment feature, the -F option is not strictly necessary. It remains for historical compatibility.
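As a sketch of why -F is not strictly necessary, the two commands below should split the same colon-separated record identically. The sample input and the shell variable names are our own:

```shell
with_F=$(printf 'a:b:c\n' | awk -F: '{ print $2 }')
with_assign=$(printf 'a:b:c\n' | awk '{ print $2 }' FS=: -)   # assignment processed before "-" is read
echo "$with_F $with_assign"
```

The explicit "-" after the assignment names the standard input as the data source.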
-f program-file
    Read the awk program source from the file program-file, instead of from the first command line argument.

-v var=val
    Assign the variable var the value val before program execution begins.

--
    Signal the end of options. This is useful to allow further arguments to the awk program itself to start with a -. This is mainly for consistency with the argument parsing conventions of POSIX.

Any other options are flagged as invalid, but are otherwise ignored. See Chapter 14 [Invoking awk], page 105, for more details.
awk first reads the program source from the program-file(s) if specified, or from the first nonoption argument on the command line. The -f option may be used multiple times on the command line. awk reads the program text from all the program-file files, effectively concatenating them in the order they are specified. This is useful for building libraries of awk functions, without having to include them in each new awk program that uses them. To use a library function in a file from a program typed in on the command line, specify -f /dev/tty; then type your program, and end it with a Control-d. See Chapter 14 [Invoking awk], page 105.
awk compiles the program into an internal form, and then proceeds to read each file named in the ARGV array. If there are no files named on the command line, awk reads the standard input. If a file named on the command line has the form var=val, it is treated as a variable assignment: the variable var is assigned the value val. If any of the files have a value that is the null string, that element in the list is skipped. For each line in the input, awk tests to see if it matches any pattern in the awk program. For each pattern that the line matches, the associated action is executed.
A.3.1 Fields
As each input line is read, awk splits the line into fields, using the value of the FS variable as the field separator. If FS is a single character, fields are separated by that character. Otherwise, FS is expected to be a full regular expression. In the special case that FS is a single blank, fields are separated by runs of blanks and/or tabs.

Each field in the input line may be referenced by its position, $1, $2, and so on. $0 is the whole line. The value of a field may be assigned to as well. Field numbers need not be constants:

    n = 5
    print $n

prints the fifth field in the input line. The variable NF is set to the total number of fields in the input line.

References to nonexistent fields (i.e., fields after $NF) return the null string. However, assigning to a nonexistent field (e.g., $(NF+2) = 5) increases the value of NF, creates any intervening fields with the null string as their value, and causes the value of $0 to be recomputed, with the fields being separated by the value of OFS. See Chapter 3 [Reading Input Files], page 21, for a full description of the way awk defines and uses fields.
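The effect of assigning to a nonexistent field can be sketched with a three-field record. The sample input and the choice of "-" for OFS are our own, picked so that the empty intervening field is visible:

```shell
out=$(printf 'a b c\n' | awk -v OFS=- '{
    $(NF + 2) = "e"    # creates an empty $4 and sets $5
    print NF           # NF has grown from 3 to 5
    print              # $0 was rebuilt with OFS between the fields
}')
echo "$out"
```

The doubled separator in the second output line marks the empty fourth field.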
ARGV
    The array of command line arguments. The array is indexed from 0 to ARGC - 1. Dynamically changing the contents of ARGV can control the files used for data.

CONVFMT
    The conversion format to use when converting numbers to strings.

ENVIRON
    An array containing the values of the environment variables. The array is indexed by variable name, each element being the value of that variable. Thus, the environment variable HOME would be in ENVIRON["HOME"]. Its value might be /u/close. Changing this array does not affect the environment seen by programs which awk spawns via redirection or the system function. Some operating systems do not have environment variables. The array ENVIRON is empty when running on these systems.

FILENAME
    The name of the current input file. If no files are specified on the command line, the value of FILENAME is "-".

FNR
    The input record number in the current input file.

FS
    The input field separator, a blank by default.

NF
    The number of fields in the current input record.

NR
    The total number of input records seen so far.

OFMT
    The output format for numbers for the print statement, "%.6g" by default.

OFS
    The output field separator, a blank by default.

ORS
    The output record separator, by default a newline.

RS
    The input record separator, by default a newline. RS is exceptional in that only the first character of its string value is used for separating records. If RS is set to the null string, then records are separated by blank lines. When RS is set to the null string, then the newline character always acts as a field separator, in addition to whatever value FS may have.

RSTART
    The index of the first character matched by match; 0 if no match.

RLENGTH
    The length of the string matched by match; -1 if no match.

SUBSEP
    The string used to separate multiple subscripts in array elements, by default "\034".
A.3.3 Arrays
Arrays are subscripted with an expression between square brackets ([ and ]). Array subscripts are always strings; numbers are converted to strings as necessary, following the standard conversion rules (see Section 8.9 [Conversion of Strings and Numbers], page 67). If you use multiple expressions separated by commas inside the square brackets, then the array subscript is a string consisting of the concatenation of the individual subscript values, converted to strings, separated by the subscript separator (the value of SUBSEP). The special operator in may be used in an if or while statement to see if an array has an index consisting of a particular value.
    if (val in array)
        print array[val]

If the array has multiple subscripts, use (i, j, ...) in array to test for existence of an element. The in construct may also be used in a for loop to iterate over all the elements of an array. See Section 10.5 [Scanning all Elements of an Array], page 84. An element may be deleted from an array using the delete statement. See Chapter 10 [Arrays in awk], page 81, for more detailed information.
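The membership test and the delete statement can be tried together in a short sketch. The array name grid and the printed messages are our own:

```shell
out=$(awk 'BEGIN {
    grid[1, 2] = "x"
    if ((1, 2) in grid)        # test a multi-subscript element without creating it
        print "present"
    delete grid[1, 2]
    if (!((1, 2) in grid))
        print "deleted"
}')
echo "$out"
```

Using in, rather than comparing grid[1, 2] with the null string, avoids creating the element as a side effect of the test.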
{ print } which prints the entire line. Comments begin with the # character, and continue until the end of the line. Blank lines may be used to separate statements. Normally, a statement ends with a newline; however, this is not the case for lines ending in ",", "{", "?", ":", "&&", or "||". Lines ending in do or else also have their statements automatically continued on the following line. In other cases, a line can be continued by ending it with a "\", in which case the newline is ignored. Multiple statements may be put on one line by separating them with a ";". This applies to both the statements within the action part of a rule (the usual case), and to the rule statements. See Section 2.5 [Comments in awk Programs], page 18, for information on awk's commenting convention; see Section 2.6 [awk Statements versus Lines], page 19, for a description of the line continuation mechanism in awk.
A.4.1 Patterns
awk patterns may be one of the following:

    /regular expression/
    relational expression
    pattern && pattern
    pattern || pattern
    pattern ? pattern : pattern
    (pattern)
    ! pattern
    pattern1, pattern2
    BEGIN
    END

BEGIN and END are two special kinds of patterns that are not tested against the input. The action parts of all BEGIN rules are merged as if all the statements had been written in a single BEGIN rule. They are executed before any of the input is read. Similarly, all the END rules are merged, and executed when all the input is exhausted (or when an exit statement is executed). BEGIN and END patterns cannot be combined with other patterns in pattern expressions. BEGIN and END rules cannot have missing action parts.

For /regular-expression/ patterns, the associated statement is executed for each input line that matches the regular expression. Regular expressions are extensions of those in egrep, and are summarized below.

A relational expression may use any of the operators defined below in the section on actions. These generally test whether certain fields match certain regular expressions.

The &&, ||, and ! operators are logical "and," logical "or," and logical "not," respectively, as in C. They do short-circuit evaluation, also as in C, and are used for combining more primitive pattern expressions. As in most languages, parentheses may be used to change the order of evaluation.

The ?: operator is like the same operator in C. If the first pattern matches, then the second pattern is matched against the input record; otherwise, the third is matched. Only one of the second and third patterns is matched.

The pattern1, pattern2 form of a pattern is called a range pattern. It matches all input lines starting with a line that matches pattern1, and continuing until a line that matches pattern2, inclusive. A range pattern cannot be used as an operand to any of the pattern operators. See Chapter 6 [Patterns], page 47, for a full description of the pattern part of awk rules.
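A range pattern can be sketched with a five-line input. The marker words START and END are our own sample data, and the rule has no action, so matching lines are simply printed:

```shell
out=$(printf 'a\nSTART\nb\nEND\nc\n' | awk '/START/, /END/')
echo "$out"
```

Both boundary lines are printed, since the range is inclusive.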
[^abc...]   matches any character except abc... and newline (negated character class).
r1|r2       matches either r1 or r2 (alternation).
r1r2        matches r1, and then r2 (concatenation).
r+          matches one or more r's.
r*          matches zero or more r's.
r?          matches zero or one r's.
(r)         matches r (grouping).
See Section 6.2 [Regular Expressions as Patterns], page 47, for a more detailed explanation of regular expressions. The escape sequences allowed in string constants are also valid in regular expressions (see Section 8.1 [Constant Expressions], page 57).
A.4.3 Actions
Action statements are enclosed in braces, { and }. Action statements consist of the usual assignment, conditional, and looping statements found in most languages. The operators, control statements, and input/output statements available are patterned after those in C.
A.4.3.1 Operators
The operators in awk, in order of increasing precedence, are:

= += -= *= /= %= ^=
    Assignment. Both absolute assignment (var = value) and operator assignment (the other forms) are supported.

?:
    A conditional expression, as in C. This has the form expr1 ? expr2 : expr3. If expr1 is true, the value of the expression is expr2; otherwise it is expr3. Only one of expr2 and expr3 is evaluated.

||
    Logical "or".

&&
    Logical "and".

~ !~
    Regular expression match, negated match.

< <= > >= != ==
    The usual relational operators.

blank
    String concatenation.

+ -
    Addition and subtraction.

* / %
    Multiplication, division, and modulus.

+ - !
    Unary plus, unary minus, and logical negation.

^
    Exponentiation (** may also be used, and **= for the assignment operator, but they are not specified in the POSIX standard).

++ --
    Increment and decrement, both prefix and postfix.

$
    Field reference.

See Chapter 8 [Expressions as Action Statements], page 57, for a full description of all the operators listed above. See Section 3.2 [Examining Fields], page 22, for a description of the field reference operator.
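A few one-line expressions can illustrate how the precedence order plays out in practice. The particular expressions are our own; each relies only on the ordering listed above:

```shell
out=$(awk 'BEGIN {
    print 1 " " 2 + 3    # addition binds tighter than concatenation
    print -2 ^ 2         # exponentiation binds tighter than unary minus
    print 2 + 3 * 4      # multiplication binds tighter than addition
}')
echo "$out"
```

When in doubt, parentheses make the intended grouping explicit and cost nothing.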
getline <file
    Set $0 from next record of file; set NF.

getline var
    Set var from next input record; set NR, FNR.

getline var <file
    Set var from next record of file.

next
    Stop processing the current input record. The next input record is read and processing starts over with the first pattern in the awk program. If the end of the input data is reached, the END rule(s), if any, are executed.

print
    Prints the current record.

print expr-list
    Prints expressions.

print expr-list > file
    Prints expressions on file.

printf fmt, expr-list
    Format and print.

printf fmt, expr-list > file
    Format and print on file.

Other input/output redirections are also allowed. For print and printf, >> file appends output to the file, and | command writes on a pipe. In a similar fashion, command | getline pipes input into getline. getline returns 0 on end of file, and -1 on an error.

See Section 3.7 [Explicit Input with getline], page 30, for a full description of the getline statement. See Chapter 4 [Printing Output], page 35, for a full description of print and printf. Finally, see Section 9.7 [The next Statement], page 78, for a description of how the next statement works.
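The return values of getline lend themselves to a loop that reads until the input runs out. In this sketch the command string and the file path /no/such/file are hypothetical, chosen so that one read source succeeds and the other cannot be opened:

```shell
out=$(awk 'BEGIN {
    cmd = "echo one; echo two"
    while ((cmd | getline line) > 0)      # 1 per record read, 0 at end of input
        n++
    close(cmd)
    print n, line                         # line still holds the last record read
    r = (getline x < "/no/such/file")     # hypothetical path that should not exist
    print r
}')
echo "$out"
```

Testing for "> 0" rather than for a nonzero value keeps the loop from spinning forever when getline reports an error.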
%d %i %e %f
%g      Use %e or %f conversion, whichever produces a shorter string, with nonsignificant zeros suppressed.
%o      An unsigned octal number (again, an integer).
%s      A character string.
%x      An unsigned hexadecimal number (an integer).
%X      Like %x, except use A through F instead of a through f for decimal 10 through 15.
%%      A single % character; no argument is converted.

There are optional, additional parameters that may lie between the % and the control letter:

-       The expression should be left-justified within its field.
width   The field should be padded to this width. If width has a leading zero, then the field is padded with zeros. Otherwise it is padded with blanks.
.prec   A number indicating the maximum width of strings or digits to the right of the decimal point.

Either or both of the width and prec values may be specified as *. In that case, the particular value is taken from the argument list. See Section 4.5 [Using printf Statements for Fancier Printing], page 38, for examples and for a more detailed description.
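The modifiers can be compared in a brief sketch. The brackets are our own addition, there only to make the padding visible, and the sample values are arbitrary:

```shell
out=$(awk 'BEGIN {
    printf "[%-6s]\n", "ab"         # left-justified in a field of width 6
    printf "[%06.2f]\n", 3.14159    # zero-padded to width 6, two digits after the point
    printf "[%.3s]\n", "abcdef"     # at most three characters of the string
}')
echo "$out"
```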
sin(expr)
    returns the sine of expr, with expr in radians.

sqrt(expr)
    the square root function.

srand(expr)
    use expr as a new seed for the random number generator. If no expr is provided, the time of day is used. The return value is the previous seed for the random number generator.
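A short sketch shows these functions in use. It assumes the standard rand function, which draws from the generator that srand seeds; the seed value 42 and the variable names are arbitrary choices of our own:

```shell
out=$(awk 'BEGIN {
    print sqrt(16)
    print sin(0)
    srand(42); a = rand()
    srand(42); b = rand()     # reseeding with the same value restarts the sequence
    same = (a == b)
    print same
}')
echo "$out"
```

Seeding explicitly like this is useful when a program's random choices need to be reproducible from one run to the next.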
\r \t \v
\xhex digits
    The character represented by the string of hexadecimal digits following the \x. As in ANSI C, all following hexadecimal digits are considered part of the escape sequence. (This feature should tell us something about language design by committee.) E.g., "\x1B" is a string containing the ASCII ESC (escape) character. (The \x escape sequence is not in POSIX awk.)

\ddd
    The character represented by the 1-, 2-, or 3-digit sequence of octal digits. Thus, "\033" is also a string containing the ASCII ESC (escape) character.

\c
    The literal character c.
The escape sequences may also be used inside constant regular expressions (e.g., the regexp /[ \t\f\n\r\v]/ matches whitespace characters). See Section 8.1 [Constant Expressions], page 57.
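Escape sequences in strings and in regexps can be tried out as follows. The sample strings are our own; the octal codes 101 and 102 are the ASCII codes for A and B:

```shell
out=$(awk 'BEGIN {
    print "\101\102"           # octal escapes for "A" and "B"
    s = "a \t b"
    gsub(/[ \t]+/, "_", s)     # \t inside a constant regexp matches a tab
    print s
}')
echo "$out"
```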
A.5 Functions
Functions in awk are defined as follows:

    function name (parameter list) { statements }

Actual parameters supplied in the function call are used to instantiate the formal parameters declared in the function. Arrays are passed by reference, other variables are passed by value.

If there are fewer arguments passed than there are names in parameter-list, the extra names are given the null string as value. Extra names have the effect of local variables.

The open-parenthesis in a function call of a user-defined function must immediately follow the function name, without any intervening white space. This is to avoid a syntactic ambiguity with the concatenation operator.

The word func may be used in place of function (but not in POSIX awk). Use the return statement to return a value from a function. See Chapter 12 [User-defined Functions], page 95, for a more complete description.
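The use of extra parameter names as local variables can be sketched as follows. The function name sum_to is hypothetical; the extra spaces in its parameter list are only a visual convention separating the real argument from the locals:

```shell
out=$(awk '
# sum_to is a hypothetical example function; i and total act as locals.
function sum_to(n,    i, total) {
    for (i = 1; i <= n; i++)
        total += i           # total starts out null, which counts as 0
    return total
}
BEGIN {
    i = "untouched"          # the global i is shadowed inside sum_to
    print sum_to(4), i
}')
echo "$out"
```

Without i in the parameter list, the loop would overwrite the global variable of the same name.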
characters. Finally, we use the system sort utility to process the output of the awk script. First, here is the new version of the program:

    awk '
    # Print list of word frequencies
    {
        $0 = tolower($0)                 # remove case distinctions
        gsub(/[^a-z0-9_ \t]/, "", $0)    # remove punctuation
        for (i = 1; i <= NF; i++)
            freq[$i]++
    }

    END {
        for (word in freq)
            printf "%s\t%d\n", word, freq[word]
    }'

Assuming we have saved this program in a file named frequency.awk, and that the data is in file1, the following pipeline

    awk -f frequency.awk file1 | sort +1 -nr

produces a table of the words appearing in file1 in order of decreasing frequency.

The awk program suitably massages the data and produces a word frequency table, which is not ordered. The awk script's output is then sorted by the sort command and printed on the terminal. The options given to sort in this example specify to sort using the second field of each input line (skipping one field), that the sort keys should be treated as numeric quantities (otherwise 15 would come before 5), and that the sorting should be done in descending (reverse) order.

We could have even done the sort from within the program, by changing the END action to:

    END {
        sort = "sort +1 -nr"
        for (word in freq)
            printf "%s\t%d\n", word, freq[word] | sort
        close(sort)
    }

See the general operating system documentation for more information on how to use the sort command.
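The whole pipeline can be tried on a small inline sample. This is a sketch under two assumptions of our own: the two-line input is made up, and the sort key is written in the POSIX -k form, since many current sort implementations no longer accept the older +1 notation. A second key on the word itself is added so that ties come out in a predictable order:

```shell
out=$(printf 'The cat. the dog!\nA cat?\n' | awk '
{
    $0 = tolower($0)                 # remove case distinctions
    gsub(/[^a-z0-9_ \t]/, "", $0)    # remove punctuation
    for (i = 1; i <= NF; i++)
        freq[$i]++
}
END {
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
}' | sort -k 2nr -k 1,1)
echo "$out"
```

The words with a count of 2 are listed first, followed by those that occur once.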
Appendix C Glossary
Action
    A series of awk statements attached to a rule. If the rule's pattern matches an input record, the awk language executes the rule's action. Actions are always enclosed in curly braces. See Chapter 7 [Overview of Actions], page 55.

Amazing awk Assembler
    Henry Spencer at the University of Toronto wrote a retargetable assembler completely as awk scripts. It is thousands of lines long, including machine descriptions for several 8-bit microcomputers. It is a good example of a program that would have been better written in another language.

ANSI
    The American National Standards Institute. This organization produces many standards, among them the standard for the C programming language.

Assignment
    An awk expression that changes the value of some awk variable or data object. An object that you can assign to is called an lvalue. See Section 8.7 [Assignment Expressions], page 64.

awk Language
    The language in which awk programs are written.

awk Program
    An awk program consists of a series of patterns and actions, collectively known as rules. For each input record given to the program, the program's rules are all processed in turn. awk programs may also contain function definitions.

awk Script
    Another name for an awk program.

Built-in Function
    The awk language provides built-in functions that perform various numerical, time stamp related, and string computations. Examples are sqrt (for the square root of a number) and substr (for a substring of a string). See Chapter 11 [Built-in Functions], page 89.

Built-in Variable
    ARGC, ARGV, CONVFMT, ENVIRON, FILENAME, FNR, FS, NF, NR, OFMT, OFS, ORS, RLENGTH, RSTART, RS, and SUBSEP are the variables that have special meaning to awk. Changing some of them affects awk's running environment. See Chapter 13 [Built-in Variables], page 101.

Braces
    See "Curly Braces."

C
    The system programming language that most GNU software is written in. The awk programming language has C-like syntax, and this manual points out similarities between awk and C when appropriate.

CHEM
    A preprocessor for pic that reads descriptions of molecules and produces pic input for drawing them. It was written by Brian Kernighan, and is available from netlib@research.att.com.

Compound Statement
    A series of awk statements, enclosed in curly braces. Compound statements may be nested. See Chapter 9 [Control Statements in Actions], page 73.

Concatenation
    Concatenating two strings means sticking them together, one after another, giving a new string. For example, the string foo concatenated with the string bar gives the string foobar. See Section 8.4 [String Concatenation], page 61.
Conditional Expression
    An expression using the ?: ternary operator, such as expr1 ? expr2 : expr3. The expression expr1 is evaluated; if the result is true, the value of the whole expression is the value of expr2; otherwise the value is expr3. In either case, only one of expr2 and expr3 is evaluated. See Section 8.11 [Conditional Expressions], page 69.

Constant Regular Expression
    A constant regular expression is a regular expression written within slashes, such as /foo/. This regular expression is chosen when you write the awk program, and cannot be changed during its execution. See Section 6.2.1 [How to Use Regular Expressions], page 47.

Comparison Expression
    A relation that is either true or false, such as (a < b). Comparison expressions are used in if, while, and for statements, and in patterns to select which input records to process. See Section 8.5 [Comparison Expressions], page 62.

Curly Braces
    The characters { and }. Curly braces are used in awk for delimiting actions, compound statements, and function bodies.

Data Objects
    These are numbers and strings of characters. Numbers are converted into strings and vice versa, as needed. See Section 8.9 [Conversion of Strings and Numbers], page 67.

Dynamic Regular Expression
    A dynamic regular expression is a regular expression written as an ordinary expression. It could be a string constant, such as "foo", but it may also be an expression whose value may vary. See Section 6.2.1 [How to Use Regular Expressions], page 47.

Escape Sequences
    A special sequence of characters used for describing nonprinting characters, such as \n for newline, or \033 for the ASCII ESC (escape) character. See Section 8.1 [Constant Expressions], page 57.

Field
    When awk reads an input record, it splits the record into pieces separated by whitespace (or by a separator regexp which you can change by setting the built-in variable FS). Such pieces are called fields. See Section 3.1 [How Input is Split into Records], page 21.

Format
    Format strings are used to control the appearance of output in the printf statement. Also, data conversions from numbers to strings are controlled by the format string contained in the built-in variable CONVFMT. See Section 4.5.2 [Format-Control Letters], page 38.

Function
    A specialized group of statements often used to encapsulate general or program-specific tasks. awk has a number of built-in functions, and also allows you to define your own. See Chapter 11 [Built-in Functions], page 89. Also, see Chapter 12 [User-defined Functions], page 95.

gawk
    The GNU implementation of awk.

GNU
    "GNU's not Unix". An on-going project of the Free Software Foundation to create a complete, freely distributable, POSIX-compliant computing environment.

Input Record
    A single chunk of data read in by awk. Usually, an awk input record consists of one line of text. See Section 3.1 [How Input is Split into Records], page 21.

Keyword
    In the awk language, a keyword is a word that has special meaning. Keywords are reserved and may not be used as variable names. awk's keywords are: if, else, while, do...while, for, for...in, break, continue, delete, next, function, func, and exit.
Lvalue
An expression that can appear on the left side of an assignment operator. In most languages, lvalues can be variables or array elements. In awk, a field designator can also be used as an lvalue.
Number
A numeric valued data object. The awk implementation uses double precision floating point to represent numbers.
Pattern
Patterns tell awk which input records are interesting to which rules. A pattern is an arbitrary conditional expression against which input is tested. If the condition is satisfied, the pattern is said to match the input record. A typical pattern might compare the input record against a regular expression. See Chapter 6 [Patterns], page 47.
POSIX
The name for a series of standards being developed by the IEEE that specify a Portable Operating System interface. The "IX" denotes the Unix heritage of these standards. The main standard of interest for awk users is P1003.2, the Command Language and Utilities standard.
Range (of input lines)
A sequence of consecutive lines from the input file. A pattern can specify ranges of input lines for awk to process, or it can specify single lines. See Chapter 6 [Patterns], page 47.
Recursion
When a function calls itself, either directly or indirectly. If this isn't clear, refer to the entry for "recursion".
Redirection
Redirection means performing input from other than the standard input stream, or output to other than the standard output stream. You can redirect the output of the print and printf statements to a file or a system command, using the >, >>, and | operators. You can redirect input to the getline statement using the < and | operators. See Section 4.6 [Redirecting Output of print and printf], page 42.
Regular Expression
See "regexp".
Regexp
Short for regular expression. A regexp is a pattern that denotes a set of strings, possibly an infinite set. For example, the regexp 'R.*xp' matches any string starting with the letter R and ending with the letters xp. In awk, regexps are used in patterns and in conditional expressions. Regexps may contain escape sequences. See Section 6.2 [Regular Expressions as Patterns], page 47.
Rule
A segment of an awk program that specifies how to process single input records. A rule consists of a pattern and an action. awk reads an input record; then, for each rule, if the input record satisfies the rule's pattern, awk executes the rule's action. Otherwise, the rule does nothing for that input record.
Side Effect
A side effect occurs when an expression has an effect aside from merely producing a value. Assignment expressions, increment expressions and function calls have side effects. See Section 8.7 [Assignment Expressions], page 64.
Special File
A file name interpreted internally by awk, instead of being handed directly to the underlying operating system. For example, /dev/stdin. See Section 4.7 [Standard I/O Streams], page 44.
Stream Editor
A program that reads records from an input stream and processes them one or more at a time. This is in contrast with batch programs, which may expect to read their input files in entirety before starting to do anything, and with interactive programs, which require input from the user.
String
A datum consisting of a sequence of characters, such as "I am a string". Constant strings are written with double-quotes in the awk language, and may contain escape sequences. See Section 8.1 [Constant Expressions], page 57.
Whitespace
A sequence of blank or tab characters occurring inside an input record or a string.
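Several of the glossary entries above (Regexp, Pattern, Rule, Lvalue, Recursion, Format, and Redirection) may be easier to follow with a short session. The following is only a sketch, not taken from the manual: it assumes a POSIX shell and an awk that supports CONVFMT, and the shell function names (demo_regexp and so on) are hypothetical names chosen for illustration.

```shell
#!/bin/sh
# Illustrative sketches only; the demo_* function names are hypothetical.

# Regexp, Pattern, Rule: the pattern /R.*xp/ selects records, and the
# action { print } runs for each record that matches.
demo_regexp() {
    printf 'Regexp\nRule\nRange xp\n' | awk '/R.*xp/ { print }'
}

# Lvalue: a field designator may appear on the left side of an assignment.
demo_lvalue() {
    echo 'a b c' | awk '{ $2 = "X"; print }'
}

# Recursion: fact() calls itself until n reaches 1.
demo_recursion() {
    awk 'function fact(n) { return (n <= 1) ? 1 : n * fact(n - 1) }
         BEGIN { print fact(5) }'
}

# Format: CONVFMT controls the conversion of a number to a string,
# which is forced here by concatenating x with the empty string.
demo_convfmt() {
    awk 'BEGIN { CONVFMT = "%.2f"; x = 3.14159; s = x ""; print s }'
}

# Redirection: the output of print is piped to a system command.
demo_redirection() {
    awk 'BEGIN { print "b" | "sort"; print "a" | "sort" }'
}

demo_regexp; demo_lvalue; demo_recursion; demo_convfmt; demo_redirection
```

Run as written, the last line prints Regexp, Range xp, a X c, 120, 3.14, a, and b, one per line. Each function is kept separate so that a single entry can be tried on its own.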
Index
#
#, 18
#!, 18
$
$ (field operator), 22
A
accessing fields, 22
acronym, 1
action, curly braces, 55
action, default, 13
action, definition of, 55
action, separating statements, 55
addition, 60
and operator, 64
applications of awk, 20
ARGC, 102
arguments in function call, 70
arguments, command line, 105
ARGV, 102, 105
arithmetic operators, 60
array assignment, 83
array reference, 82
arrays, 81
arrays, definition of, 81
arrays, deleting an element, 85
arrays, multi-dimensional subscripts, 86
arrays, presence of elements, 82
arrays, special for statement, 84
assignment operators, 64
assignment to fields, 24
associative arrays, 81
awk language, 11
awk program, 11
B
backslash continuation, 19
basic function of awk, 13
BBS-list file, 11
BEGIN special pattern, 53
body of a loop, 74
boolean expressions, 64
boolean operators, 64
boolean patterns, 51
break statement, 76
buffering output, 94
buffers, flushing, 94
built-in functions, 89
built-in variables, 101
built-in variables, user modifiable, 101
C
call by reference, 97
call by value, 97
calling a function, 70
case sensitivity, 17
changing contents of a field, 24
changing the record separator, 21
close, 33, 43
closing input files and pipes, 33
closing output files and pipes, 43
command line, 105
command line formats, 16
command line, setting FS on, 26
comments, 18
comparison expressions, 62
comparison expressions as patterns, 51
computed regular expressions, 48
concatenation, 61
conditional expression, 69
constants, types of, 57
continuation of lines, 19
continue statement, 77
control statement, 73
conversion of strings and numbers, 67, 68
conversions, during subscripting, 86
CONVFMT, 62, 67, 86, 101
curly braces, 55
D
default action, 13
default pattern, 13
defining functions, 95
delete statement, 85
deleting elements of arrays, 85
differences between gawk and awk, 57, 61
differences: gawk and awk, 44
division, 60
documenting awk programs, 18
dynamic regular expressions, 48
E
element assignment, 83
element of array, 82
empty pattern, 54
END special pattern, 53
ENVIRON, 102
escape sequence notation, 57
examining fields, 22
executable scripts, 18
exit statement, 79
explicit input, 30
exponentiation, 60
expression, 57
expression, conditional, 69
expressions, assignment, 64
expressions, boolean, 64
expressions, comparison, 62
F
field separator, choice of, 26
field separator, FS, 25
field separator: on command line, 26
field, changing contents of, 24
fields, 22
fields, separating, 25
file descriptors, 44
file, awk program, 17
FILENAME, 21, 102
flushing buffers, 94
FNR, 22, 103
for (x in ...), 84
for statement, 75
format specifier, 38
format string, 38
formatted output, 38
FS, 25, 101
function call, 70
function definition, 95
functions, user-defined, 95
G
getline, 30
gsub, 93
H
history of awk, 1
how awk works, 14
I
if statement, 73
increment operators, 66
input, 21
input file, sample, 11
input redirection, 31
input, explicit, 30
input, getline command, 30
input, multiple line records, 29
input, standard, 16
interaction, awk and other programs, 93
inventory-shipped file, 12
invocation of awk, 105
L
language, awk, 11
length, 90
logical operations, 64
long options, 105
loop, 74
loops, exiting, 76
lvalue, 65
M
manual, using this, 11
match, 90, 91
metacharacters, 48
modifiers (in format specifiers), 39
multi-dimensional subscripts, 86
multiple line records, 29
multiple passes over data, 106
multiple statements on one line, 20
multiplication, 60
N
next statement, 78
NF, 22, 103
not operator, 64
NR, 22, 103
number of fields, NF, 22
number of records, NR or FNR, 22
numbers, used as subscripts, 86
numeric constant, 57
numeric value, 57
O
OFMT, 37, 68, 101
OFS, 37, 101
one-liners, 45
operator precedence, 71
operators, $, 22
operators, arithmetic, 60
operators, assignment, 64
operators, boolean, 64
operators, increment, 66
operators, regexp matching, 47
operators, relational, 51, 62
operators, string, 61
operators, string-matching, 47
options, command line, 105
options, long, 105
or operator, 64
ORS, 37, 101
output, 35
output field separator, OFS, 37
output record separator, ORS, 37
output redirection, 42
output, buffering, 94
output, formatted, 38
output, piping, 42
P
passes, multiple, 106
pattern, case sensitive, 17
pattern, comparison expressions, 51
pattern, default, 13
pattern, definition of, 47
pattern, empty, 54
pattern, regular expressions, 47
patterns, BEGIN, 53
patterns, boolean, 51
patterns, END, 53
patterns, range, 53
patterns, types of, 47
pipes for output, 42
precedence, 71
print $0, 13
print statement, 35
printf statement, syntax of, 38
printf, format-control characters, 38
printf, modifiers, 39
printing, 35
program file, 17
program, awk, 11
program, definition of, 13
program, self contained, 18
programs, documenting, 18
Q
quotient, 60
R
range pattern, 53
reading files, 21
reading files, getline command, 30
reading files, multiple line records, 29
record separator, 21
records, multiple line, 29
redirection of input, 31
redirection of output, 42
reference to array, 82
regexp, 47
regexp as expression, 63
regexp operators, 62
regexp search operators, 47
regular expression matching operators, 47
regular expression metacharacters, 48
regular expressions as field separators, 26
regular expressions as patterns, 47
regular expressions, computed, 48
relational operators, 51, 62
remainder, 60
removing elements of arrays, 85
return statement, 98
RLENGTH, 91, 103
RS, 21, 101
RSTART, 91, 103
rule, definition of, 13
running awk programs, 16
running long programs, 17
S
sample input file, 11
scanning an array, 84
script, definition of, 13
scripts, executable, 18
scripts, shell, 18
self contained programs, 18
shell scripts, 18
side effect, 65
single quotes, why needed, 16
split, 91
sprintf, 92
standard error output, 44
standard input, 16, 21, 44
standard output, 44
string constants, 57
string operators, 61
string-matching operators, 47
sub, 92
subscripts in arrays, 86
SUBSEP, 86, 101
substr, 93
subtraction, 60
system, 93
T
tolower, 93
toupper, 93
U
use of comments, 18
user-defined functions, 95
user-defined variables, 59
V
variables, user-defined, 59
W
what is awk, 1
when to use awk, 20
while statement, 74
Short Contents
Preface ..... 1
GNU GENERAL PUBLIC LICENSE ..... 3
1 Using this Manual ..... 11
2 Getting Started with awk ..... 13
3 Reading Input Files ..... 21
4 Printing Output ..... 35
5 Useful One-liners ..... 45
6 Patterns ..... 47
7 Overview of Actions ..... 55
8 Expressions as Action Statements ..... 57
9 Control Statements in Actions ..... 73
10 Arrays in awk ..... 81
11 Built-in Functions ..... 89
12 User-defined Functions ..... 95
13 Built-in Variables ..... 101
14 Invoking awk ..... 105
Appendix A awk Summary ..... 107
Appendix B Sample Program ..... 119
Appendix C Glossary ..... 121
Index ..... 125
Table of Contents
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
History of awk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Preamble . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 How to Apply These Terms to Your New Programs . . . . . . . . . . . . . . . . . . . 8
1 2
11
Printing Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.1 4.2 4.3 4.4 4.5 The print Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Examples of print Statements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Output Separators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Controlling Numeric Output with print . . . . . . . . . . . . . . . . . . . . . . . Using printf Statements for Fancier Printing . . . . . . . . . . . . . . . . . . 4.5.1 Introduction to the printf Statement . . . . . . . . . . . . . . . . . 4.5.2 Format-Control Letters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.5.3 Modiers for printf Formats . . . . . . . . . . . . . . . . . . . . . . . . . 4.5.4 Examples of Using printf . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.6 Redirecting Output of print and printf . . . . . . . . . . . . . . . . . . . . . . . 4.6.1 Redirecting Output to Files and Pipes . . . . . . . . . . . . . . . . . 4.6.2 Closing Output Files and Pipes . . . . . . . . . . . . . . . . . . . . . . . 4.7 Standard I/O Streams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 35 37 37 38 38 38 39 40 42 42 43 44
iv
5 6
47
47 47 47 48 50 51 51 52 53 53 54
7 8
Overview of Actions . . . . . . . . . . . . . . . . . . . . . . . . . .
8.1 Constant Expressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.2 Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.2.1 Assigning Variables on the Command Line . . . . . . . . . . . . . 8.3 Arithmetic Operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.4 String Concatenation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.5 Comparison Expressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.6 Boolean Expressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.7 Assignment Expressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.8 Increment Operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.9 Conversion of Strings and Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.10 Numeric and String Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.11 Conditional Expressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.12 Function Calls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.13 Operator Precedence (How Operators Nest) . . . . . . . . . . . . . . . . . . .
55
57 59 60 60 61 62 64 64 66 67 68 69 70 71
10
Arrays in awk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
10.1 10.2 10.3 10.4 10.5 10.6 10.7 10.8 10.9 Introduction to Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Referring to an Array Element . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Assigning Array Elements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Basic Example of an Array . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Scanning all Elements of an Array . . . . . . . . . . . . . . . . . . . . . . . . . . . . The delete Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Using Numbers to Subscript Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . Multi-dimensional Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Scanning Multi-dimensional Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 82 83 83 84 85 86 86 88
11 Built-in Functions
  11.1 Calling Built-in Functions
  11.2 Numeric Built-in Functions
  11.3 Built-in Functions for String Manipulation
  11.4 Built-in Functions for Input/Output
12 User-defined Functions
  12.1 Syntax of Function Definitions
  12.2 Function Definition Example
  12.3 Calling User-defined Functions
  12.4 The return Statement
13
14
Appendix A awk Summary
  A.1 Command Line Options Summary
  A.2 Language Summary
  A.3 Variables and Fields
    A.3.1 Fields
    A.3.2 Built-in Variables
    A.3.3 Arrays
    A.3.4 Data Types
  A.4 Patterns and Actions
    A.4.1 Patterns
    A.4.2 Regular Expressions
    A.4.3 Actions
      A.4.3.1 Operators
      A.4.3.2 Control Statements
      A.4.3.3 I/O Statements
      A.4.3.4 printf Summary
      A.4.3.5 Numeric Functions
      A.4.3.6 String Functions
      A.4.3.7 String Constants
  A.5 Functions
Appendix B Sample Program
Appendix C
Index