A Practical Guide To Learning GNU Awk

This document provides an introduction to the GNU Awk (gawk) programming language. It discusses what awk is, its origins and capabilities for processing text files. Awk can search files for patterns and perform actions on matching lines. While often used for one-line commands, it is also a full programming language. The document recommends resources for learning awk and provides an overview of the chapters which teach awk syntax and programming techniques through tutorials and examples.

Opensource.com

A practical guide to learning GNU Awk

ABOUT OPENSOURCE.COM

What is Opensource.com?

OPENSOURCE.COM publishes stories about creating, adopting, and sharing open source solutions. Visit Opensource.com to learn more about how the open source way is improving technologies, education, business, government, health, law, entertainment, humanitarian efforts, and more.

Submit a story idea: opensource.com/story

Email us: [email protected]

2 A PRACTICAL GUIDE TO LEARNING GNU AWK . CC BY-SA 4.0 . OPENSOURCE.COM


ABOUT THE AUTHORS

AUTHORS SETH KENLON, DAVE MORRISS, AND ROBERT YOUNG

SETH KENLON is an independent multimedia artist, free culture advocate, and UNIX geek. He has worked in the film and computing industry, often at the same time. He is one of the maintainers of the Slackware-based multimedia production project, http://slackermedia.info.

DAVE MORRISS is a retired IT Manager now contributing to the “Hacker Public Radio” community podcast (http://hackerpublicradio.org) as a podcast host and an administrator.

ROBERT YOUNG is the Owner and Principal Consultant at Lab Insights, LLC. He has led dozens of laboratory informatics and data management projects over the last 10 years. Robert holds a degree in Cell Biology/Biochemistry and a master’s in Bioinformatics.

CONTRIBUTORS

Jim Hall
Lazarus Lazaridis
Dave Neary
Moshe Zadka



CONTENTS

CHAPTERS

LEARN

What is awk?
Getting started with awk, a powerful text-parsing tool
Fields, records, and variables in awk
A guide to intermediate awk scripting
How to use loops in awk
How to use regular expressions in awk
4 ways to control the flow of your awk script

PRACTICE

Advance your awk skills with two easy tutorials
How to remove duplicate lines from files with awk
Awk one-liners and scripts to help you sort text files
A gawk script to convert smart quotes
Drinking coffee with AWK

CHEAT SHEET

GNU awk cheat sheet




What is awk?
awk is known for its robust ability to process and interpret data from text files.

AWK is a programming language and a POSIX [1] specification that originated at AT&T Bell Laboratories in 1977. Its name comes from the initials of its designers: Aho, Weinberger, and Kernighan. awk features user-defined functions, multiple input streams, TCP/IP networking access, and a rich set of regular expressions. It’s often used to process raw text files, interpreting the data it finds as records and fields to be manipulated by the user.

At its most basic, awk searches files for some unit of text (usually lines terminated with an end-of-line character) containing some user-specified pattern. When a line matches one of the patterns, awk performs some set of user-defined actions on that line, then processes input lines until the end of the input files.

awk is used as a command as often as it is used as an interpreted script. One-liners are popular and useful ways of filtering output from files or output streams or as standalone commands. awk even has an interactive mode of sorts because, without input, it acts upon any line the user types into the terminal:

$ awk '/foo/ { print toupper($0); }'
This line contains bar.
This line contains foo.
THIS LINE CONTAINS FOO.

However, awk is a programming language with user-defined functions, loops, conditionals, flow control, and more. It’s robust enough as a language that it has been used to program a wiki and even (believe it or not) a retargetable assembler for eight-bit microprocessors.

Why use awk?
awk may seem outdated in a world fortunate enough to have Python available by default on several major operating systems, but its longevity is well-earned. In many ways, programs written in awk are different from programs in other languages because awk is data-driven. That is, you describe to awk what data you want to work with and then what you want it to do when such data is found. There are no boilerplate constructors to create, no elaborate class structure to design, no stream objects to create. awk is built for a specific purpose, so there’s a lot you can take for granted and allow awk to handle.

What’s the difference between awk and gawk?
Awk is an open source POSIX specification, so anyone can (in theory) implement a version of the command and language. On Linux or any system that provides GNU awk [2], the command to invoke awk is gawk, but it’s symlinked to the generic command awk. The same is true for systems that provide nawk or mawk or any other variety of awk implementation. Most versions of awk implement the core functionality and literal functions defined by the POSIX spec, although they may add special new features not present in others. For that reason, there’s some risk of learning one implementation and coming to rely on a special feature, but this “problem” is tempered by the fact that most of them are open source, so they usually can be installed as needed.

Learning awk
There are many great resources for learning awk. The GNU awk manual, GAWK: Effective awk programming [3], is a definitive guide to the language. You can find many other tutorials for awk [4] on Opensource.com, including “Getting started with awk, a powerful text-parsing tool.” [5]

Links
[1] https://opensource.com/article/19/7/what-posix-richard-stallman-explains
[2] https://www.gnu.org/software/gawk/
[3] https://www.gnu.org/software/gawk/manual/
[4] https://opensource.com/sitewide-search?search_api_views_fulltext=awk
[5] https://opensource.com/article/19/10/intro-awk




Getting started with awk, a powerful text-parsing tool
Let’s jump in and start using it.

AWK IS A POWERFUL text-parsing tool for Unix and Unix-like systems, but because it has programmed functions that you can use to perform common parsing tasks, it’s also considered a programming language. You probably won’t be developing your next GUI application with awk, and it likely won’t take the place of your default scripting language, but it’s a powerful utility for specific tasks.

What those tasks may be is surprisingly diverse. The best way to discover which of your problems might be best solved by awk is to learn awk; you’ll be surprised at how awk can help you get more done but with a lot less effort.

Awk’s basic syntax is:

awk [options] 'pattern {action}' file

To get started, create this sample file and save it as colours.txt:

name color amount
apple red 4
banana yellow 6
strawberry red 3
grape purple 10
apple green 8
plum purple 2
kiwi brown 4
potato brown 9
pineapple yellow 5

This data is separated into columns by one or more spaces. It’s common for data that you are analyzing to be organized in some way. It may not always be columns separated by whitespace, or even a comma or semicolon, but especially in log files or data dumps, there’s generally a predictable pattern. You can use patterns of data to help awk extract and process the data that you want to focus on.

Printing a column
In awk, the print function displays whatever you specify. There are many predefined variables you can use, but some of the most common are integers designating columns in a text file. Try it out:

$ awk '{print $2;}' colours.txt
color
red
yellow
red
purple
green
purple
brown
brown
yellow

In this case, awk displays the second column, denoted by $2. This is relatively intuitive, so you can probably guess that print $1 displays the first column, and print $3 displays the third, and so on.

To display all columns, use $0.

The number after the dollar sign ($) is an expression, so $2 and $(1+1) mean the same thing.

Conditionally selecting columns
The example file you’re using is very structured. It has a row that serves as a header, and the columns relate directly to one another. By defining conditional requirements, you can qualify what you want awk to return when looking at this data. For instance, to view items in column 2 that match “yellow” and print the contents of column 1:

$ awk '$2=="yellow"{print $1}' colours.txt
banana
pineapple


Regular expressions work as well. This conditional looks at $2 for approximate matches to the letter p followed by any number of (one or more) characters, which are in turn followed by the letter p:

$ awk '$2 ~ /p.+p/ {print $0}' colours.txt
grape purple 10
plum purple 2

Numbers are interpreted naturally by awk. For instance, to print any row with a third column containing an integer greater than 5:

$ awk '$3>5 {print $1, $2}' colours.txt
name color
banana yellow
grape purple
apple green
potato brown

The header row is printed too: its third field (amount) is not a number, so awk compares it with 5 as a string, and that comparison happens to be true.

Field separator
By default, awk uses whitespace as the field separator. Not all text files use whitespace to define fields, though. For example, create a file called colours.csv with this content:

name,color,amount
apple,red,4
banana,yellow,6
strawberry,red,3
grape,purple,10
apple,green,8
plum,purple,2
kiwi,brown,4
potato,brown,9
pineapple,yellow,5

Awk can treat the data in exactly the same way, as long as you specify which character it should use as the field separator in your command. Use the --field-separator (or just -F for short) option to define the delimiter:

$ awk -F"," '$2=="yellow" {print $1}' colours.csv
banana
pineapple

Saving output
Using output redirection, you can write your results to a file. For example:

$ awk -F, '$3>5 {print $1, $2}' colours.csv > output.txt

This creates a file with the contents of your awk query.

You can also split a file into multiple files grouped by column data. For example, if you want to split colours.txt into multiple files according to what color appears in each row, you can cause awk to redirect per query by including the redirection in your awk statement. The parentheses make sure the file name is built from the whole expression, which not every awk implementation does without them:

$ awk '{print > ($2 ".txt")}' colours.txt

This produces files named yellow.txt, red.txt, and so on (plus color.txt for the header row).
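
The header row appears in the $3>5 output above because its third field is the word amount rather than a number, so awk falls back to comparing strings. Here is a minimal sketch of one way around that, assuming the colours.csv sample above and using awk’s built-in record counter NR (covered in the next article) to skip the first line. The scratch directory is only for illustration:

```shell
# Recreate the CSV sample in a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/colours.csv" <<'EOF'
name,color,amount
apple,red,4
banana,yellow,6
strawberry,red,3
grape,purple,10
apple,green,8
plum,purple,2
kiwi,brown,4
potato,brown,9
pineapple,yellow,5
EOF

# NR is the number of the current record, so NR > 1 skips the header line.
# The result is saved with ordinary shell redirection.
awk -F, 'NR > 1 && $3 > 5 {print $1, $2}' "$workdir/colours.csv" > "$workdir/output.txt"

cat "$workdir/output.txt"
```

Only rows whose amount is numerically greater than 5 should end up in output.txt.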


Fields, records, and variables in awk
In the second article in this intro to awk series, learn about fields, records, and some powerful awk variables.

AWK COMES in several varieties: There is the original awk, written in 1977 at AT&T Bell Laboratories, and several reimplementations, such as mawk, nawk, and the one that ships with most Linux distributions, GNU awk, or gawk. On most Linux distributions, awk and gawk are synonyms referring to GNU awk, and typing either invokes the same awk command. See the GNU awk user’s guide [1] for the full history of awk and gawk.

The first article in this series showed that awk is invoked on the command line with this syntax:

$ awk [options] 'pattern {action}' inputfile

Awk is the command, and it can take options (such as -F to define the field separator). The action you want awk to perform is contained in single quotes, at least when it’s issued in a terminal. To further emphasize which part of the awk command is the action you want it to take, you can precede your program with the -e option, which is specific to GNU awk (but it’s not required):

$ awk -F, -e '{print $2;}' colours.csv
color
red
yellow
[...]

Records and fields
Awk views its input data as a series of records, which are usually newline-delimited lines. In other words, awk generally sees each line in a text file as a new record. Each record contains a series of fields. A field is a component of a record delimited by a field separator.

By default, awk sees whitespace, such as spaces, tabs, and newlines, as indicators of a new field. Specifically, awk treats multiple space separators as one, so this line contains two fields:

raspberry red

As does this one:

tuxedo                  black

Other separators are not treated this way. Assuming that the field separator is a comma, the following example record contains three fields, with one probably being zero characters long (assuming a non-printable character isn’t hiding in that field):

a,,b

The awk program
The program part of an awk command consists of a series of rules. Normally, each rule begins on a new line in the program (although this is not mandatory). Each rule consists of a pattern and one or more actions:

pattern { action }

In a rule, you can define a pattern as a condition to control whether the action will run on a record. Patterns can be simple comparisons, regular expressions, combinations of the two, and more.

For instance, this will print a record only if it contains the word “raspberry”:

$ awk '/raspberry/ { print $0 }' colours.txt
raspberry red 99

If there is no qualifying pattern, the action is applied to every record.
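
To make the role of the pattern concrete, here is a small sketch that pipes a three-line sample into awk with printf instead of reading a real file (the sample and the variable names are made up for this illustration). The first command uses a pattern to gate the action; the second has no pattern, so its action runs on every record:

```shell
sample='apple red 4
raspberry red 99
banana yellow 6'

# Pattern and action: only records matching /raspberry/ are printed.
matched=$(printf '%s\n' "$sample" | awk '/raspberry/ { print $0 }')

# Action only: the first field of every record is printed.
all_names=$(printf '%s\n' "$sample" | awk '{ print $1 }')

printf '%s\n' "$matched"
printf '%s\n' "$all_names"
```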


Also, a rule can consist of only a pattern, in which case the entire record is written as if the action was { print }.

Awk programs are essentially data-driven in that actions depend on the data, so they are quite a bit different from programs in many other programming languages.

The NF variable
Each field has a variable as a designation, but there are special variables for fields and records, too. The variable NF stores the number of fields awk finds in the current record. This can be printed or used in tests. Here is an example using the text file [2] from the previous article:

$ awk '{ print $0 " (" NF ")" }' colours.txt
name color amount (3)
apple red 4 (3)
banana yellow 6 (3)
[...]

Awk’s print function takes a series of arguments (which may be variables or strings). Arguments written side by side, with no comma between them, are concatenated into a single string. This is why, at the end of each line in this example, awk prints the number of fields as an integer enclosed by parentheses.

The NR variable
In addition to counting the fields in each record, awk also counts input records. The record number is held in the variable NR, and it can be used in the same way as any other variable. For example, to print the record number before each line:

$ awk '{ print NR ": " $0 }' colours.txt
1: name color amount
2: apple red 4
3: banana yellow 6
4: raspberry red 3
5: grape purple 10
[...]

Note that it’s acceptable to write this command with no spaces other than the one after print, although it’s more difficult for a human to parse:

$ awk '{print NR": "$0}' colours.txt

The printf() function
For greater flexibility in how the output is formatted, you can use the awk printf() function. This is similar to printf in C, Lua, Bash, and other languages. It takes a format argument followed by a comma-separated list of items. The argument list may be enclosed in parentheses.

printf format, item1, item2, ...

The format argument (or format string) defines how each of the other arguments will be output. It uses format specifiers to do this, including %s to output a string and %d to output a decimal number. The following printf statement outputs the record followed by the number of fields in parentheses:

$ awk '{printf "%s (%d)\n",$0,NF}' colours.txt
name color amount (3)
apple red 4 (3)
banana yellow 6 (3)
[...]

In this example, %s (%d) provides the structure for each line, while $0,NF defines the data to be inserted into the %s and %d positions. Note that, unlike with the print function, no newline is generated without explicit instructions. The escape sequence \n does this.

Awk scripting
All of the awk code in this article has been written and executed in an interactive Bash prompt. For more complex programs, it’s often easier to place your commands into a file or script. The option -f FILE (not to be confused with -F, which denotes the field separator) may be used to invoke a file containing a program.

For example, here is a simple awk script. Create a file called example1.awk with this content:

/^a/ {print "A: " $0}
/^b/ {print "B: " $0}

It’s conventional to give such files the extension .awk to make it clear that they hold an awk program. This naming is not mandatory, but it gives file managers and editors (and you) a useful clue about what the file is.

Run the script:

$ awk -f example1.awk colours.txt
A: apple red 4
B: banana yellow 6
A: apple green 8

A file containing awk instructions can be made into a script by adding a #! line at the top and making it executable. Create a file called example2.awk with these contents:

#!/usr/bin/awk -f
#
# Print all but line 1 with the line number on the front
#

NR > 1 {
    printf "%d: %s\n",NR,$0
}


Arguably, there’s no advantage to having just one line in a script, but sometimes it’s easier to execute a script than to remember and type even a single line. A script file also provides a good opportunity to document what a command does. Lines starting with the # symbol are comments, which awk ignores.

Grant the file executable permission:

$ chmod u+x example2.awk

Run the script:

$ ./example2.awk colours.txt
2: apple red 4
3: banana yellow 6
4: raspberry red 3
5: grape purple 10
[...]

An advantage of placing your awk instructions in a script file is that it’s easier to format and edit. While you can write awk on a single line in your terminal, it can get overwhelming when it spans several lines.

Try it
You now know enough about how awk processes your instructions to be able to write a complex awk program. Try writing an awk script with more than one rule and at least one conditional pattern. If you want to try more functions than just print and printf, refer to the gawk manual [3] online.

Here’s an idea to get you started:

#!/usr/bin/awk -f
#
# Print each record EXCEPT
# IF the first field is "raspberry",
# THEN replace "red" with "pi"

$1 == "raspberry" {
    gsub(/red/,"pi")
}

{ print }

Try this script to see what it does, and then try to write your own.

Links
[1] https://www.gnu.org/software/gawk/manual/html_node/History.html#History
[2] https://opensource.com/article/19/10/intro-awk
[3] https://www.gnu.org/software/gawk/manual/


A guide to intermediate awk scripting
Learn how to structure commands into executable scripts.

THIS ARTICLE explores awk’s capabilities, which are easier to use now that you know how to structure your command into an executable script.

Logical operators and conditionals
You can use the logical operators and (written &&) and or (written ||) to add specificity to your conditionals.

For example, to select and print only records with the string “purple” in the second column and an amount less than five in the third column:

$2 == "purple" && $3 < 5 {print $1}

If a record has “purple” in column two but a value greater than or equal to 5 in column three, then it is not selected. Similarly, if a record matches column three’s requirement but lacks “purple” in column two, it is also not selected.

Next command
Say you want to select every record in your file where the amount is greater than or equal to eight and print a matching record with two asterisks (**). You also want to flag every record with a value between five (inclusive) and eight with only one asterisk (*). There are a few ways to do this, and one way is to use the next command to instruct awk that after it takes an action, it should stop scanning and proceed to the next record.

Here’s an example:

NR == 1 {
    print $0;
    next;
}

$3 >= 8 {
    printf "%s\t%s\n", $0, "**";
    next;
}

$3 >= 5 {
    printf "%s\t%s\n", $0, "*";
    next;
}

$3 < 5 {
    print $0;
}

BEGIN command
The BEGIN command lets you print and set variables before awk starts scanning a text file. For instance, you can set the input and output field separators inside your awk script by defining them in a BEGIN statement. This example adapts the simple script from the previous article for a file with fields delimited by commas instead of whitespace:

#!/usr/bin/awk -f
#
# Print each record EXCEPT
# IF the first field is "raspberry",
# THEN replace "red" with "pi"

BEGIN {
    FS=",";
}

$1 == "raspberry" {
    gsub(/red/,"pi")
}

{ print }

END command
The END command, like BEGIN, allows you to perform actions in awk after it completes its scan through the text


file you are processing. If you want to print cumulative results of some value in all records, you can do that only after all records have been scanned and processed.

The BEGIN and END commands run only once each. All rules between them run zero or more times on each record. In other words, most of your awk script is a loop that is executed at every new line of the text file you’re processing, with the exception of the BEGIN and END rules, which run before and after the loop.

Here is an example that wouldn’t be possible without the END command. This script accepts values from the output of the df Unix command and increments two custom variables (used and available) with each new record. It skips records whose first field is tmpfs, the name df shows for in-memory filesystems on Linux.

$1 != "tmpfs" {
    used += $3;
    available += $4;
}

END {
    printf "%d GiB used\n%d GiB available\n",
        used/2^20, available/2^20;
}

Save the script as total.awk and try it:

df -l | awk -f total.awk

The used and available variables act like variables in many other programming languages. You create them arbitrarily and without declaring their type, and you add values to them at will. As the loop runs, the script adds up the values in the respective columns, and at the end of the loop it prints the totals.

Math
As you can probably tell from all the logical operators and casual calculations so far, awk does math quite naturally. This arguably makes it a very useful calculator for your terminal. Instead of struggling to remember the rather unusual syntax of bc, you can just use awk along with its special BEGIN function to avoid the requirement of a file argument:

$ awk 'BEGIN { print 2*21 }'
42
$ awk 'BEGIN {print 8*log(4) }'
11.0904

Admittedly, that’s still a lot of typing for simple (and not so simple) math, but it wouldn’t take much effort to write a frontend, which is an exercise for you to explore.
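
One possible shape for such a frontend, offered only as a sketch: a small shell function (the name calc is made up for this example) that hands its arguments to awk as an expression inside a BEGIN rule. Because the arguments are pasted straight into the awk program text, it is only suitable for expressions you typed yourself.

```shell
# calc: evaluate an arithmetic expression with awk.
# The expression becomes part of the awk program, so quote it on the
# command line to keep the shell from expanding characters such as *.
calc() {
    awk "BEGIN { print $* }"
}

calc '2*21'
calc '8*log(4)'
```

The two calls repeat the calculations shown above, so they should print the same results.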


How to use loops in awk
Learn how to use different types of loops to run commands on a record multiple times.

AWK SCRIPTS have three main sections: the optional BEGIN and END rules and the rules you write that are executed on each record. In a way, the main body of an awk script is a loop, because the commands in the rules run for each record. However, sometimes you want to run commands on a record more than once, and for that to happen, you must write a loop.

There are several kinds of loops, each serving a unique purpose.

While loop
A while loop tests a condition and performs commands while the test returns true. Once a test returns false, the loop is broken.

#!/usr/bin/awk -f

BEGIN {
    # Loop through 1 to 10

    i=1;
    while (i <= 10) {
        print i, " to the second power is ", i*i;
        i = i+1;
    }
    exit;
}

In this simple example, awk prints the square of whatever integer is contained in the variable i. The while (i <= 10) phrase tells awk to perform the loop only as long as the value of i is less than or equal to 10. After the final iteration (while i is 10), the loop ends.

Do while loop
The do while loop performs commands after the keyword do. It performs a test afterward to determine whether the stop condition has been met, so the commands always run at least once. The commands are repeated only while the test returns true (that is, the end condition has not been met). If a test fails, the loop is broken because the end condition has been met.

#!/usr/bin/awk -f
BEGIN {

    i=2;
    do {
        print i, " to the second power is ", i*i;
        i = i + 1
    }
    while (i < 10)

    exit;
}

For loops
There are two kinds of for loops in awk.

One kind of for loop initializes a variable, performs a test, and increments the variable together, performing commands while the test is true.

#!/usr/bin/awk -f

BEGIN {
    for (i=1; i <= 10; i++) {
        print i, " to the second power is ", i*i;
    }
    exit;
}

Another kind of for loop sets a variable to successive indices of an array, performing a collection of commands for each index. In other words, it uses an array to “collect” data from a record.


This example implements a simplified version of the Unix command uniq. By adding a list of strings into an array called a as a key and incrementing the value each time the same key occurs, you get a count of the number of times a string appears (like the --count option of uniq). If you print the keys of the array, you get every string that appears one or more times.

For example, using the demo file colours.txt (from the previous articles):

name color amount
apple red 4
banana yellow 6
raspberry red 99
strawberry red 3
grape purple 10
apple green 8
plum purple 2
kiwi brown 4
potato brown 9
pineapple yellow 5

Here is a simple version of uniq -c in awk form:

#! /usr/bin/awk -f

NR != 1 {
    a[$2]++
}
END {
    for (key in a) {
        print a[key] " " key
    }
}

The third column of the sample data file contains the number of items listed in the first column. You can use an array and a for loop to tally the items in the third column by color:

#! /usr/bin/awk -f

BEGIN {
    FS=" ";
    OFS="\t";
    print("color\tsum");
}
NR != 1 {
    a[$2]+=$3;
}
END {
    for (b in a) {
        print b, a[b]
    }
}

As you can see, you are also printing a header row in the BEGIN rule (which always happens only once) prior to processing the file.

Loops
Loops are a vital part of any programming language, and awk is no exception. Using loops can help you control how your awk script runs, what information it’s able to gather, and how it processes your data. Our next article will cover switch statements, continue, and next.
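
To see the color tally from this article run against the sample data, here is a sketch that passes the same two rules to awk inline. The order in which for (b in a) visits the keys is not specified, so the output is piped through sort to make it repeatable. The scratch directory and the variable names are only for illustration:

```shell
# Recreate the demo file in a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/colours.txt" <<'EOF'
name color amount
apple red 4
banana yellow 6
raspberry red 99
strawberry red 3
grape purple 10
apple green 8
plum purple 2
kiwi brown 4
potato brown 9
pineapple yellow 5
EOF

# Sum the third column per color, then sort the lines by color name.
totals=$(awk 'NR != 1 { a[$2] += $3 }
              END { for (b in a) print b, a[b] }' "$workdir/colours.txt" | sort)

printf '%s\n' "$totals"
```

Each output line holds a color and the sum of its amounts; red, for example, adds up 4, 99, and 3.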


How to use regular expressions in awk
Use regex to search code using dynamic and complex pattern definitions.

IN AWK, regular expressions (regex) allow for dynamic and complex pattern definitions. You’re not limited to searching for simple strings but also patterns within patterns.

The syntax for using regular expressions to match lines in awk is:

word ~ /match/

The inverse of that is not matching a pattern:

word !~ /match/

If you haven’t already, create the sample file from our previous article:

name color amount
apple red 4
banana yellow 6
strawberry red 3
raspberry red 99
grape purple 10
apple green 8
plum purple 2
kiwi brown 4
potato brown 9
pineapple yellow 5

Save the file as colours.txt and run:

$ awk -e '$1 ~ /p[el]/ {print $0}' colours.txt
apple red 4
grape purple 10
apple green 8
plum purple 2
pineapple yellow 5

You have selected all records whose first field contains the letter p followed by either an e or an l.

Adding an o inside the square brackets creates a new pattern to match:

$ awk -e '$1 ~ /p[elo]/ {print $0}' colours.txt
apple red 4
grape purple 10
apple green 8
plum purple 2
potato brown 9
pineapple yellow 5

Regular expression basics
Certain characters have special meanings when they’re used in regular expressions.

Anchors

Anchor   Function
^        Indicates the beginning of the string being tested (the record or a field)
$        Indicates the end of the string being tested
\`       Denotes the beginning of a string (GNU awk)
\'       Denotes the end of a string (GNU awk)
\y       Marks a word boundary (GNU awk; in awk, \b means a backspace character)

For example, this awk command prints any record whose first field contains an r character:

$ awk -e '$1 ~ /r/ {print $0}' colours.txt
strawberry red 3
raspberry red 99
grape purple 10

Add a ^ symbol to select only records where r occurs at the beginning of the first field:

$ awk -e '$1 ~ /^r/ {print $0}' colours.txt
raspberry red 99
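
The $ anchor works the same way at the other end of the string. As a sketch, assuming the same sample data piped in directly rather than read from colours.txt, this selects records whose first field ends in e. The NR > 1 test is added so that the header, whose first field (name) also ends in e, is left out:

```shell
sample='name color amount
apple red 4
banana yellow 6
strawberry red 3
raspberry red 99
grape purple 10
apple green 8
plum purple 2
kiwi brown 4
potato brown 9
pineapple yellow 5'

# /e$/ matches only when e is the last character of the first field.
ends_in_e=$(printf '%s\n' "$sample" | awk 'NR > 1 && $1 ~ /e$/ {print $0}')

printf '%s\n' "$ends_in_e"
```

Only the two apple records, grape, and pineapple should be listed.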


Characters

Character  Function
[ad]       Selects a or d
[a-d]      Selects any character a through d (a, b, c, or d)
[^a-d]     Selects any character except a through d (e, f, g, h…)
\w         Selects any word character (a letter, digit, or underscore)
\s         Selects any whitespace character
[0-9]      Selects any digit

The capital versions of w and s are negations; for example, \S selects any character that is not whitespace. Gawk has no \d shorthand for digits, so use [0-9] or the POSIX class [[:digit:]] instead.
POSIX [1] regex offers easy mnemonics for character classes. A class name is only valid inside a bracket expression, so to match a digit you write [[:digit:]]:

POSIX mnemonic  Function
[:alnum:]       Alphanumeric characters
[:alpha:]       Alphabetic characters
[:space:]       Space characters (such as space, tab, and formfeed)
[:blank:]       Space and tab characters
[:upper:]       Uppercase alphabetic characters
[:lower:]       Lowercase alphabetic characters
[:digit:]       Numeric characters
[:xdigit:]      Characters that are hexadecimal digits
[:punct:]       Punctuation characters (i.e., characters that are not letters, digits, control characters, or space characters)
[:cntrl:]       Control characters
[:graph:]       Characters that are both printable and visible (e.g., a space is printable but not visible, whereas an a is both)
[:print:]       Printable characters (i.e., characters that are not control characters)

Quantifiers

Quantifier  Function
.           Matches any character
+           Modifies the preceding set to mean one or more times
*           Modifies the preceding set to mean zero or more times
?           Modifies the preceding set to mean zero or one time
{n}         Modifies the preceding set to mean exactly n times
{n,}        Modifies the preceding set to mean n or more times
{n,m}       Modifies the preceding set to mean between n and m times

Many quantifiers modify the character sets that precede them. For example, . means any character that appears exactly once, but .* means any run of characters, including none. Here’s an example; look at the regex pattern carefully:

$ printf "red\nrd\n"
red
rd
$ printf "red\nrd\n" | awk -e '$0 ~ /^r.d/ {print}'
red
$ printf "red\nrd\n" | awk -e '$0 ~ /^r.*d/ {print}'
red
rd

Similarly, numbers in braces specify the number of times something occurs. To find records in which an e character occurs twice in a row in the second field:

$ awk -e '$2 ~ /e{2}/ {print $0}' colours.txt
apple green 8

Grouped matches

Quantifier  Function
(red)       Parentheses indicate that the enclosed letters must appear contiguously
|           Means or in the context of a grouped match

For instance, the pattern (red) matches the word red and the word ordered (which contains red), but not a word that contains all three of those letters in another order (such as the word order).

Awk like sed with sub() and gsub()
Awk features several functions that perform find-and-replace actions, much like the Unix command sed. Unlike print and printf, which are statements, these are functions. They can be used in awk rules to replace matched text with a new string, whether the replacement is a literal string or a variable.
The sub function substitutes the first matched entity (in a record) with a replacement string. For example, if you have this rule in an awk script:

{ sub(/apple/, "nut", $1);
  print $1 }

running it on the example file colours.txt produces this output:

name
nut
banana
strawberry
raspberry
grape
nut


plum
kiwi
potato
pinenut

The reason both apple and pineapple were replaced with nut is that both are the first match of their records. If the records were different, then the results could differ:

$ printf "apple apple\npineapple apple\n" | \
awk -e 'sub(/apple/, "nut")'
nut apple
pinenut apple

The gsub function substitutes all matching items:

$ printf "apple apple\npineapple apple\n" | \
awk -e 'gsub(/apple/, "nut")'
nut nut
pinenut nut

Gensub
An even more complex version of these functions, called gensub(), is also available.
The gensub function allows you to use the & character to recall the matched text. For example, if you have a file with the word Awk and you want to change it to GNU Awk, you could use this rule:

{ print gensub(/(Awk)/, "GNU &", 1) }

This searches for the group of characters Awk and stores it in memory, represented by the special character &. Then it substitutes the string GNU & for the match, meaning GNU Awk. The 1 at the end tells gensub() to replace only the first occurrence in each record.

$ printf "Awk\nAwk is not Awkward" \
| awk -e ' { print gensub(/(Awk)/, "GNU &",1) }'
GNU Awk
GNU Awk is not Awkward

There’s a time and a place
Awk is a powerful tool, and regex are complex. You might think awk is so very powerful that it could easily replace grep and sed and tr and sort [2] and many more, and in a sense, you’d be right. However, awk is just one tool in a toolbox that’s overflowing with great options. You have a choice about what you use and when you use it, so don’t feel that you have to use one tool for every job great and small.
With that said, awk really is a powerful tool with lots of great functions. The more you use it, the better you get to know it. Remember its capabilities, and fall back on it occasionally so you can get comfortable with it.

Links
[1] https://opensource.com/article/19/7/what-posix-richard-stallman-explains
[2] https://opensource.com/article/19/10/get-sorted-sort
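As a footnote to the gensub() discussion above: the & shorthand is not exclusive to gensub(). The standard sub() and gsub() functions also expand & in the replacement string to the matched text. A minimal sketch, with a made-up input line and bracket markers chosen only for illustration:

```shell
# Wrap every match of "Awk" in square brackets; & stands for the matched text
printf 'Awk is not Awkward\n' | awk '{ gsub(/Awk/, "[&]"); print }'
```

This should print [Awk] is not [Awk]ward. To insert a literal ampersand instead, escape it as \\& inside the awk string.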


4 ways to control the flow of your awk script
Learn to use switch statements and the break, continue, and next commands to control awk scripts.

THERE ARE MANY WAYS to control the flow of an awk script, including loops [1], switch statements, and the break, continue, and next commands.

Sample data
Create a sample data set called colours.txt and copy this content into it:

name color amount
apple red 4
banana yellow 6
strawberry red 3
raspberry red 99
grape purple 10
apple green 8
plum purple 2
kiwi brown 4
potato brown 9
pineapple yellow 5

Switch statements
The switch statement is a feature specific to GNU awk, so you can only use it with gawk. If your system or your target system doesn’t have gawk, then you should not use a switch statement.
The switch statement in gawk is similar to the one in C and many other languages. The syntax is:

switch (expression) {
case VALUE:
    <do something here>
    [...]
default:
    <do something here>
}

The expression part can be any awk expression that returns a numeric or string result. The VALUE part (after the word case) is a numeric or string constant or a regular expression.
When a switch statement runs, the expression is evaluated, and the result is matched against each case value. If there’s a match, then the code contained within a case definition is executed. If there’s no match in any case definition, then the default statement is executed.
The keyword break is at the end of the code in each case definition to leave the switch statement. Without break, execution would fall through into the code of the case definitions that follow, as it does in C.
Here’s an example switch statement:

#!/usr/bin/awk -f
#
# Example of the use of 'switch' in GNU Awk.

NR > 1 {
    printf "The %s is classified as: ",$1

    switch ($1) {
        case "apple":
            print "a fruit, pome"
            break
        case "banana":
        case "grape":
        case "kiwi":
            print "a fruit, berry"
            break
        case "raspberry":
            print "a computer, pi"
            break
        case "plum":
            print "a fruit, drupe"
            break


        case "pineapple":
            print "a fruit, fused berries (syncarp)"
            break
        case "potato":
            print "a vegetable, tuber"
            break
        default:
            print "[unclassified]"
    }
}

This script notably ignores the first line of the file, which in the case of the sample data is just a header. It does this by operating only on records with an index number greater than 1. On each of those records, this script compares the contents of the first field ($1, as you know from previous articles) to the value of each case definition. If there’s a match, the print statement is used to print the botanical classification of the entry. If there are no matches, then the default instance prints "[unclassified]".
The banana, grape, and kiwi are all botanically classified as a berry, so there are three case definitions associated with one print result.
Run the script on the colours.txt sample file, and you should get this:

The apple is classified as: a fruit, pome
The banana is classified as: a fruit, berry
The strawberry is classified as: [unclassified]
The raspberry is classified as: a computer, pi
The grape is classified as: a fruit, berry
The apple is classified as: a fruit, pome
The plum is classified as: a fruit, drupe
The kiwi is classified as: a fruit, berry
The potato is classified as: a vegetable, tuber
The pineapple is classified as: a fruit, fused berries (syncarp)

Break
The break statement is mainly used for the early termination of a for, while, or do-while loop or a switch statement. In a loop, break is often used where it’s not possible to determine the number of iterations of the loop beforehand. Invoking break terminates only the innermost enclosing loop (which matters when loops are nested within loops).
This example, straight out of the GNU awk manual [2], shows a method of finding the smallest divisor. Read the additional comments for a clear understanding of how the code works:

#!/usr/bin/awk -f

{
    num = $1

    # Make an infinite FOR loop
    for (divisor = 2; ; divisor++) {

        # If num is divisible by divisor, then break
        if (num % divisor == 0) {
            printf "Smallest divisor of %d is %d\n", num, divisor
            break
        }

        # If divisor has gotten too large, the number
        # has no divisor, so is a prime
        if (divisor * divisor > num) {
            printf "%d is prime\n", num
            break
        }
    }
}

Try running the script to see its results:

$ echo 67 | ./divisor.awk
67 is prime
$ echo 69 | ./divisor.awk
Smallest divisor of 69 is 3

As you can see, even though the script starts out with an explicit infinite loop with no end condition, the break statement ensures that the script eventually terminates.

Continue
The continue statement is similar to break. It can be used in a for, while, or do-while loop (it’s not relevant to switch statements, though). Invoking continue skips the rest of the enclosing loop body and begins the next cycle.
Here’s another good example from the GNU awk manual to demonstrate a possible use of continue:

#!/usr/bin/awk -f

# Loop, printing numbers 0-20, except 5

BEGIN {
    for (x = 0; x <= 20; x++) {
        if (x == 5)
            continue
        printf "%d ", x
    }
    print ""
}

This script analyzes the value of x before printing anything. If the value is exactly 5, then continue is invoked, which skips the printf line but leaves the loop


unbroken. Try the same code but with break instead to see the difference.

Next
This statement is not related to loops like break and continue are. Instead, next applies to the main record processing cycle of awk: the rules you place between the BEGIN and END rules. The next statement causes awk to stop processing the current input record and to move to the next one.
As you know from the earlier articles in this series, awk reads records from its input stream and applies rules to them. The next statement stops the execution of rules for the current record and moves to the next one.
Here’s an example of next being used to “hold” information upon a specific condition:

#!/usr/bin/awk -f

# Ignore the header
NR == 1 { next }

# If field 2 (colour) is less than 6
# characters, then save it with its
# line number and skip it
length($2) < 6 {
    skip[NR] = $0
    next
}

# It's not the header and
# the colour name is at least 6 characters,
# so print the line
{
    print
}

# At the end, show what was skipped
END {
    printf "\nSkipped:\n"
    for (n in skip)
        print n": "skip[n]
}

This sample uses next in the first rule to avoid the first line of the file, which is a header row. The second rule skips lines when the color name is less than six characters long, but it also saves that line in an array called skip, using the line number as the key (also known as the index).
The third rule prints anything it sees, but it is not invoked if either rule 1 or rule 2 causes it to be skipped.
Finally, at the end of all the processing, the END rule prints the contents of the array.
Run the sample script on the colours.txt file from above (and previous articles):

$ ./next.awk colours.txt
banana yellow 6
grape purple 10
plum purple 2
pineapple yellow 5

Skipped:
2: apple red 4
4: strawberry red 3
5: raspberry red 99
7: apple green 8
9: kiwi brown 4
10: potato brown 9

The skipped records may appear in a different order on your system, because awk does not guarantee the order in which for (n in skip) visits the array.

Control freak
In summary, switch, continue, next, and break are important preemptive exceptions to awk rules that provide greater control of your script. You don’t have to use them directly; often, you can gain the same logic through other means, but they’re great convenience statements that make the coder’s life a lot easier. The next article in this series covers the printf statement.

Links
[1] https://opensource.com/article/19/11/loops-awk
[2] https://www.gnu.org/software/gawk/manual/
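To illustrate the point that the same logic can often be reached by other means, here is a sketch (not from the original article) that reproduces the printed part of the next example with a single pattern instead of two next statements. It uses a shortened, made-up subset of the sample data, prints only the first field, and does not collect the skipped records.

```shell
# Skip the header (NR > 1) and keep only records whose colour name
# has at least six characters, without using "next"
printf 'name color amount\napple red 4\nbanana yellow 6\ngrape purple 10\nkiwi brown 4\n' |
awk 'NR > 1 && length($2) >= 6 { print $1 }'
```

On this input the command should print banana and grape. The trade-off is that a single combined pattern has nowhere to store the rejected records, which is what the version with next and the skip array provides.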


Advance your awk skills with two easy tutorials
Go beyond one-line awk scripts with mail merge and word counting.

AWK IS ONE OF THE OLDEST TOOLS in the Unix and Linux user’s toolbox. Created in the 1970s by Alfred Aho, Peter Weinberger, and Brian Kernighan (the A, W, and K of the tool’s name), awk was created for complex processing of text streams. It is a companion tool to sed, the stream editor, which is designed for line-by-line processing of text files. Awk allows more complex structured programs and is a complete programming language.
This article will explain how to use awk for more structured and complex tasks, including a simple mail merge application.

Awk program structure
An awk script is made up of functional blocks surrounded by {} (curly brackets). There are two special function blocks, BEGIN and END, that execute before processing the first line of the input stream and after the last line is processed. In between, blocks have the format:

pattern { action statements }

Each block executes when the line in the input buffer matches the pattern. If no pattern is included, the function block executes on every line of the input stream.
Also, the following syntax can be used to define functions in awk that can be called from any block:

function name(parameter list) { statements }

This combination of pattern-matching blocks and functions allows the developer to structure awk programs for reuse and readability.

How awk processes text streams
Awk reads text from its input file or stream one line at a time and uses a field separator to parse it into a number of fields. In awk terminology, the current buffer is a record. There are a number of special variables that affect how awk reads and processes a file:

• FS (field separator): By default, this is any whitespace (spaces or tabs)
• RS (record separator): By default, a newline (\n)
• NF (number of fields): When awk parses a line, this variable is set to the number of fields that have been parsed
• $0: The current record
• $1, $2, $3, etc.: The first, second, third, etc. field from the current record
• NR (number of records): The number of records that have been parsed so far by the awk script

There are many other variables that affect awk’s behavior, but this is enough to start with.

Awk one-liners
For a tool so powerful, it’s interesting that most of awk’s usage is basic one-liners. Perhaps the most common awk program prints selected fields from an input line from a CSV file, a log file, etc. For example, the following one-liner prints a list of usernames from /etc/passwd:

awk -F":" '{print $1 }' /etc/passwd

As mentioned above, $1 is the first field in the current record. The -F option sets the FS variable to the character :.
The field separator can also be set in a BEGIN function block:

awk 'BEGIN { FS=":" } {print $1 }' /etc/passwd

In the following example, every user whose shell is not /sbin/nologin can be printed by preceding the block with a pattern match:


awk 'BEGIN { FS=":" } ! /\/sbin\/nologin/ {print $1 }' /etc/passwd

Advanced awk: Mail merge
Now that you have some of the basics, try delving deeper into awk with a more structured example: creating a mail merge.
A mail merge uses two files, one (called in this example email_template.txt) containing a template for an email you want to send:

From: Program committee <[email protected]>
To: {firstname} {lastname} <{email}>
Subject: Your presentation proposal

Dear {firstname},

Thank you for your presentation proposal:
{title}

We are pleased to inform you that your proposal has been successful! We
will contact you shortly with further information about the event
schedule.

Thank you,
The Program Committee

And the other is a CSV file (called proposals.csv) with the people you want to send the email to:

firstname,lastname,email,title
Harry,Potter,[email protected],"Defeating your nemesis in 3 easy steps"
Jack,Reacher,[email protected],"Hand-to-hand combat for beginners"
Mickey,Mouse,[email protected],"Surviving public speaking with a squeaky voice"
Santa,Claus,[email protected],"Efficient list-making"

You want to read the CSV file (skipping its header line), replace the relevant fields in the template, then write the result to a file called acceptanceN.txt, incrementing N for each line you parse.
Write the awk program in a file called mail_merge.awk. Statements are separated by ; (or by newlines) in awk scripts. The first task is to set the field separator variable and a couple of other variables the script needs. You also need to read and discard the first line in the CSV, or a file will be created starting with Dear firstname. To do this, use the special function getline and reset the record counter to 0 after reading it.

BEGIN {
    FS=",";
    template="email_template.txt";
    output="acceptance";
    getline;
    NR=0;
}

The main block is very straightforward: for each line processed, a variable is set for the various fields: firstname, lastname, email, and title. The template file is read line by line, and the function sub is used to substitute the first occurrence of each special character sequence on the line with the value of the relevant variable. Then the line, with any substitutions made, is output to the output file.
Since you are dealing with the template file and a different output file for each line, you need to clean up and close the file handles for these files before processing the next record.

{
    # Read relevant fields from input file
    firstname=$1;
    lastname=$2;
    email=$3;
    title=$4;

    # Set output filename
    outfile=(output NR ".txt");

    # Read a line from template, replace special
    # fields, and print result to output file
    while ( (getline ln < template) > 0 )
    {
        sub(/{firstname}/,firstname,ln);
        sub(/{lastname}/,lastname,ln);
        sub(/{email}/,email,ln);
        sub(/{title}/,title,ln);
        print(ln) > outfile;
    }

    # Close template and output file in advance of
    # next record
    close(outfile);
    close(template);
}

You’re done! Run the script on the command line with:

awk -f mail_merge.awk proposals.csv

or


awk -f mail_merge.awk < proposals.csv

and you will find text files generated in the current directory.

Advanced awk: Word frequency count
One of the most powerful features in awk is the associative array. In most programming languages, array entries are typically indexed by a number, but in awk, arrays are referenced by a key string. For example, you could store an entry from the file proposals.csv from the previous section in a single associative array, like this:

proposer["firstname"]=$1;
proposer["lastname"]=$2;
proposer["email"]=$3;
proposer["title"]=$4;

This makes text processing very easy. A simple program that uses this concept is a word frequency counter. You can parse a file, break out words (ignoring punctuation) in each line, increment the counter for each word in the line, then output the top 20 words that occur in the text.
First, in a file called wordcount.awk, set the field separator to a regular expression that includes whitespace and punctuation:

BEGIN {
    # ignore 1 or more consecutive occurrences of the characters
    # in the character group below
    FS="[ .,:;()<>{}@!\"'\t]+";
}

Next, the main block will iterate over each field, ignoring any empty fields (which happens if there is punctuation at the end of a line), and increment the word count for the words in the line.

{
    for (i = 1; i <= NF; i++) {
        if ($i != "") {
            words[$i]++;
        }
    }
}

Finally, after the text is processed, use the END block to print the contents of the array, then use awk’s capability of piping output into a shell command to do a numerical sort and print the 20 most frequently occurring words:

END {
    sort_head = "sort -k2 -nr | head -n 20";
    for (word in words) {
        printf "%s\t%d\n", word, words[word] | sort_head;
    }
    close (sort_head);
}

Running this script on an earlier draft of this article produced this output:

[[email protected]]$ awk -f wordcount.awk < awk_article.txt
the       79
awk       41
a         39
and       33
of        32
in        27
to        26
is        25
line      23
for       23
will      22
file      21
we        16
We        15
with      12
which     12
by        12
this      11
output    11
function  11

What’s next?
If you want to learn more about awk programming, I strongly recommend the book Sed and awk [1] by Dale Dougherty and Arnold Robbins.
One of the keys to progressing in awk programming is mastering “extended regular expressions.” Awk offers several powerful additions to the sed regular expression [2] syntax you may already be familiar with.
Another great resource for learning awk is the GNU awk user guide [3]. It has a full reference for awk’s built-in function library, as well as lots of examples of simple and complex awk scripts.

Links
[1] https://www.amazon.com/sed-awk-Dale-Dougherty/dp/1565922255/book
[2] https://en.wikibooks.org/wiki/Regular_Expressions/POSIX-Extended_Regular_Expressions
[3] https://www.gnu.org/software/gawk/manual/gawk.html
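The word-count tutorial above can be tried end to end on a tiny input. The following is a reduced sketch under stated assumptions, not the article's wordcount.awk: the two input lines are made up, the default whitespace field separator is used instead of the punctuation-aware FS, and the sort keys (-k2,2nr -k1,1) are an illustrative choice that orders by count and breaks ties alphabetically.

```shell
# Count words in two short lines, then let sort(1) order the result:
# highest count first, ties broken alphabetically by word
printf 'the cat and the hat\nthe end\n' |
awk '
{
    for (i = 1; i <= NF; i++)
        words[$i]++
}
END {
    cmd = "sort -k2,2nr -k1,1"
    for (w in words)
        printf "%s\t%d\n", w, words[w] | cmd
    close(cmd)   # flush the pipe and wait for sort to finish
}'
```

The entry for the (count 3) should be listed first, followed by the words that occur once. Keeping the command in a variable, as the article does with sort_head, guarantees that the pipe and the close() call refer to exactly the same string.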


How to remove duplicate lines from files with awk
Learn how to use awk ‘!visited[$0]++’ to remove duplicate lines without sorting them or changing their order.

SUPPOSE YOU HAVE a text file and you need to remove all of its duplicate lines.

TL;DR
To remove the duplicate lines while preserving their order in the file, use:

awk '!visited[$0]++' your_file > deduplicated_file

How it works
The script keeps an associative array with indices equal to the unique lines of the file and values equal to their occurrences. For each line of the file, if the line occurrences are zero, then it increases them by one and prints the line, otherwise, it just increases the occurrences without printing the line.
I was not familiar with awk, and I wanted to understand how this can be accomplished with such a short script (awkward). I did my research, and here is what is going on:

• The awk “script” !visited[$0]++ is executed for each line of the input file.
• visited[] is a variable of type associative array [1] (a.k.a. Map [2]). We don’t have to initialize it because awk will do it the first time we access it.
• The $0 variable holds the contents of the line currently being processed.
• visited[$0] accesses the value stored in the map with a key equal to $0 (the line being processed), a.k.a. the occurrences (which we set below).
• The ! negates the occurrences’ value:
    • In awk, any nonzero numeric value or any nonempty string value is true [3].
    • By default, variables are initialized to the empty string [4], which is zero if converted to a number.
    • That being said:
        • If visited[$0] returns a number greater than zero, this negation is resolved to false.
        • If visited[$0] returns a number equal to zero or an empty string, this negation is resolved to true.
• The ++ operation increases the variable’s value (visited[$0]) by one.
    • If the value is empty, awk converts it to 0 (number) automatically and then it gets increased.
    • Note: The operation is executed after we access the variable’s value.

Summing up, the whole expression evaluates to:

• true if the occurrences are zero/empty string
• false if the occurrences are greater than zero

awk statements consist of a pattern-expression and an associated action [5].

<pattern/expression> { <action> }

If the pattern succeeds, then the associated action is executed. If we don’t provide an action, awk, by default, prints the input.

An omitted action is equivalent to { print $0 }.

Our script consists of one awk statement with an expression, omitting the action. So this:

awk '!visited[$0]++' your_file > deduplicated_file

is equivalent to this:

awk '!visited[$0]++ { print $0 }' your_file > deduplicated_file

For every line of the file, if the expression succeeds, the line is printed to the output. Otherwise, the action is not executed, and nothing is printed.

Why not use the uniq command?
The uniq command removes only the adjacent duplicate lines. Here’s a demonstration:


$ cat test.txt
A
A
A
B
B
B
A
A
C
C
C
B
B
A
$ uniq < test.txt
A
B
A
C
B
A

Other approaches
Using the sort command
We can also use the following sort [6] command to remove the duplicate lines, but the line order is not preserved.

sort -u your_file > sorted_deduplicated_file

Using cat, sort, and cut
The previous approach would produce a de-duplicated file whose lines would be sorted based on the contents. Piping a bunch of commands [7] can overcome this issue:

cat -n your_file | sort -uk2 | sort -nk1 | cut -f2-

How it works
Suppose we have the following file:

abc
ghi
abc
def
xyz
def
ghi
klm

cat -n test.txt prepends the order number in each line.

1 abc
2 ghi
3 abc
4 def
5 xyz
6 def
7 ghi
8 klm

sort -uk2 sorts the lines based on the second column (k2 option) and keeps only the first occurrence of the lines with the same second column value (u option).

1 abc
4 def
2 ghi
8 klm
5 xyz

sort -nk1 sorts the lines based on their first column (k1 option) treating the column as a number (-n option).

1 abc
2 ghi
4 def
5 xyz
8 klm

Finally, cut -f2- prints each line starting from the second column until its end (-f2- option: Note the - suffix, which instructs it to include the rest of the line).

abc
ghi
def
xyz
klm

References
• The GNU awk user’s guide
• Arrays in awk
• Awk—Truth values
• Awk expressions
• How can I delete duplicate lines in a file in Unix?
• Remove duplicate lines without sorting [duplicate]
• How does awk ‘!a[$0]++’ work?

Links
[1] http://kirste.userpage.fu-berlin.de/chemnet/use/info/gawk/gawk_12.html
[2] https://en.wikipedia.org/wiki/Associative_array
[3] https://www.gnu.org/software/gawk/manual/html_node/Truth-Values.html
[4] https://ftp.gnu.org/old-gnu/Manuals/gawk-3.0.3/html_chapter/gawk_8.html
[5] http://kirste.userpage.fu-berlin.de/chemnet/use/info/gawk/gawk_9.html
[6] http://man7.org/linux/man-pages/man1/sort.1.html
[7] https://stackoverflow.com/a/20639730/2292448
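For comparison with the cat, sort, and cut pipeline, the awk one-liner from the TL;DR can be run on the same eight sample lines. This is a small sketch that feeds the lines through printf instead of a file; the second command is a hypothetical longhand spelling of the same logic, written out only to make the steps visible.

```shell
# The one-liner: print a line only the first time it is seen
printf 'abc\nghi\nabc\ndef\nxyz\ndef\nghi\nklm\n' | awk '!visited[$0]++'

# Longhand equivalent: test the count first, then increment it
printf 'abc\nghi\nabc\ndef\nxyz\ndef\nghi\nklm\n' |
awk '{ if (visited[$0] == 0) print; visited[$0]++ }'
```

Both commands should print abc, ghi, def, xyz, and klm, one per line, which is the same result as the four-command pipeline but in a single pass and without sorting.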


Awk one-liners and scripts to help you sort text files
Awk is a powerful tool for doing tasks that might otherwise be left to other common utilities, including sort.

AWK IS THE UBIQUITOUS UNIX COMMAND for scanning and processing text containing predictable patterns. However, because it features functions, it’s also justifiably called a programming language.
Confusingly, there is more than one awk. (Or, if you believe there can be only one, then there are several clones.) There’s awk, the original program written by Aho, Weinberger, and Kernighan, and then there’s nawk, mawk, and the GNU version, gawk. The GNU version of awk is a highly portable, free software version of the utility with several unique features, so this article is about GNU awk.
While its official name is gawk, on GNU+Linux systems it’s aliased to awk and serves as the default version of that command. On other systems that don’t ship with GNU awk, you must install it and refer to it as gawk, rather than awk. This article uses the terms awk and gawk interchangeably.
Being both a command and a programming language makes awk a powerful tool for tasks that might otherwise be left to sort, cut, uniq, and other common utilities. Luckily, there’s lots of room in open source for redundancy, so if you’re faced with the question of whether or not to use awk, the answer is probably a solid “maybe.”
The beauty of awk’s flexibility is that if you’ve already committed to using awk for a task, then you can probably stay in awk no matter what comes up along the way. This includes the eternal need to sort data in a way other than the order it was delivered to you.

Sample set
Before exploring awk’s sorting methods, generate a sample dataset to use. Keep it simple so that you don’t get distracted by edge cases and unintended complexity. This is the sample set this article uses:

Aptenodytes;forsteri;Miller,JF;1778;Emperor
Pygoscelis;papua;Wagler;1832;Gentoo
Eudyptula;minor;Bonaparte;1867;Little Blue
Spheniscus;demersus;Brisson;1760;African
Megadyptes;antipodes;Milne-Edwards;1880;Yellow-eyed
Eudyptes;chrysocome;Viellot;1816;Sothern Rockhopper
Torvaldis;linux;Ewing,L;1996;Tux

It’s a small dataset, but it offers a good variety of data types:

• A genus and species name, which are associated with one another but considered separate
• A surname, sometimes with first initials after a comma
• An integer representing a date
• An arbitrary term
• All fields separated by semi-colons

Depending on your educational background, you may consider this a 2D array or a table or just a line-delimited collection of data. How you think of it is up to you, because awk doesn’t expect anything more than text. It’s up to you to tell awk how you want to parse it.

The sort cheat
If you just want to sort a text dataset by a specific, definable field (think of a “cell” in a spreadsheet), then you can use the sort command [1].

Fields and records
Regardless of the format of your input, you must find patterns in it so that you can focus on the parts of the data that are important to you. In this example, the data is delimited by two factors: lines and fields. Each new line represents a new record, as you would likely see in a spreadsheet or database dump. Within each line, there are distinct fields (think of them as cells in a spreadsheet) that are separated by semicolons (;).
Awk processes one record at a time, so while you’re structuring the instructions you will give to awk, you can focus on just


one line. Establish what you want to do with one line, then test it (either mentally or with awk) on the next line and a few more. You’ll end up with a good hypothesis on what your awk script must do in order to provide you with the data structure you want.
In this case, it’s easy to see that each field is separated by a semicolon. For simplicity’s sake, assume you want to sort the list by the very first field of each line.
Before you can sort, you must be able to focus awk on just the first field of each line, so that’s the first step. The syntax of an awk command in a terminal is awk, followed by relevant options, followed by your awk command, and ending with the file of data you want to process.

$ awk --field-separator=";" '{print $1;}' penguins.list
Aptenodytes
Pygoscelis
Eudyptula
Spheniscus
Megadyptes
Eudyptes
Torvaldis

Because the field separator is a character that has special meaning to the Bash shell, you must enclose the semicolon in quotes or precede it with a backslash. This command is useful only to prove that you can focus on a specific field. You can try the same command using the number of another field to view the contents of another “column” of your data:

$ awk --field-separator=";" '{print $3;}' penguins.list
Miller,JF
Wagler
Bonaparte
Brisson
Milne-Edwards
Viellot
Ewing,L

Nothing has been sorted yet, but this is good groundwork.

Scripting
Awk is more than just a command; it’s a programming language with indices and arrays and functions. That’s significant because it means you can grab a list of fields you want to sort by, store the list in memory, process it, and then print the resulting data. For a complex series of actions such as this, it’s easier to work in a text file, so create a new file called sorter.awk and enter this text:

#!/usr/bin/awk -f

BEGIN {
    FS=";";
}

This establishes the file as an awk script that executes the lines contained in the file.
The BEGIN statement is a special setup rule provided by awk for tasks that need to occur only once, before any input is read. Defining the built-in variable FS, which stands for field separator and is the same value you set in your awk command with --field-separator, only needs to happen once, so it’s included in the BEGIN statement.

Arrays in awk
You already know how to gather the values of a specific field by using the $ notation along with the field number, but in this case, you need to store them in memory rather than print them to the terminal. This is done with an awk array. The important thing about an awk array is that it contains keys and values. Imagine an array about this article; it would look something like this: author:"seth",title:"How to sort with awk",length:1200. Elements like author and title and length are keys, with the following contents being values.
The advantage to this in the context of sorting is that you can assign any field as the key and any record as the value, and then use the built-in awk function asorti() (sort by index) to sort by the key. For now, assume arbitrarily that you only want to sort by the second field.
Awk rules not preceded by the special keywords BEGIN or END run once for each record. This is the part of the script that scans the data for patterns and processes it accordingly. Each time awk turns its attention to a record, statements in {} (unless preceded by BEGIN or END) are executed.
To add a key and value to an array, create a variable (in this example script, I call it ARRAY, which isn’t terribly original, but very clear) containing an array, and then assign it a key in brackets and a value with an equals sign (=).

{ # dump each record into an array, keyed by field 2
    ARRAY[$2] = $0;
}

In this statement, the contents of the second field ($2) are used as the key term, and the current record ($0) is used as the value.

The asorti() function
In addition to arrays, awk has several basic functions that you can use as quick and easy solutions for common tasks. One of the functions introduced in GNU awk, asorti(), sorts an array by key (index); its companion, asort(), sorts by value. Both are gawk extensions, so the awk that runs your script must be GNU awk.
You can only sort the array once it has been populated, meaning that this action must not occur with every new record but only at the final stage of your script. For this purpose, awk provides the special END keyword. The inverse of BEGIN, an END statement happens only once and only after all records have been scanned.
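
As a minimal sketch of the key/value idea described under “Arrays in awk,” the following command stores two of the sample records in an array keyed by the second field and then looks one up by key. The array name by_species and the use of printf to supply the input are illustrative choices, not part of the article’s script.

```shell
# Feed two sample records to awk and index them by field 2.
out=$(printf '%s\n' \
    'Aptenodytes;forsteri;Miller,JF;1778;Emperor' \
    'Pygoscelis;papua;Wagler;1832;Gentoo' |
awk -F';' '
{ by_species[$2] = $0 }               # key: field 2, value: the whole record
END {
    print by_species["papua"]         # look up a value by its key
    if ("linux" in by_species)        # test for a key without creating it
        print "linux: present"
    else
        print "linux: absent"
}')
echo "$out"
```

This should print the Gentoo record followed by "linux: absent". Because keys are unique, a later record with the same key would replace the earlier one.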


Add this to your script:

END {
    asorti(ARRAY,SARRAY);
    # get length
    j = length(SARRAY);

    for (i = 1; i <= j; i++) {
        printf("%s %s\n", SARRAY[i],ARRAY[SARRAY[i]])
    }
}

The asorti() function takes the indices (keys) of ARRAY, sorts them, and stores them as the values of a new array called SARRAY (an arbitrary name I invented for this article, meaning Sorted ARRAY), numbered from 1.
Next, the variable j (another arbitrary name) is assigned the results of the length() function, which counts the number of items in SARRAY.
Finally, use a for loop to iterate through each item in SARRAY using the printf() function to print each key, followed by the corresponding value of that key in ARRAY.

Running the script
To run your awk script, make it executable:

$ chmod +x sorter.awk

And then run it against the penguins.list sample data:

$ ./sorter.awk penguins.list
antipodes Megadyptes;antipodes;Milne-Edwards;1880;Yellow-eyed
chrysocome Eudyptes;chrysocome;Viellot;1816;Sothern Rockhopper
demersus Spheniscus;demersus;Brisson;1760;African
forsteri Aptenodytes;forsteri;Miller,JF;1778;Emperor
linux Torvaldis;linux;Ewing,L;1996;Tux
minor Eudyptula;minor;Bonaparte;1867;Little Blue
papua Pygoscelis;papua;Wagler;1832;Gentoo

As you can see, the data is sorted by the second field.
This is a little restrictive. It would be better to have the flexibility to choose at runtime which field you want to use as your sorting key so you could use this script on any dataset and get meaningful results.

Adding command options
You can pass a variable to an awk script from the command line with the -v option. Use the variable name var in your script, and change the rule that runs for each record so that it uses $var when creating your array:

{ # dump each record into an array, keyed by field number var
    ARRAY[$var] = $0;
}

Try running the script so that it sorts by the third field by using the -v var option when you execute it:

$ ./sorter.awk -v var=3 penguins.list
Bonaparte Eudyptula;minor;Bonaparte;1867;Little Blue
Brisson Spheniscus;demersus;Brisson;1760;African
Ewing,L Torvaldis;linux;Ewing,L;1996;Tux
Miller,JF Aptenodytes;forsteri;Miller,JF;1778;Emperor
Milne-Edwards Megadyptes;antipodes;Milne-Edwards;1880;Yellow-eyed
Viellot Eudyptes;chrysocome;Viellot;1816;Sothern Rockhopper
Wagler Pygoscelis;papua;Wagler;1832;Gentoo

Fixes
This article has demonstrated how to sort data in pure GNU awk. The script can be improved, so if it’s useful to you, spend some time researching awk functions [2] on gawk’s man page and customizing the script for better output.
Here is the complete script so far:

#!/usr/bin/awk -f
# GPLv3 appears here
# usage: ./sorter.awk -v var=NUM FILE

BEGIN { FS=";"; }

{ # dump each record into an array, keyed by field number var
    ARRAY[$var] = $0;
}

END {
    asorti(ARRAY,SARRAY);
    # get length
    j = length(SARRAY);

    for (i = 1; i <= j; i++) {
        printf("%s %s\n", SARRAY[i],ARRAY[SARRAY[i]])
    }
}

Links
[1] https://opensource.com/article/19/10/get-sorted-sort
[2] https://www.gnu.org/software/gawk/manual/html_node/Built_002din.html#Built_002din
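
The $var expression in the script works because awk evaluates whatever follows $ as a field number. Here is a small sketch of that behavior, using one record from the sample set and a variable passed with -v; the printf input is only for illustration.

```shell
# var holds a field number, so $var is field 3 and $(var + 1) is field 4.
out=$(printf '%s\n' 'Aptenodytes;forsteri;Miller,JF;1778;Emperor' |
awk -F';' -v var=3 '{ print $var; print $(var + 1); print $NF }')
echo "$out"
```

NF is the number of fields in the record, so $NF is the last field. If var is left unset, $var is the same as $0, the whole record.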


A gawk script to convert smart quotes

I manage a personal website and edit the web pages by hand. Since I don’t have many pages on my site, this works well for me, letting me “scratch the itch” of getting into the site’s code.
When I updated my website’s design recently, I decided to turn all the plain quotes into “smart quotes,” or quotes that look like those used in print material: “” instead of "".
Editing all of the quotes by hand would take too long, so I decided to automate the process of converting the quotes in all of my HTML files. But doing so via a script or program requires some intelligence. The script needs to know when to convert a plain quote to a smart quote, and which quote to use.
You can use different methods to convert quotes. Greg Pittman wrote a Python script [1] for fixing smart quotes in text. I wrote mine in GNU awk (gawk) [2].
To start, I wrote a simple gawk function to evaluate a single character. If that character is a quote, the function decides which smart quote to output. The function looks at the previous character; if the previous character is a space, the function outputs a left smart quote. Otherwise, the function outputs a right smart quote. The script does the same for single quotes.

function smartquote (char, prevchar) {
    # print smart quotes depending on the previous character;
    # otherwise just print the character as-is

    if (prevchar ~ /\s/) {
        # prev char is a space
        if (char == "'") {
            printf("&lsquo;");
        }
        else if (char == "\"") {
            printf("&ldquo;");
        }
        else {
            printf("%c", char);
        }
    }
    else {
        # prev char is not a space
        if (char == "'") {
            printf("&rsquo;");
        }
        else if (char == "\"") {
            printf("&rdquo;");
        }
        else {
            printf("%c", char);
        }
    }
}

With that function, the body of the gawk script processes the HTML input file character by character. The script prints all text verbatim when inside an HTML tag (for example, <html lang="en">). Outside any HTML tags, the script uses the smartquote() function to print text. The smartquote() function does the work of deciding whether to print a left or a right smart quote, and it prints every other character unchanged.

function smartquote (char, prevchar) {
    ...
}

BEGIN {htmltag = 0}

{
    # for each line, scan one letter at a time:

    linelen = length($0);

    prev = "\n";

    for (i = 1; i <= linelen; i++) {
        char = substr($0, i, 1);

        if (char == "<") {
            htmltag = 1;
        }

        if (htmltag == 1) {
            printf("%c", char);
        }
        else {
            smartquote(char, prev);
            prev = char;
        }

        if (char == ">") {
            htmltag = 0;
        }
    }

    # add trailing newline at end of each line
    printf ("\n");
}

Here’s an example:

gawk -f quotes.awk test.html > test2.html

Sample input:

<!DOCTYPE html>
<html lang="en">
<head>
<title>Test page</title>
<link rel="stylesheet" type="text/css" href="/test.css" />
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width" />
</head>
<body>
<h1><a href="/"><img src="logo.png" alt="Website logo" /></a></h1>
<p>"Hi there!"</p>
<p>It's and its.</p>
</body>
</html>

Sample output:

<!DOCTYPE html>
<html lang="en">
<head>
<title>Test page</title>
<link rel="stylesheet" type="text/css" href="/test.css" />
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width" />
</head>
<body>
<h1><a href="/"><img src="logo.png" alt="Website logo" /></a></h1>
<p>&ldquo;Hi there!&rdquo;</p>
<p>It&rsquo;s and its.</p>
</body>
</html>

Links
[1] https://opensource.com/article/17/3/python-scribus-smart-quotes
[2] https://opensource.com/downloads/cheat-sheet-awk-features
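
To try the idea on a single line without creating any files, here is a condensed sketch of the same approach. It is a variation, not the author’s script: this smartquote() returns a string instead of printing, the line is assembled in a variable named line, and whitespace is tested with explicit comparisons in place of the \s operator so that the sketch does not depend on gawk. The input line is made up for illustration.

```shell
# One line of HTML; \047 in the printf format is an apostrophe.
out=$(printf '<p class="x">"Hi there!" It\047s fine.</p>\n' | awk '
# Return the HTML entity for a quote character, or the character unchanged.
# "\047" is an apostrophe, written as an octal escape to keep shell quoting simple.
function smartquote(char, prevchar,    opening) {
    opening = (prevchar == " " || prevchar == "\t" || prevchar == "\n")
    if (char == "\047") return opening ? "&lsquo;" : "&rsquo;"
    if (char == "\"")   return opening ? "&ldquo;" : "&rdquo;"
    return char
}
{
    prev = "\n"; line = ""
    for (i = 1; i <= length($0); i++) {
        char = substr($0, i, 1)
        if (char == "<") htmltag = 1
        if (htmltag) {
            line = line char                     # inside a tag: copy verbatim
        } else {
            line = line smartquote(char, prev)   # outside a tag: convert quotes
            prev = char
        }
        if (char == ">") htmltag = 0
    }
    print line
}')
echo "$out"
```

The quotes inside the class attribute should come through untouched, while the quotes in the text become &ldquo;, &rdquo;, and &rsquo;.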


Drinking coffee with AWK

Keep track of what your office mates owe for the coffee they drink with a simple AWK program.

The following is based on a true story, although some names and details have been changed.

A long time ago, in a place far away, there was an office. The office did not, for various reasons, buy instant coffee. Some workers in that office got together and decided to institute the “Coffee Corner.”

A member of the Coffee Corner would buy some instant coffee, and the other members would pay them back. It came to pass that some people drank more coffee than others, so the level of a “half-member” was added: a half-member was allowed a limited number of coffees per week and would pay half of what a member paid.

Managing this was a huge pain. I had just read The Unix Programming Environment and wanted to practice my AWK [1] programming. So I volunteered to create a system.
Step 1: I kept a database of members and their debt to the Coffee Corner. I did it in an AWK-friendly format, where fields are separated by colons:

member:john:1:22
member:jane:0.5:33
member:pratyush:0.5:17
member:jing:1:27

The first field above identifies what kind of row this is (member). The second field is the member’s name (i.e., their email username without the @). The next field is their membership level (full=1 or half=0.5). The last field is their debt to the Coffee Corner. A positive number means they owe money, a negative number means the Coffee Corner owes them.
Step 2: I kept a log of inputs to and outputs from the Coffee Corner:

payment:jane:33
payment:pratyush:17
bought:john:60
payback:john:50

Jane paid $33, Pratyush paid $17, John bought $60 worth of coffee, and the Coffee Corner paid John $50.
Step 3: I was ready to write some code. The code would process the members and payments and spit out an updated members file with the new debts.

#!/usr/bin/env --split-string=awk -F: -f

The shebang (#!) line required some work! I used the env command to allow passing multiple arguments from the shebang: specifically, the -F command-line argument to AWK tells it what the field separator is.
An AWK program is a sequence of rules. (It can also contain function definitions, but I don’t need any for the Coffee Corner.)
The first rule reads the members file. When I run the command, I always give it the members file first, and the payments file second. It uses AWK associative arrays to record membership levels in the members array and current debt in the debt array.

$1 == "member" {
    members[$2]=$3
    debt[$2]=$4
    total_members += $3
}

The second rule reduces the debt when a payment is recorded.

$1 == "payment" {
    debt[$2] -= $3
}


Payback is the opposite: it increases the debt. This elegantly supports the case of accidentally giving someone too much money.

$1 == "payback" {
    debt[$2] += $3
}

The most complicated part happens when someone buys ("bought") instant coffee for the Coffee Corner’s use. It is treated as a payment and the person’s debt is reduced by the appropriate amount. Next, it calculates the fee per full membership; because total_members is the sum of the membership levels, a half-member counts as half a share. It iterates over all members and increases their debt, according to their level of membership.

$1 == "bought" {
    debt[$2] -= $3
    per_member = $3/total_members
    for (x in members) {
        debt[x] += per_member * members[x]
    }
}

The END pattern is special: it happens exactly once, when AWK has no more lines to process. At this point, it spits out the new members file with updated debt levels, in the same member:name:level:debt format as the input so that it can be read back in the next time.

END {
    for (x in members) {
        printf "member:%s:%s:%s\n", x, members[x], debt[x]
    }
}

Along with a script that iterates over the members and sends a reminder email to people to pay their dues (for positive debts), this system managed the Coffee Corner for quite a while.

Links
[1] https://en.wikipedia.org/wiki/AWK
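
To check the arithmetic, the rules can be run against the sample members and payments shown earlier. This is a sketch under a few assumptions: the file names members and payments, the temporary directory, and the final sort (added only because for (x in members) visits members in no particular order) are illustrative, and the END rule prints the leading member: field so that the output has the same shape as the input.

```shell
workdir=$(mktemp -d)
cat > "$workdir/members" <<'EOF'
member:john:1:22
member:jane:0.5:33
member:pratyush:0.5:17
member:jing:1:27
EOF
cat > "$workdir/payments" <<'EOF'
payment:jane:33
payment:pratyush:17
bought:john:60
payback:john:50
EOF

# Members file first, payments file second, as in the article.
out=$(awk -F: '
$1 == "member"  { members[$2] = $3; debt[$2] = $4; total_members += $3 }
$1 == "payment" { debt[$2] -= $3 }
$1 == "payback" { debt[$2] += $3 }
$1 == "bought"  {
    debt[$2] -= $3
    per_member = $3 / total_members            # 60 / 3 shares = 20 per full member
    for (x in members) debt[x] += per_member * members[x]
}
END { for (x in members) printf "member:%s:%s:%s\n", x, members[x], debt[x] }
' "$workdir/members" "$workdir/payments" | LC_ALL=C sort)
echo "$out"
rm -r "$workdir"
```

Working through it by hand: the membership levels sum to 3, so the $60 purchase costs $20 per full member and $10 per half-member. Jane and Pratyush go from 0 to 10, Jing from 27 to 47, and John from 22 to -38, then -18, and finally 32 after the $50 payback.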



Opensource.com: GNU awk Cheat Sheet, by Jim Hall

Use this handy quick reference guide to the most commonly used features of GNU awk (gawk).

COMMAND-LINE USAGE

Run a gawk script using -f or include a short script right on the command line.

gawk -f file.awk file1 file2…

or:

gawk 'pattern {action}' file1 file2…

also: set the field separator using -F

gawk -F: …

PATTERNS

All program lines are some combination of a pattern and actions:

pattern {action}

where pattern can be:

• BEGIN (matches start of input)
• END (matches end of input)
• a regular expression (act only on matching lines)
• a comparison (act only when true)
• empty (act on all lines)

ACTIONS

Actions are very similar to C programming.
Actions can span multiple lines.
End statements with a semicolon (;) or a newline.
For example:

BEGIN { FS = ":"; }

{ print "Hello world"; }

{
    print;
    i = i + 1;
}

FIELDS

Gawk does the work for you and splits input lines so you can reference them by field. Use -F on the command line or set FS to set the field separator.
• Reference fields using $
• $1 for the first field, and so on
• Use $0 for the entire line
For example:

gawk '{print "1st word:", $1;}' file.txt

or:

gawk -F: '{print "uid", $3;}' /etc/passwd

REGULAR EXPRESSIONS

Common regular expression patterns include:

^        Matches start of a line
$        Matches end of a line
.        Matches any character, including newline
a        Matches a single letter a
a+       Matches one or more a's
a*       Matches zero or more a's
a?       Matches zero or one a's
[abc]    Matches any of the characters a, b, or c
[^abc]   Negation; matches any character except a, b, or c
\.       Use backslash (\) to match a special character (like .)

You can also use character classes inside a bracket expression (for example, [[:alpha:]]), including:

[:alpha:]   Any alphabetic character
[:lower:]   Any lowercase letter
[:upper:]   Any uppercase letter
[:digit:]   Any numeric character
[:alnum:]   Any alphanumeric character
[:cntrl:]   Any control character
[:blank:]   Spaces or tabs
[:space:]   Spaces, tabs, and other white space (such as linefeed)

OPERATORS

(…)                   Grouping
++ --                 Increment and decrement
^                     Exponents
+ - !                 Unary plus, minus, and negation
* / %                 Multiply, divide, and modulo
+ -                   Add and subtract
< > <= >= == !=       Relations
~ !~                  Regular expression match or negated match
&&                    Logical AND
||                    Logical OR
= += -= *= /= %= ^=   Assignment
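
The pattern types listed under PATTERNS can be combined in one program. A short sketch follows; the three passwd-style input lines are made up for illustration, and the command is written with awk but behaves the same with gawk.

```shell
# Three passwd-style lines, made up for illustration.
out=$(printf '%s\n' 'root:x:0:0' 'daemon:x:1:1' 'alice:x:1000:1000' |
awk -F: '
BEGIN      { print "users:" }                    # once, before any input
/^root/    { print "  regex match:", $1 }        # lines starting with root
$3 >= 1000 { print "  comparison match:", $1 }   # numeric test on field 3
END        { print NR, "lines read" }            # once, after all input
')
echo "$out"
```

Each input line is tested against every rule in order, so a line can trigger more than one action or none at all.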

opensource.com Twitter @opensourceway | facebook.com/opensourceway | CC BY-SA 4.0


FLOW CONTROL

You can use many common flow control and loop structures, including if, while, do-while, for, and switch.

if (i < 10) { print; }

while (i < 10) { print; i++; }

do {
    print;
    i++;
} while (i < 10);

for (i = 1; i < 10; i++) { print i; }

switch (n) {
    case 1: print "yes"; break;
    default: print "no";
}

FUNCTIONS

Frequently-used string functions include:

print "hello world"
print "user:" $1
print $1, $2
print i
print

Print a value or string. If you don't give a value, outputs $0 instead.
Use commas (,) to separate the values with a space in the output.
Use spaces ( ) to join the values together with nothing in between.

printf(fmt, values…)

The standard C printf function.

sprintf(fmt, values…)

Similar to the standard C sprintf function, returns the new string.

index(str, sub)

Return the index of the substring sub in the string str, or zero if not found.

length([str])

Return the length of the string $0.
If you include the string str, give that length instead.

substr(str, pos [, n])

Return the next n characters of the string str, starting at position pos.
If n is omitted, return the rest of the string str.

tolower(str)

Return the string str, converted to all lowercase.

toupper(str)

Return the string str, converted to all uppercase.

Other common string functions include:

match(str, regex)

Return the position of the first occurrence of the regular expression regex in the string str, or zero if there is no match.

sub(regex, repl [, str])

For the first substring of $0 that matches the regular expression regex, replace it with repl.
If you include the optional string str, operate on that string instead.

gsub(regex, repl [, str])

Same as sub(), but replaces all matching substrings.

split(str, arr [, del ])

Splits up the string str into the array arr, according to the field separator FS (spaces and tabs by default).
If you include the optional string del, use that as the field delimiter characters.

strtonum(str)

Return the numeric value of the string str. Works with decimal, octal, and hexadecimal values.

USER-DEFINED FUNCTIONS

You can define your own functions to add new functionality, or to make frequently-used code easier to reference.
Define a function using the function keyword:

function name(parameters) {
    statements
}
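
A few of these functions in action, as a quick sketch. The sample strings and the shout() helper are invented for illustration; everything runs inside a BEGIN rule, so no input file is needed.

```shell
out=$(awk '
function shout(s) { return toupper(s) "!" }   # a user-defined function
BEGIN {
    s = "hello world"
    print index(s, "world")           # 7
    print substr(s, 1, 5)             # hello
    print toupper(substr(s, 7))       # WORLD
    t = s; n = gsub(/o/, "0", t)      # replace every o in t, count the changes
    print n, t                        # 2 hell0 w0rld
    n = split("a:b:c", parts, ":")    # split on colons into parts[1..3]
    print n, parts[3]                 # 3 c
    print sprintf("%05.1f", 3.14159)  # 003.1
    print shout("done")               # DONE!
}')
echo "$out"
```

Note that gsub() changes its third argument in place and returns the number of replacements, which is why the sketch copies s into t first.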
