Perl Oneliners v1p0
Preface 5
Prerequisites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Conventions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Feedback and Errata . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Author info . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
License . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Book version . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
One-liner introduction 7
Why use Perl for one-liners? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Installation and Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Command line options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Executing Perl code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Substitution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Field processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
BEGIN and END . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
ENV hash . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Executing external commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Line processing 17
Regexp based filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Extracting matched portions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Transliteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Conditional substitution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Multiple conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
next . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
exit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Line number based processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Range operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Working with fixed strings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Field separators 32
Default field separation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Input field separator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Output field separator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Changing number of fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Defining field contents instead of using split . . . . . . . . . . . . . . . . . . . . . . 37
Fixed width processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
Assorted field processing functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Record separators 46
Input record separator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Single character separator with -0 option . . . . . . . . . . . . . . . . . . . . . . . . 47
NUL separator and slurping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Paragraph mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Output record separator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Using modules 54
Standard modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Third party modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
CSV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
JSON . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Convert one-liners to pretty formatted scripts . . . . . . . . . . . . . . . . . . . . . . 58
Modules to explore . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Preface
This book focuses on Perl usage from the command line, similar to grep, sed and awk
usage. Syntax and features of these tools (along with languages like C and bash) were
inspirations for Perl, so prior experience with them would make it easier to learn Perl.
You’ll learn about various command line options and Perl features that make it possible to
write compact cli scripts. Learning to use Perl from the command line will also allow you to
construct solutions where Perl is just another tool in the shell ecosystem.
Prerequisites
You should be comfortable with programming basics and have prior experience working with
Perl. You should know concepts like scalar, array, hash and special variables, be familiar with
control structures, regular expressions etc. If you need resources to get started with Perl and
regular expressions, you can start with these links:
• perldoc: perlintro
• learnxinyminutes: perl
• perldoc: perlretut
You should also have prior experience working with command line, bash shell and be familiar
with concepts like file redirection, command pipeline and so on.
Conventions
• The examples presented here have been tested with Perl version 5.32.0 and include
features not available in earlier versions.
• Code snippets shown are copied from the bash shell and modified for presentation pur-
poses. Some commands are preceded by comments to provide context and explanations.
Blank lines have been added to improve readability, only real time is shown for speed
comparisons, and so on.
• Unless otherwise noted, all examples and explanations are meant for ASCII characters
∘ See also stackoverflow: why does modern perl avoid utf-8 by default
• External links are provided for further reading throughout the book. It is not necessary
to visit them immediately. They have been chosen with care and would help, especially
during re-reads.
• The learn_perl_oneliners repo has all the code snippets and files used in examples and ex-
ercises and other details related to the book. If you are not familiar with git command,
click the Code button on the webpage to get the files.
Acknowledgements
• pngquant and svgcleaner for optimizing images
• Warning and Info icons by Amada44 under public domain
• softwareengineering.stackexchange and skolakoda for programming quotes
A heartfelt thanks to all my readers. Your valuable support has significantly eased my financial
concerns and allows me to continue writing books.
Feedback and Errata

I would highly appreciate it if you'd let me know how you felt about this book; it would help
improve this book as well as my future attempts. Also, please do let me know if you spot any
errors or typos.
E-mail: [email protected]
Twitter: https://twitter.com/learn_byexample
Author info
Sundeep Agarwal is a freelance trainer, author and mentor. His previous experience includes
working as a Design Engineer at Analog Devices for more than 5 years. You can find his other
works, primarily focused on Linux command line, text processing, scripting languages and
curated lists, at https://github.com/learnbyexample. He has also been a technical reviewer for
Command Line Fundamentals book and video course published by Packt.
License
Book version
1.0
One-liner introduction
This chapter will give an overview of perl syntax for command line usage and some examples
to show what kind of problems are typically suited for one-liners.
I assume you are already familiar with use cases where command line is more productive
compared to GUI. See also this series of articles titled Unix as IDE.
A shell utility like bash provides built-in commands and scripting features to easily solve and
automate various tasks. External *nix commands like grep, sed, awk, sort, find,
parallel, etc. can be combined to work with each other. Depending upon your familiarity
with those tools, you can either use perl as a single replacement or complement them for
specific use cases.
See also unix.stackexchange: when to use grep, sed, awk, perl, etc
Installation and Documentation

If you are on a Unix like system, you most likely already have some version of Perl
installed. See cpan: Perl Source for instructions to install the latest perl version from
source. perl v5.32.0 is used for all the examples shown in this book.
You can use perldoc command to access documentation from the command line. You can visit
https://perldoc.perl.org/ if you wish to read it online, which also has a handy search feature.
Here are some useful links to get started:
• perldoc: overview
• perldoc: perlintro
• perldoc: faqs
Command line options

perl -h gives the list of all command line options, along with a brief description. See perldoc:
perlrun for documentation on these command switches.

Option  Description
-e      execute the given Perl code
-E      like -e, but also enables all optional features
-n      loop over input line by line without printing, like while(<>){ code }
-p      like -n, but also prints $_ after each loop iteration
-a      autosplit $_ into the @F array (used with -n or -p)
-l      chomp the input record separator and append it back on print

This chapter will show examples with the -e, -l, -n, -p and -a options. Some more
options will be covered in later chapters, but not all of them are discussed in this book.
Executing Perl code
If you want to execute a perl program file, one way is to pass the filename as argument to
the perl command.
$ echo 'print "Hello Perl\n"' > hello.pl
$ perl hello.pl
Hello Perl
For short programs, you can also directly pass the code as an argument to the -e or -E
options. See perldoc: feature for details about the features enabled by the -E option.
$ perl -e 'print "Hello Perl\n"'
Hello Perl
Filtering
perl one-liners can be used for filtering lines matched by a regexp, similar to grep, sed
and awk. And similar to many command line utilities, perl can accept input from both
stdin and file arguments.
$ # sample stdin data
$ printf 'gate\napple\nwhat\nkite\n'
gate
apple
what
kite
By default, grep, sed and awk will automatically loop over input content line by line (with
\n as the line distinguishing character). The -n or -p option will enable this feature for
perl. The O module section shows the code that Perl runs with these options.
As seen before, the -e option accepts code as command line argument. Many shortcuts are
available to reduce the amount of typing needed. In the above examples, a regular expression
(defined by the pattern between a pair of forward slashes) has been used to filter the input.
When the input string isn’t specified, the test is performed against special variable $_ , which
has the contents of the current input line here (the correct term would be input record, see
Record separators chapter). $_ is also the default argument for many functions like print
and say . To summarize:
See perldoc: match for help on m operator. See perldoc: special variables for
documentation on $_ , $& , etc.
The learn_perl_oneliners repo has all the files used in examples (like table.txt in
the above example).
Substitution
Use s operator for search and replace requirements. By default, this operates on $_ when
the input string isn’t provided. For these examples, -p option is used instead of -n option,
so that the value of $_ is automatically printed after processing each input line. See perldoc:
search and replace for documentation and examples.
$ # for each input line, change only first ':' to '-'
$ # same as: sed 's/:/-/' and awk '{sub(/:/, "-")} 1'
$ printf '1:2:3:4\na:b:c:d\n' | perl -pe 's/:/-/'
1-2:3:4
a-b:c:d
$ # for each input line, change all ':' to '-'
$ # same as: sed 's/:/-/g' and awk '{gsub(/:/, "-")} 1'
$ printf '1:2:3:4\na:b:c:d\n' | perl -pe 's/:/-/g'
1-2-3-4
a-b-c-d
The s operator modifies the input string it is acting upon if the pattern matches. In
addition, it will return the number of substitutions made if successful, otherwise it returns a
false value (empty string or 0). You can use the r flag to return the string after substitution
instead of modifying in-place. As mentioned before, this book assumes you are already
familiar with perl regular expressions. If not, see perldoc: perlretut to get started.
Field processing
Consider the sample input file shown below with fields separated by a single space character.
$ cat table.txt
brown bread mat hair 42
blue cake mug shirt -7
yellow banana window shoes 3.14
Here are some examples based on specific fields rather than the entire line. The -a option
will cause the input line to be split based on whitespace and the field contents can be accessed
using the @F special array variable. Leading and trailing whitespace will be stripped, so
there’s no possibility of empty fields. More details are discussed in the Default field separation
section.
$ # print the second field of each input line
$ # same as: awk '{print $2}' table.txt
$ perl -lane 'print $F[1]' table.txt
bread
cake
banana
See Output field separator section for details on using array variable inside double quotes.
BEGIN and END
You can use a BEGIN{} block when you need to execute something before the input is read and
an END{} block to execute something after all of the input has been processed.
$ # same as: awk 'BEGIN{print "---"} 1; END{print "%%%"}'
$ seq 4 | perl -pE 'BEGIN{say "---"} END{say "%%%"}'
---
1
2
3
4
%%%
ENV hash
When it comes to automation and scripting, you’d often need to construct commands that can
accept input from the user, a file, the output of a shell command, etc. As mentioned before, this
book assumes bash as the shell being used. To access environment variables of the shell, you
can use the special hash variable %ENV with the name of the environment variable as a string
key.
Quotes won’t be used around hash keys in this book. See stackoverflow: are
quotes around hash keys a good practice in Perl? on possible issues if you don’t quote
the hash keys.
$ # assume 'r' is a shell variable that has to be passed to the perl command
$ r='\Bpar\B'
$ rgx="$r" perl -ne 'print if /$ENV{rgx}/' word_anchors.txt
apparent effort
two spare computers
You can also make use of the -s option to assign a perl variable.
$ r='\Bpar\B'
$ perl -sne 'print if /$rgx/' -- -rgx="$r" word_anchors.txt
apparent effort
two spare computers
As an example, see my repo ch: command help for a practical shell script, where
commands are constructed dynamically.
Executing external commands

You can execute external commands using the system function. See perldoc: system for
documentation and details like how the string/list argument is processed before it is executed.
$ perl -e 'system("echo Hello World")'
Hello World
The return value of system or the special variable $? can be used to act upon the exit
status of the command issued. As per the documentation:

The return value is the exit status of the program as returned by the wait call. To
get the actual exit value, shift right by eight
To save the result of an external command, use backticks or the qx operator. See perldoc: qx
for documentation and details like separating out STDOUT and STDERR.
$ perl -e '$words = `wc -w <word_anchors.txt`; print $words'
12
Summary
This chapter introduced some of the common options for perl cli usage, along with typical
cli text processing examples. While special-purpose cli tools like grep, sed and awk
are usually faster, perl has a much more extensive standard library and ecosystem. And
you do not have to learn a lot if you are already comfortable with perl but not familiar with
those cli tools. The next section has a few exercises for you to practice the cli options and text
processing use cases.
Exercises
All the exercises are also collated together in one place at Exercises.md. To see the
solutions, visit Exercise_solutions.md.
b) For the input file ip.txt , display first field of lines not containing y . Consider space as
the field separator for this file.
##### add your solution here
Hello
This
12345
c) For the input file ip.txt , display all lines containing no more than 2 fields.
##### add your solution here
Hello World
12345
d) For the input file ip.txt , display all lines containing is in the second field.
##### add your solution here
Today is sunny
e) For each line of the input file ip.txt , replace first occurrence of o with 0 .
##### add your solution here
Hell0 World
H0w are you
This game is g0od
T0day is sunny
12345
Y0u are funny
f) For the input file table.txt , calculate and display the product of numbers in the last field
of each line. Consider space as the field separator for this file.
$ cat table.txt
brown bread mat hair 42
blue cake mug shirt -7
yellow banana window shoes 3.14
g) Append . to all the input lines for the given stdin data.
$ printf 'last\nappend\nstop\n' | ##### add your solution here
last.
append.
stop.
h) Use contents of s variable to display all matching lines from the input file ip.txt .
Assume that s doesn’t have any regexp metacharacters. Construct the solution such that
there’s at least one word character immediately preceding the contents of s variable.
$ s='is'
i) Use system to display contents of filename present in second field (space separated) of
the given input line.
$ s='report.log ip.txt sorted.txt'
$ echo "$s" | ##### add your solution here
Hello World
How are you
This game is good
Today is sunny
12345
You are funny
$ s='power.txt table.txt'
$ echo "$s" | ##### add your solution here
brown bread mat hair 42
blue cake mug shirt -7
yellow banana window shoes 3.14
Line processing
Now that you are familiar with basic perl cli usage, this chapter will dive deeper into line pro-
cessing examples. You’ll learn various ways for matching lines based on regular expressions,
fixed string matching, line numbers, etc. You’ll also see how to group multiple statements and
learn about control flow keywords next and exit .
Regexp based filtering

If required, you can also use different delimiters for the match operator. Quoting from
perldoc: match:
If / is the delimiter then the initial m is optional. With the m you can use any pair of
non-whitespace (ASCII) characters as delimiters. This is particularly useful for matching
path names that contain / , to avoid LTS (leaning toothpick syndrome). If ? is the
delimiter, then a match-only-once rule applies, described in m?PATTERN? below. If '
(single quote) is the delimiter, no variable interpolation is performed on the PATTERN.
When using a delimiter character valid in an identifier, whitespace is required after the
m . PATTERN may contain variables, which will be interpolated every time the pattern
search is evaluated, except for when the delimiter is a single quote.
$ cat paths.txt
/foo/a/report.log
/foo/y/power.log
/foo/abc/errors.log
Extracting matched portions

You can use regexp related special variables to extract only the matching portions instead of
filtering the entire matching line. Consider this input file.
$ cat programming_quotes.txt
Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it by Brian W. Kernighan
A language that does not affect the way you think about programming,
is not worth knowing by Alan Perlis
$ # sometimes capture groups are enough, you don't need special variables
$ # @{^CAPTURE} isn't needed here, as it is assumed that every line has a match
$ perl -nE 'say /^(\w+ ).*?(\d+)$/' table.txt
brown 42
blue 7
yellow 14
$ # or add a custom separator
$ perl -nE 'say join ":", /^(\w+).*?(\d+)$/' table.txt
brown:42
blue:7
yellow:14
Transliteration
The transliteration operator tr (or y) allows you to specify a per-character transformation
rule. See perldoc: tr for documentation.
$ # rot13
$ echo 'Uryyb Jbeyq' | perl -pe 'tr/a-zA-Z/n-za-mN-ZA-M/'
Hello World
$ # use 'd' option to delete specified characters
$ echo 'foo:123:baz' | perl -pe 'tr/0-9\n//cd'
123
Similar to the s operator, tr will return the number of changes made. Use the r option
to prevent in-place modification and return the transliterated string instead.
$ # match lines containing 'b' 2 times
$ perl -ne 'print if tr/b// == 2' table.txt
brown bread mat hair 42
Conditional substitution
These examples combine line filtering and substitution in different ways. As noted before, the
s operator will modify the input string and the return value can be used to know how many
substitutions were made. Use the r flag to prevent in-place modification and get the string
output after substitution, if any.
$ # change commas to hyphens if the input line does NOT contain '2'
$ # prints all input lines even if substitution fails
$ printf '1,2,3,4\na,b,c,d\n' | perl -pe 's/,/-/g if !/2/'
1,2,3,4
a-b-c-d
Multiple conditions
It is good to remember that Perl is a programming language. You have control structures and
you can combine multiple conditions using logical operators. You don’t have to create a single
complex regexp.
$ perl -ne 'print if /not/ && !/it/' programming_quotes.txt
A language that does not affect the way you think about programming,
is not worth knowing by Alan Perlis
next
When next is executed, the rest of the code will be skipped and the next input line will be
fetched for processing. It doesn’t affect the BEGIN or END blocks as they are outside the
file content loop.
$ perl -nE 'if(/\bpar/){print "%% $_"; next}
say /s/ ? "X" : "Y"' word_anchors.txt
%% sub par
X
Y
X
%% cart part tart mart
Note that {} is used in the above example to group multiple statements to be executed for
a single if condition. You’ll see many more examples with next in coming chapters.
exit
The exit function is useful to avoid processing unnecessary input content when a termination
condition is reached. See perldoc: exit for documentation.
$ # quits after an input line containing 'you' is found
$ perl -ne 'print; exit if /you/' programming_quotes.txt
Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
Use tac to get all lines starting from the last occurrence of the search string with respect to
the entire file content.
$ tac programming_quotes.txt | perl -ne 'print; exit if /not/' | tac
is not worth knowing by Alan Perlis
You can optionally provide a status code as an argument to the exit function.
$ printf 'sea\neat\ndrop\n' | perl -ne 'print; exit(2) if /at/'
sea
eat
$ echo $?
2
Any code in the END block will still be executed before exiting. This doesn’t apply if exit
was called from the BEGIN block.
$ perl -pE 'exit if /cake/' table.txt
brown bread mat hair 42
$ perl -pE 'exit if /cake/; END{say "bye"}' table.txt
brown bread mat hair 42
bye
$ perl -pE 'BEGIN{say "hi"; exit; say "hello"} END{say "bye"}' table.txt
hi
Be careful if you want to use exit with multiple input files, as perl will stop
even if there are other files remaining to be processed.
Line number based processing

Line numbers can also be specified as a matching criterion using the $. special variable.
$ # print only the 3rd line
$ perl -ne 'print if $. == 3' programming_quotes.txt
by definition, not smart enough to debug it by Brian W. Kernighan
$ # print from particular line number to the end of input
$ seq 14 25 | perl -ne 'print if $. >= 10'
23
24
25
Use eof function to check for end of file condition. See perldoc: eof for documentation.
$ # same as: tail -n1 programming_quotes.txt
$ perl -ne 'print if eof' programming_quotes.txt
naming things, and off-by-1 errors by Leon Bambrick
For large input files, use exit to avoid processing unnecessary input lines.
$ seq 3542 4623452 | perl -ne 'if($. == 2452){print; exit}'
5993
$ seq 3542 4623452 | perl -ne 'print if $. == 250; if($. == 2452){print; exit}'
3791
5993
Range operator
You can use the range operator to select between a pair of matching conditions like line
numbers and regexps. See perldoc: range for documentation.
$ # the range is automatically compared against $. in this context
$ seq 14 25 | perl -ne 'print if 3..5'
16
17
18
by definition, not smart enough to debug it by Brian W. Kernighan
There are 2 hard problems in computer science: cache invalidation,
naming things, and off-by-1 errors by Leon Bambrick
See Records bounded by distinct markers section for an alternate, flexible solution.
Both conditions can match the same line too! Also, if the second condition doesn’t
match, lines starting from the first condition to the last line of the input will be matched.
Working with fixed strings

You can surround a regexp pattern with \Q and \E to match it as a fixed string, similar to
the grep -F option. \E can be left out if there’s no further pattern to be specified. If you
want to filter a line based on a fixed string alone, you can also use the index function. See
perldoc: quotemeta and perldoc: index for documentation.
$ # no match, since [] are character class metacharacters
$ echo 'int a[5]' | perl -ne 'print if /a[5]/'
$ echo 'int a[5]' | perl -ne 'print if /\Qa[5]/'
int a[5]
$ echo 'int a[5]' | perl -pe 's/\Qa[5]/b[12]/'
int b[12]
The above index example uses double quotes for the string argument, which allows escape
sequences like \t, \n, etc. and interpolation. This isn’t the case with single quoted string
values. Using single quotes within the script from the command line requires messing with
shell metacharacters. So, use the q operator instead or pass the fixed string to be matched
as an environment variable.
$ # double quotes allow escape sequences and interpolation
$ perl -E '$a=5; say "value of a:\t$a"'
value of a: 5
You can use the return value of the index function to restrict the matching to the start or
end of the input line. The line content in the $_ variable contains the \n line ending character
as well. You can either use the chomp function explicitly or use the -l command line option,
which will be discussed in detail in the Record separators chapter. For now, it is enough to know
that -l will remove the line ending from $_ and add it back when print is used.
$ cat eqns.txt
a=b,a-b=c,c*d
a+b,pi=3.14,5e12
i*(t+9-g)/8,4-a+b
$ # start of line
$ s='a+b' perl -ne 'print if index($_, $ENV{s})==0' eqns.txt
a+b,pi=3.14,5e12
$ # end of line
$ # same as: s='a+b' perl -ne 'print if /\Q$ENV{s}\E$/' eqns.txt
$ # length function returns number of characters, by default acts on $_
$ # -l option is needed here to remove \n from $_
$ s='a+b' perl -lne '$pos = length() - length($ENV{s});
print if index($_, $ENV{s}) == $pos' eqns.txt
i*(t+9-g)/8,4-a+b
Here are some more examples using the return value of the index function.
$ # since 'index' returns '-1' if there's no match,
$ # you need to add >=0 check as well for < or <= comparison
$ perl -ne '$i = index($_, "="); print if $i>=0 && $i<=5' eqns.txt
a=b,a-b=c,c*d
If you need to match entire input line or field, you can use string comparison operators.
$ printf 'a.b\na+b\n' | perl -lne 'print if /^a.b$/'
a.b
a+b
$ printf 'a.b\na+b\n' | perl -lne 'print if $_ eq q/a.b/'
a.b
To provide a fixed string in the replacement section, an environment variable comes in handy
again. Or use the q operator for directly providing the value, but you may have to work
around the delimiters being used.
$ # characters like $ and @ are special in replacement section
$ echo 'x+y' | perl -pe 's/\Qx+y/$x+@y/'
+
$ # or, use 'e' flag to provide single quoted value as Perl code
$ echo 'x+y' | perl -pe 's/\Qx+y/q($x+@y)/e'
$x+@y
Summary
This chapter showed various examples of processing only lines of interest instead of entire
input file. Filtering can be specified using a regexp, fixed string, line number or a combination
of them. next and exit are useful to change the flow of code.
Exercises
b) Display only fourth, fifth, sixth and seventh lines for the given input.
$ seq 65 78 | ##### add your solution here
68
69
70
71
c) For the input file ip.txt , replace all occurrences of are with are not and is with
is not only from line number 4 till end of file. Also, only the lines that were changed should
be displayed in the output.
$ cat ip.txt
Hello World
How are you
This game is good
Today is sunny
12345
You are funny
d) For the given stdin , display only the first three lines. Avoid processing lines that are not
relevant.
$ seq 14 25 | ##### add your solution here
14
15
16
e) For the input file ip.txt , display all lines from start of the file till the first occurrence of
game .
##### add your solution here
Hello World
How are you
This game is good
f) For the input file ip.txt , display all lines that contain is but not good .
##### add your solution here
Today is sunny
g) For the input file ip.txt , extract the word before the whole word is as well as the
word after it. If such a match is found, display the two words around is in reversed order.
For example, hi;1 is--234 bye should be converted to 234:1 . Assume that whole word
is will not be present more than once in a single line.
##### add your solution here
good:game
sunny:Today
h) For the given input string, replace 0xA0 with 0x7F and 0xC0 with 0x1F .
$ s='start address: 0xA0, func1 address: 0xC0'
i) Find the starting index of first occurrence of is or the or was or to for each input
line of the file idx.txt . Assume all input lines will match at least one of these terms.
$ cat idx.txt
match after the last newline character
and then you want to test
this is good bye then
you were there to see?
j) Display all lines containing [4]* for the given stdin data.
$ printf '2.3/[4]*6\n2[4]5\n5.3-[4]*9\n' | ##### add your solution here
2.3/[4]*6
5.3-[4]*9
k) For the given input string, replace all lowercase alphabets to x only for words starting
with m .
$ s='ma2T3a a2p kite e2e3m meet'
$ echo "$s" | ##### add your solution here
xx2T3x a2p kite e2e3m xxxx
l) For the input file ip.txt , delete all characters other than lowercase vowels and newline
character. Perform this transformation only between a line containing you up to line number
4 (inclusive).
##### add your solution here
Hello World
oaeou
iaeioo
oaiu
12345
You are funny
In-place file editing
In the examples presented so far, the output from perl was displayed on the terminal or
redirected to another file. This chapter will discuss how to write back the changes to the
input file(s) itself using the -i command line option. This option can be configured to make
changes to the input file(s) with or without creating a backup of original contents. When
backups are needed, the original filename can get a prefix or a suffix or both. And the backups
can be placed in the same directory or some other directory as needed.
With backup
You can use the -i option to write back the changes to the input file instead of displaying
the output on terminal. When an extension is provided as an argument to -i , the original
contents of the input file gets preserved as per the extension given. For example, if the input
file is ip.txt and -i.orig is used, ip.txt.orig will be the backup filename.
$ cat colors.txt
deep blue
light orange
blue delight
Multiple input files are treated individually and the changes are written back to respective
files.
$ cat t1.txt
have a nice day
bad morning
what a pleasant evening
$ cat t2.txt
worse than ever
too bad
t1.txt t1.txt.bkp t2.txt t2.txt.bkp
$ cat t1.txt
have a nice day
good morning
what a pleasant evening
$ cat t2.txt
worse than ever
too good
Without backup
Sometimes backups are not desirable. Using the -i option on its own will not create backups.
Be careful though, as changes made cannot be undone. In such cases, test the command with
sample input before using the -i option on the actual file. You could also use the option with
a backup, compare the differences with a diff program and then delete the backup.
$ cat fruits.txt
banana
papaya
mango
A * character in the argument to -i option is special. It will get replaced with the input
filename. This is helpful if you need to use a prefix instead of suffix for the backup filename.
Or any other combination that may be needed.
$ ls *colors.txt*
colors.txt colors.txt.bkp
The * trick can also be used to place the backups in another directory instead of the parent
directory of input files. The backup directory should already exist for this to work.
$ mkdir backups
$ perl -i'backups/*' -pe 's/good/nice/' t1.txt t2.txt
$ ls backups/
t1.txt t2.txt
Summary
This chapter discussed the -i option, which is useful when you need to edit a file in-place.
This is particularly useful in automation scripts. But do ensure that you have tested the
perl command before applying it to the actual files if you need to use this option without
creating backups.
Exercises
a) For the input file text.txt , replace all occurrences of in with an and write back the
changes to text.txt itself. The original contents should get saved to text.txt.orig
$ cat text.txt
can ran want plant
tin fin fit mine line
$ cat text.txt
can ran want plant
tan fan fit mane lane
$ cat text.txt.orig
can ran want plant
tin fin fit mine line
b) For the input file text.txt , replace all occurrences of an with in and write back the
changes to text.txt itself. Do not create backups for this exercise. Note that you should
have solved the previous exercise before starting this one.
$ cat text.txt
can ran want plant
tan fan fit mane lane
$ cat text.txt
cin rin wint plint
tin fin fit mine line
$ diff text.txt text.txt.orig
1c1
< cin rin wint plint
---
> can ran want plant
c) For the input file copyright.txt , replace copyright: 2018 with copyright: 2020 and
write back the changes to copyright.txt itself. The original contents should get saved to
2018_copyright.txt.bkp
$ cat copyright.txt
bla bla 2015 bla
blah 2018 blah
bla bla bla
copyright: 2018
$ cat copyright.txt
bla bla 2015 bla
blah 2018 blah
bla bla bla
copyright: 2020
$ cat 2018_copyright.txt.bkp
bla bla 2015 bla
blah 2018 blah
bla bla bla
copyright: 2018
d) In the code sample shown below, two files are created by redirecting output of echo
command. Then a perl command is used to edit b1.txt in-place as well as create a
backup named bkp.b1.txt . Will the perl command work as expected? If not, why?
$ echo '2 apples' > b1.txt
$ echo '5 bananas' > -ibkp.txt
$ perl -ibkp.* -pe 's/2/two/' b1.txt
Field separators
This chapter will dive deep into field processing. You’ll learn how to set input and output field
separators, how to use regexps for defining fields and how to work with fixed length fields.
By default, the -a option splits based on a sequence of one or more whitespace characters.
In addition, whitespace at the start or end of input gets trimmed and won’t be part of the
field contents. Using -a is equivalent to @F = split . From perldoc: split:
split emulates the default behavior of the command line tool awk when the PATTERN
is either omitted or a string composed of a single space character (such as ' ' or
"\x20" , but not e.g. / / ). In this case, any leading whitespace in EXPR is removed
before splitting occurs, and the PATTERN is instead treated as if it were /\s+/ ; in
particular, this means that any contiguous whitespace (not just a single space character)
is used as a separator. However, this special treatment can be avoided by specifying the
pattern / / instead of the string " " , thereby allowing only a single space character
to be a separator.
You can use the -F command line option to specify a custom regexp field separator. Note
that the -a option implicitly sets -n , and the -F option implicitly sets -n and -a , on
newer versions of Perl. However, this book will always explicitly use these options.
$ # use ':' as input field separator
$ echo 'goal:amazing:whistle:kwality' | perl -F: -anE 'say "$F[0]\n$F[2]"'
goal
whistle
$ # use quotes to avoid clashes with shell special characters
$ echo 'one;two;three;four' | perl -F';' -anE 'say $F[2]'
three
You can also specify the regexp to -F option inside // delimiters as well as add LIMIT
argument if needed.
$ # count number of vowels for each input line
$ # can also use: -F'(?i)[aeiou]'
$ printf 'COOL\nnice car\n' | perl -F'/[aeiou]/i' -anE 'say $#F'
2
3
$ # note that newline character is present as part of the last field content
$ echo 'goal:amazing:whistle:kwality' | perl -F'/:/,$_,2' -ane 'print $F[1]'
amazing:whistle:kwality
To get individual characters, you can use empty argument for the -F option.
$ echo 'apple' | perl -F -anE 'say $F[0]'
a
For more information about using perl with different encodings, see:
• perldoc: -C option
• unix.stackexchange: tr with unicode characters
• stackoverflow: Why does modern Perl avoid UTF-8 by default?
If the custom field separator with -F option doesn’t affect the newline character,
then the last element can contain the newline character.
$ # last element will have newline character since field separator is ':'
$ echo 'cat:dog' | perl -F: -anE 'say "[$F[-1]]"'
[dog
]
$ # unless the input itself doesn't have newline character
$ printf 'cat:dog' | perl -F: -anE 'say "[$F[-1]]"'
[dog]
The newline character can also show up as the content of last field.
$ # both leading and trailing whitespaces are trimmed
$ echo ' a b c ' | perl -anE 'say $#F'
2
As mentioned before, the -l option is helpful if you wish to remove the newline character
(more details will be discussed in Record separators chapter). A side effect of removing the
newline character before applying split is that a trailing empty field will also get removed
(you can explicitly call split function with -1 as limit to prevent this).
$ # -l will remove the newline character
$ # -l will also cause 'print' to append the newline character
$ echo 'cat:dog' | perl -F: -lane 'print "[$F[-1]]"'
[dog]
As per perldoc: -F option, ”You can’t use literal whitespace or NUL characters in
the pattern.” Here are some examples.
$ # Error!!
$ echo 'pick eat rest laugh' | perl -F't[ ]' -lane 'print $F[1]'
Unmatched [ in regex; marked by <-- HERE in m/t[ <-- HERE /.
$ # no issues if 'split' is used explicitly
$ echo 'pick eat rest laugh' | perl -lne 'print((split /t[ ]/)[1])'
res
There are a few ways to affect the separator to be used while displaying multiple values.
Method 1: The value of $, special variable is used as the separator when multiple arguments
(or list/array) are passed to print and say functions. $, could be remembered easily
by noting that , is used to separate multiple arguments. Note that -l option is used in
the examples below as a good practice even when not needed.
See perldoc: perlvar for alternate names of special variables if you use metacpan:
English module. For example, $OFS or $OUTPUT_FIELD_SEPARATOR instead of $,
$ s='Sample123string42with777numbers'
$ echo "$s" | perl -F'\d+' -lane 'BEGIN{$,=","} print @F'
Sample,string,with,numbers
Method 2: The join function returns a string created by joining the given list of values with
the specified separator string.
$ s='goal:amazing:whistle:kwality'
$ echo "$s" | perl -F: -lane 'print join "-", @F[-1, 1, 0]'
kwality-amazing-goal
$ echo "$s" | perl -F: -lane 'print join "::", @F, 42'
goal::amazing::whistle::kwality::42
Method 3: You can also manually build the output string within double quotes. Or use $" to
specify the field separator for an array value within double quotes. $" could be remembered
easily by noting that interpolation happens within double quotes.
$ s='goal:amazing:whistle:kwality'
$ echo "$s" | perl -F: -lane 'BEGIN{$"="-"} print "msg: @F[-1, 1, 0]"'
msg: kwality-amazing-goal
$ # reducing fields
$ echo "$s" | perl -F: -lane '$#F=1; print join ",", @F'
goal,amazing
$ # increasing fields
$ echo "$s" | perl -F: -lane '$F[$#F+1]="sea"; print join ":", @F'
goal:amazing:whistle:kwality:sea
$ # adds a new grade column based on marks in 3rd column
$ perl -anE 'BEGIN{$,="\t"; @g = qw(D C B A S)}
say @F, $.==1 ? "Grade" : $g[$F[-1]/10 - 5]' marks.txt
Dept Name Marks Grade
ECE Raj 53 D
ECE Joel 72 B
EEE Moi 68 C
CSE Surya 81 A
EEE Tia 59 D
ECE Om 92 S
CSE Amy 67 C
The -F option uses the split function to get field values from the input content. In contrast,
using /regexp/g allows you to define what the fields should be made up of.
$ s='Sample123string42with777numbers'
Here are some examples that display results only if there’s a match. Without the if conditions,
you’ll get empty lines for non-matching lines. Quoting from perldoc: The empty pattern //
If the PATTERN evaluates to the empty string, the last successfully matched regular ex-
pression is used instead. In this case, only the g and c flags on the empty pattern are
honored; the other flags are taken from the original pattern. If no match has previously
succeeded, this will (silently) act instead as a genuine empty pattern (which will always
match).
$ # /\bb\w*\b/ will come into play only if a word starting with 'h' isn't found
$ # so, first line matches 'hair' but not 'brown' or 'bread'
$ # other lines don't have words starting with 'h'
$ perl -nE 'say join "\n", //g if /\bh\w*\b/ || /\bb\w*\b/' table.txt
hair
blue
banana
A simple split fails for csv input where fields can contain embedded delimiter characters.
For example, a field with the content "fox,42" when , is the delimiter.
$ s='eagle,"fox,42",bee,frog'
While metacpan: Text::CSV module should be preferred for robust csv parsing, regexp is
enough for simple formats.
$ echo "$s" | perl -lne 'print((/"[^"]+"|[^,]+/g)[1])'
"fox,42"
The unpack function is more than just a different way of string slicing. It supports various
formats and pre-processing; see perldoc: unpack, perldoc: pack and perldoc: perlpacktut for
details.
In the example below, a indicates an arbitrary binary string. The optional number that follows
indicates the length of the field.
$ cat items.txt
apple   fig banana
50      10  200
Using * will cause the remaining characters of that particular format to be consumed. Here,
Z is used to process ASCII NUL separated strings.
$ printf 'banana\x0050\x00' | perl -nE 'say join ":", unpack "Z*Z*"'
banana:50
$ # first field is 5 characters, then 3 characters are ignored
$ # all the remaining characters are assigned to second field
$ perl -lne 'print join ",", unpack "a5x3a*"' items.txt
apple,fig banana
50   ,10  200
Unpacking isn’t always needed, string slicing using substr may suffice. See perldoc: substr
for documentation.
$ # same as: perl -F -anE 'say @F[2..4]'
$ echo 'b 123 good' | perl -nE 'say substr $_,2,3'
123
$ echo 'b 123 good' | perl -ne 'print substr $_,6'
good
Having seen command line options and features commonly used for field processing, this
section will highlight some of the built-in functions. There are just too many to meaningfully
cover in detail, so consider this to be just a brief overview. See also perldoc: Perl Functions
by Category.
First up, the grep function, which allows you to select fields based on a condition. In scalar
context, it returns the number of fields that matched the given condition. See perldoc: grep
for documentation. See also unix.stackexchange: create lists of words according to binary
numbers.
$ s='goal:amazing:42:whistle:kwality:3.14'
$ echo '20 711 -983 5 21' | perl -lane 'print join ":", grep {$_ > 20} @F'
711:21
$ # maximum of one field containing 'r'
$ perl -lane 'print if 1 >= grep {/r/} @F' table.txt
blue cake mug shirt -7
yellow banana window shoes 3.14
The map function transforms each element according to the logic passed to it. See perldoc:
map for documentation.
$ s='goal:amazing:42:whistle:kwality:3.14'
$ echo "$s" | perl -F: -lane 'print join ":", map {uc} @F'
GOAL:AMAZING:42:WHISTLE:KWALITY:3.14
$ echo "$s" | perl -F: -lane 'print join ":", map {/^[gw]/ ? uc : $_} @F'
GOAL:amazing:42:WHISTLE:kwality:3.14
$ echo '23 756 -983 5' | perl -lane 'print join ":", map {$_ ** 2} @F'
529:571536:966289:25
$ echo 'AaBbCc' | perl -F -lane 'print join " ", map {ord} @F'
65 97 66 98 67 99
$ # for in-place modification of the input array
$ echo 'AaBbCc' | perl -F -lane 'map {$_ = ord} @F; print "@F"'
65 97 66 98 67 99
$ echo 'a b c' | perl -lane 'print join ",", map {qq/"$_"/} @F'
"a","b","c"
$ s='hour hand heated water'
$ # with 'grep' alone, provided the transformation doesn't affect the condition
$ # also, @F will be changed here, above map+grep code will not affect @F
$ echo "$s" | perl -lane 'print join "\n", grep {y/ae/X/; /^h/} @F'
hour
hXnd
hXXtXd
Here are some examples with the sort and reverse functions for arrays and strings. See
perldoc: sort and perldoc: reverse for documentation.
$ # sorting numbers
$ echo '23 756 -983 5' | perl -lane 'print join " ", sort {$a <=> $b} @F'
-983 5 23 756
$ s='floor bat to dubious four'
$ # sort by length of the fields in ascending order
$ echo "$s" | perl -lane 'print join ":", sort {length($a) <=> length($b)} @F'
to:bat:four:floor:dubious
$ # descending order
$ echo "$s" | perl -lane 'print join ":", sort {length($b) <=> length($a)} @F'
dubious:floor:four:bat:to
$ # same as: perl -F -lane 'print sort {$b cmp $a} @F'
$ echo 'foobar' | perl -F -lane 'print reverse sort @F'
roofba
Here’s an example with multiple sorting conditions. If the transformation applied for each field
is expensive, using the Schwartzian transform can provide a faster result. See also stackoverflow:
multiple sorting conditions.
$ s='try a bad to good i teal by nice how'
Here’s an example for sorting in descending order based on header column names.
$ cat marks.txt
Dept Name Marks
ECE Raj 53
ECE Joel 72
EEE Moi 68
CSE Surya 81
EEE Tia 59
ECE Om 92
CSE Amy 67
See Using modules chapter for more field processing functions.
Summary
This chapter discussed various ways in which you can split (or define) the input into fields and
manipulate them. Many more examples will be discussed in later chapters.
Exercises
a) Extract only the contents between () or )( from each input line. Assume that ()
characters will be present only once per line.
$ cat brackets.txt
foo blah blah(ice) 123 xyz$
(almond-pista) choco
yo )yoyo( yo
b) For the input file scores.csv , extract Name and Physics fields in the format shown
below.
$ cat scores.csv
Name,Maths,Physics,Chemistry
Blue,67,46,99
Lin,78,83,80
Er,56,79,92
Cy,97,98,95
Ort,68,72,66
Ith,100,100,100
c) For the input file scores.csv , display names of those who’ve scored above 80 in Maths.
##### add your solution here
Cy
Ith
d) Display the number of word characters for the given inputs. The word definition here is the
same as used in regular expressions. Can you construct two different solutions as indicated below?
$ # solve using 's' operator
$ echo 'hi there' | ##### add your solution here
7
e) Construct a solution that works for both the given sample inputs and the corresponding
output shown.
$ s1='1 "grape" and "mango" and "guava"'
$ s2='("a 1""d""c-2""b")'
f) Display only the third and fifth characters from each input line.
$ printf 'restore\ncat one\ncricket' | ##### add your solution here
so
to
ik
g) Transform the given input file fw.txt to get the output as shown below. If second field is
empty (i.e. contains only space characters), replace it with NA .
$ cat fw.txt
1.3 rs 90 0.134563
3.8 6
5.2 ye 8.2387
4.2 kt 32 45.1
h) For the input file scores.csv , display the header as well as any row which contains b
or t (irrespective of case) in the first field.
##### add your solution here
Name,Maths,Physics,Chemistry
Blue,67,46,99
Ort,68,72,66
Ith,100,100,100
i) Extract all whole words that contain 42 but not at the edge of a word. Assume a word
cannot contain 42 more than once.
$ s='hi42bye nice1423 bad42 cool_42a 42fake'
$ echo "$s" | ##### add your solution here
hi42bye
nice1423
cool_42a
j) For the input file scores.csv , add another column named GP which is calculated out of
100 by giving 50% weightage to Maths and 25% each for Physics and Chemistry .
##### add your solution here
Name,Maths,Physics,Chemistry,GP
Blue,67,46,99,69.75
Lin,78,83,80,79.75
Er,56,79,92,70.75
Cy,97,98,95,96.75
Ort,68,72,66,68.5
Ith,100,100,100,100.0
k) For the input file mixed_fs.txt , retain only first two fields from each input line. The input
and output field separators should be space for first two lines and , for the rest of the lines.
$ cat mixed_fs.txt
rose lily jasmine tulip
pink blue white yellow
car,mat,ball,basket
light green,brown,black,purple
l) For the given space separated numbers, filter only numbers in the range 20 to 1000
(inclusive).
$ s='20 -983 5 756 634223'
m) For the given input file words.txt , filter all lines containing characters in ascending and
descending order.
$ cat words.txt
bot
art
are
boat
toe
flee
reed
$ # ascending order
##### add your solution here
bot
art
$ # descending order
##### add your solution here
toe
reed
n) For the given space separated words, extract the three longest words.
$ s='I bought two bananas and three mangoes'
p) Generate string combinations as shown below for the given input string passed as an envi-
ronment variable.
$ s='{x,y,z}{1,2,3}' ##### add your solution here
x1 x2 x3 y1 y2 y3 z1 z2 z3
Record separators
So far, you’ve seen examples where perl automatically splits input data line by line based
on the \n newline character. Just like you can control how those lines are further split into
fields using -a , -F options and other features, perl provides a way to control what
constitutes a line in the first place. In perl parlance, the term record is used to describe
the content that gets placed in the $_ special variable with the -n or -p options.
By default, newline character is used as the input record separator. You can change the $/
special variable to specify a different input record separator. Unlike field separators, you can
only use a string value, regexp isn’t allowed. See perldoc faq: I put a regular expression into
$/ but it didn’t work. What’s wrong? for workarounds.
$ # changing input record separator to comma
$ # note the content of second record, newline is just another character
$ # also note that by default record separator stays with the record contents
$ printf 'this,is\na,sample' | perl -nE 'BEGIN{$/ = ","} say "$.)$_"'
1)this,
2)is
a,
3)sample
Single character separator with -0 option
The -0 command line option can be used to specify a single character record separator,
represented with zero to three octal digits. You can also use hexadecimal value. Quoting from
perldoc: -0 option:
You can also specify the separator character using hexadecimal notation: -0xHHH...
, where the H are valid hexadecimal digits. Unlike the octal form, this one may be
used to specify any Unicode character, even those beyond 0xFF . So if you really want
a record separator of 0777 , specify it as -0x1FF . (This means that you cannot use
the -x option with a directory name that consists of hexadecimal digits, or else Perl
will think you have specified a hex number to -0 .)
$ s='this:is:a:sample:string'
The character that gets appended by print due to the -l option is based on
the value of the input record separator at that point. Here are some examples to clarify this
point.
$ s='this:is:a:sample:string'
Recall that default -a will split the input record based on whitespaces and remove
leading/trailing whitespaces. Now that you’ve seen how the input record separator can be
something other than newline, here’s an example to show the full effect of default record
splitting.
$ # ':' character is the input record separator here
$ s=' a\t\tb\n\t\n:1000\n\n\n\n123 7777:x y \n \n z '
$ printf '%b' "$s" | perl -0072 -lanE 'say join ",", @F'
a,b
1000,123,7777
x,y,z
If the -0 option is used without an argument, the ASCII NUL character will be considered
as the input record separator.
$ printf 'foo\0bar\0' | cat -v
foo^@bar^@
Any octal value of 400 and above will cause the entire input to be slurped as a single string.
Idiomatically, 777 is used. This is the same as setting $/ = undef . Slurping an entire file
makes it easier to solve some problems, but be careful not to use it for large files that might
not fit in available memory.
$ cat paths.txt
/foo/a/report.log
/foo/y/power.log
/foo/abc/errors.log
$ perl -0777 -pe 's|(?<!\A)/.+/|/|s' paths.txt
/foo/errors.log
Paragraph mode
As a special case, using -00 or setting $/ to empty string will invoke paragraph mode.
Two or more consecutive newline characters will act as the record separator. Consider the
programming_quotes.txt sample file, shown here again for convenience:
$ cat programming_quotes.txt
Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it by Brian W. Kernighan
A language that does not affect the way you think about programming,
is not worth knowing by Alan Perlis
If the paragraphs are separated by more than two consecutive newlines, the extra
newlines will not be part of the record content.
$ s='a\n\n\n\n\n\n\n\n12\n34\n\nhi\nhello\n'
12
34
Any leading newlines (only newlines, not other whitespace characters) in the input
data file will be trimmed and not lead to empty records. This is similar to how -a treats
whitespaces for default field separation.
$ s='\n\n\na\nb\n\n12\n34\n\nhi\nhello\n\n\n\n'
If you wish to avoid the extra empty line at the end of the output for paragraph mode (or similar
situations with other custom record separators), you can either post process the output to
remove the extra empty line or add some logic like shown below.
$ # single paragraph output, no empty line at the end
$ perl -l -00 -ne 'if(/code/){print $s, $_; $s="\n"}' programming_quotes.txt
Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it by Brian W. Kernighan
Similar to -0 option for input record separator, you can use -l option to specify a single
character output record separator by passing an octal value as argument.
$ # comma as output record separator, won't have a newline at the end
$ # note that -l also chomps input record separator
$ seq 8 | perl -l054 -ne 'print if /[24]/'
2,4,
$ # null separator
$ seq 8 | perl -l0 -ne 'print if /[24]/' | cat -v
2^@4^@
You can use $\ to specify a multicharacter string that gets appended by the print function.
This will override changes due to the -l option, if any.
$ # recall that input record separator isn't removed by default
$ seq 2 | perl -ne 'print'
1
2
$ # this will add four more characters after the already present newline
$ # same as: perl -pe 'BEGIN{$\ = "---\n"}'
$ seq 2 | perl -ne 'BEGIN{$\ = "---\n"} print'
1
---
2
---
Often, you need to change the output record separator depending upon the contents of the
input record or some other condition. The cond ? expr1 : expr2 ternary operator is often used
in such scenarios. The example below assumes that the input is evenly divisible; you’ll have to
add more logic if that is not the case.
$ # same as: perl -pe 's/\n/-/ if $. % 3'
$ seq 6 | perl -lpe '$\ = $. % 3 ? "-" : "\n"'
1-2-3
4-5-6
Summary
This chapter showed you how to change the way input content is split into records and how
to set the string to be appended when print is used. The paragraph mode is useful for
processing multiline records separated by one or more empty lines. You also learned how to
set ASCII NUL as the record separator and how to slurp the entire input as a single string.
Exercises
a) The input file jumbled.txt consists of words separated by various delimiters. Display all
words that contain an or at or in or it , one per line.
$ cat jumbled.txt
overcoats;furrowing-typeface%pewter##hobby
wavering:concession/woof\retailer
b) Emulate paste -sd, with perl.
$ # if there's only one line in input, again make sure there's no trailing ','
$ # and that there's a newline character at the end of the line
$ printf 'foo' | paste -sd,
foo
$ printf 'foo' | ##### add your solution here
foo
c) For the input file sample.txt , extract all paragraphs having words starting with do .
$ cat sample.txt
Hello World
Good day
How are you
Just do-it
Believe it
Today is sunny
Not a bit funny
No doubt you like it too
$ # note that there's no extra empty line at the end of expected output
##### add your solution here
Just do-it
Believe it
Today is sunny
Not a bit funny
No doubt you like it too
d) For the input file sample.txt , change all paragraphs into single line by joining lines using
. and a space character as the separator. And add a final . to each paragraph.
$ # note that there's no extra empty line at the end of expected output
##### add your solution here
Hello World.
e) For the given input, use ;; as record separators and : as field separators. Display all
records with second field having an integer greater than 50 .
$ s='mango:100;;apple:25;;grapes:75'
$ # note that the output has ;; at the end but not newline character
$ printf "$s" | ##### add your solution here
mango:100;;grapes:75;;
Using modules
There are many standard modules that come by default with a Perl installation. And there
are plenty of third-party modules available for a wide variety of use cases. This chapter will
discuss the -M command line option and show some examples with standard and third-party
modules. You’ll also see how to convert one-liners to full-fledged script files.
Standard modules
See perldoc: modules for complete list of built-in modules. Quoting from perldoc: -m and -M
options
-Mmodule executes use module ; before executing your program. This loads the
module and calls its import method, causing the module to have its default effect,
typically importing subroutines or giving effect to a pragma. You can use quotes to add
extra code after the module name, e.g., '-MMODULE qw(foo bar)' .
A little builtin syntactic sugar means you can also say -mMODULE=foo,bar or
-MMODULE=foo,bar as a shortcut for '-MMODULE qw(foo bar)' . This avoids the need to
use quotes when importing symbols. The actual code generated by -MMODULE=foo,bar
is use module split(/,/,q{foo,bar}) . Note that the = form removes the distinction
between -m and -M ; that is, -mMODULE=foo,bar is the same as -MMODULE=foo,bar
The List::Util module has handy functions for array processing. See perldoc: List::Util
for documentation. Here are some examples with max , product and sum0 .
$ # same as: perl -F, -anE 'BEGIN{use List::Util qw(max)} say max @F'
$ echo '34,17,6' | perl -MList::Util=max -F, -anE 'say max @F'
34
$ # 'sum0' returns '0' even if array is empty, whereas 'sum' returns 'undef'
$ echo '3.14,17,6' | perl -MList::Util=sum0 -F, -anE 'say sum0 @F'
26.14
$ s='3,b,a,3,c,d,1,d,c,2,2,2,3,1,b'
$ # note that the input order of elements is preserved
$ echo "$s" | perl -MList::Util=uniq -F, -lanE 'say join ",",uniq @F'
3,b,a,c,d,1,2
Here’s an example for base64 encoding and decoding. See perldoc: MIME::Base64 for
documentation.
$ echo 'hello world' | base64
aGVsbG8gd29ybGQK
If you are using the perl version that came installed with your OS, check if you can install
a module from your platform repository. Here’s an example for Ubuntu:
$ # search for Text::CSV module
$ apt-cache search perl text-csv
libspreadsheet-read-perl - reader for common spreadsheet formats
libtext-csv-encoded-perl - encoding-aware comma-separated values manipulator
libtext-csv-perl - comma-separated values manipulator (using XS or PurePerl)
libtext-csv-xs-perl - Perl C/XS module to process Comma-Separated Value files
The above process may fail to work with a perl version that you installed manually, or if a
particular module isn’t available from your platform repository. There are different options
for such cases.
• stackoverflow: easiest way to install a missing module shows how to use the cpan
command and has details for Windows platform too. You might need admin privileges.
• metacpan: cpanm is also often recommended
• metacpan: Carton is a Perl module dependency manager (aka Bundler for Perl)
CSV
For robustly parsing csv files, you can use metacpan: Text::CSV or metacpan: Text::CSV_XS
modules. _XS indicates a faster implementation, usually written in C language. The
Text::CSV module uses Text::CSV_XS by default and uses Text::CSV_PP (pure Perl
implementation) if _XS module isn’t available.
Here’s an example of parsing csv input with embedded comma characters. ARGV is a
special filehandle that iterates over filenames passed as command line arguments, see Multiple
file input chapter for more details.
$ s='eagle,"fox,42",bee,frog\n1,2,3,4'
$ # note that -n or -p option isn't used here
$ printf '%b' "$s" | perl -MText::CSV_XS -E 'say $row->[1]
while $row = Text::CSV_XS->new->getline(*ARGV)'
fox,42
2
Important Note: The default behavior is to accept only ASCII characters in the range
from 0x20 (space) to 0x7E (tilde). This means that the fields can not contain newlines.
If your data contains newlines embedded in fields, or characters above 0x7E (tilde),
or binary data, you must set binary => 1 in the call to new .
$ cat newline.csv
apple,"1
2
3",good
guava,"32
54",nice
JSON
Newer versions of Perl come with the perldoc: JSON::PP module, which is a pure Perl
implementation. Use metacpan: JSON::XS for faster results. There’s also metacpan: Cpanel::JSON::XS,
which mentions the following reason:
While it seems there are many JSON modules, none of them correctly handle all cor-
ner cases, and in most cases their maintainers are unresponsive, gone missing, or not
listening to bug reports for other reasons.
Here’s a simple example of parsing JSON from a single line of input data.
$ s='{"greeting":"hi","marks":[78,62,93]}'
For multiline input, use -0777 (or set $/ = undef ) to pass entire input content as single
string. You can create a shortcut to make it easier for one-liners.
$ # check if shortcut is available
$ type pj
bash: type: pj: not found
$ # add this to your ~/.bashrc (or the file you use for aliases/functions)
$ pj() { perl -MCpanel::JSON::XS -0777 -E '$ip=decode_json <>;'"$@" ; }
$ s='{"greeting":"hi","marks":[78,62,93]}'
$ # order may be different than input as hash doesn't maintain key order
$ # process top-level keys not containing 'e'
$ pj 'for (keys %$ip){say "$_:$ip->{$_}" if !/e/}' sample.json
physics:84
fruit:apple
Here’s an example of converting possibly minified json input to a pretty printed output. You
can use json_pp for JSON::PP and json_xs for JSON::XS .
$ s='{"greeting":"hi","marks":[78,62,93],"fruit":"apple"}'
The O module can be used to convert one-liners to full-fledged programs. See perldoc: O for
documentation. This is similar to the -o option of GNU awk .
continue {
die "-p destination: $!\n" unless print $_;
}
-e syntax OK
Here’s an alternate way to specify code to be executed after the while loop, instead of using
an END block, when the -n option is being used. This cannot be used with the -p option
because it will disrupt the continue block.
$ perl -MO=Deparse -ne 'print if /4/ }{ print "==> the end\n"'
LINE: while (defined($_ = readline ARGV)) {
print $_ if /4/;
}
{
print "==> the end\n";
}
-e syntax OK
Here’s an example of saving the script to a file instead of displaying on the terminal.
$ perl -MO=Deparse -ne 'print if /4/' > script.pl
-e syntax OK
$ cat script.pl
LINE: while (defined($_ = readline ARGV)) {
print $_ if /4/;
}
Modules to explore
• Awesome Perl — curated list of awesome Perl5 frameworks, libraries and software
• bioperl — practical descriptions of BioPerl modules
• metacpan: XML::LibXML — xml/html parsing
• metacpan: String::Approx — fuzzy matching
• metacpan: Tie::IxHash — ordered associative arrays for Perl
• unix.stackexchange: example for Algorithm::Combinatorics
• unix.stackexchange: example for Text::ParseWords
• unix.stackexchange: sort words by syllable count using Lingua::EN::Syllable
• stackoverflow: regular expression modules
Summary
This chapter showed how to enable modules via the -M option and some examples for standard
and third-party modules. You also saw how to convert cryptic one-liners to a full-fledged perl
script using the O module.
Exercises
a) For the given space separated words, display the max word determined by alphabetic order.
$ s='let in bat xml me lion'
b) For the given space separated words, randomize the order of characters for each word.
$ s='this is a sample sentence'
c) Use metacpan: XML::LibXML to get content of all tags named blue for the input file
sample.xml . See grantm: Perl XML::LibXML by example for a detailed book on XML::LibXML
module.
$ cat sample.xml
<doc>
<greeting type="ask">Hi there. How are you?</greeting>
<greeting type="reply">I am good.</greeting>
<color>
<blue>flower</blue>
<blue>sand stone</blue>
<light-blue>sky</light-blue>
<light-blue>water</light-blue>
</color>
</doc>
##### add your solution here
flower
sand stone
Multiple file input
You have seen special blocks like BEGIN , END and control structures like next and exit
that affect the entire input contents. This chapter will discuss features that help to make
decisions around individual files when there are multiple files passed as input.
The array @ARGV contains the command-line arguments intended for the script.
$#ARGV is generally the number of arguments minus one, because $ARGV[0] is the
first argument, not the program’s command name itself.
See also stackoverflow: referencing filename passed as arguments for more details
about @ARGV behavior when -n or -p switch is active.
The special variable $ARGV contains the name of the current file when reading from <> .
ARGV is the special filehandle that iterates over the command-line filenames in @ARGV . Usually
written as the null filehandle in the angle operator <> . Note that currently ARGV
only has its magical effect within the <> operator; elsewhere it is just a plain filehandle
corresponding to the last file opened by <> .
By closing ARGV at the end of each input file, you can reset the $. variable.
$ # logic to do something at the start of each input file
$ # closing ARGV will reset $.
$ perl -ne 'print "--- $ARGV ---\n" if $. == 1;
print;
close ARGV if eof' greeting.txt table.txt
--- greeting.txt ---
Hi there
Have a nice day
Good bye
--- table.txt ---
brown bread mat hair 42
blue cake mug shirt -7
yellow banana window shoes 3.14
In scalar context, <> will return the next input record and in list context, <> returns all
the remaining input records. If you need a single character instead of a record, you can use
the getc function. See perldoc: getc for documentation.
$ # note that only -e option is used, same as: perl -e 'print scalar <>'
$ perl -e 'print scalar readline' greeting.txt
Hi there
$ perl -e '$line = <>; print "$line---\n"; print <>' greeting.txt
Hi there
---
Have a nice day
Good bye
STDIN
The STDIN filehandle is useful to distinguish between files passed as argument and stdin
data. See Comparing records section for more examples.
$ # with no file arguments, <> reads stdin data
$ printf 'apple\nmango\n' | perl -e 'print <>'
apple
mango
Skipping remaining contents per file
You have seen examples where the exit function is used to avoid processing unnecessary
records for the current file and any other files yet to be processed. Sometimes, you need to
skip only the contents of the current file and move on to the next file. The close ARGV
example seen previously comes in handy for such cases.
$ # print filename if it contains 'I' anywhere in the file
$ # same as: grep -l 'I' f[1-3].txt greeting.txt
$ # same as: perl -0777 -nE 'say $ARGV if /I/'
$ # but slurping is dependent on size of input files and available memory
$ perl -nE 'if(/I/){say $ARGV; close ARGV}' f[1-3].txt greeting.txt
f1.txt
f2.txt
Summary
This chapter introduced features for processing multiple file inputs and constructing file level
decisions. These will show up in many more examples in coming chapters.
Exercises
a) Print the last field of first two lines for the input files passed as arguments to the perl
script. Assume space as the field separators for these two files. To make the output more
informative, print filenames and a separator as shown in the output below. Assume input files
will have at least two lines.
$ # assume table.txt ip.txt are passed as file inputs
##### add your solution here
>table.txt<
42
-7
----------
>ip.txt<
World
you
----------
b) For the given list of input files, display all filenames that contain at or fun in the third
field in any of the input lines. Assume space as the field separator.
$ # assume sample.txt secrets.txt ip.txt table.txt are passed as file inputs
##### add your solution here
secrets.txt
ip.txt
table.txt
c) Print the first two lines for each of the input files ip.txt , sample.txt and table.txt
. Also, add a separator between the results as shown below (note that the separator isn’t
present at the end of the output). Assume input files will have at least two lines.
##### add your solution here
Hello World
How are you
---
Hello World
---
brown bread mat hair 42
blue cake mug shirt -7
Processing multiple records
Often, you need to consider multiple lines at a time to make a decision, such as the paragraph
mode examples seen earlier. Sometimes, you need to match a particular record and then get
records surrounding the matched record. Solutions to these types of problems often take the
form of state machines. See softwareengineering: FSM examples if you are not familiar with
state machines.
Processing consecutive records
You might need a condition that matches one thing for a record and something else for the
very next record. There are many ways to tackle this problem. One possible solution is to
use a variable to save the previous record and then create the required conditional expression
using that variable and $_ , which already has the current record content.
$ # match and print two consecutive records
$ # first record should contain 'as' and second record should contain 'not'
$ perl -ne 'print $p, $_ if /not/ && $p=~/as/; $p = $_' programming_quotes.txt
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it by Brian W. Kernighan
Context matching
Sometimes you want not just the matching records, but the records relative to the matches as
well. For example, it could be to see the comments at the start of a function block that was
matched while searching a program file. Or, it could be to see extended information from a
log file while searching for a particular error message.
$ cat context.txt
blue
toy
flower
sand stone
light blue
flower
sky
water
language
english
hindi
spanish
tamil
programming language
python
kotlin
ruby
Case 1: Printing the matching record as well as n records that follow it. A variable keeps
track of how many more records are to be printed. Here’s how the $n && $n-- condition
used below plays out:
• If initially $n=2
∘ 2 && 2 --> evaluates to true and $n becomes 1
∘ 1 && 1 --> evaluates to true and $n becomes 0
∘ 0 && --> evaluates to false and $n doesn’t change
• Note that when conditionals are connected with logical && , the right expression will
not be executed at all if the left one turns out to be false because the overall result
will always be false . Same is the case if left expression evaluates to true with
logical || operator. Such logical operators are also known as short-circuit operators.
Thus, in the above case, $n-- won’t be executed when $n is 0 on the left hand side.
This prevents $n going negative and $n && $n-- will never become true unless
$n is assigned again.
$ # same as: grep --no-group-separator -A1 'blue'
$ # print matching line as well as the one that follows it
$ perl -ne '$n=2 if /blue/; print if $n && $n--' context.txt
blue
toy
light blue
flower
Once you’ve understood the above examples, the rest of the examples in this section should
be easier to comprehend. They are all variations of the logic used above and re-arranged to
solve the use case being discussed.
Case 2: Print n records after the matching record. This is similar to previous case, except
that the matching record isn’t printed.
$ # print 2 lines after the matching line
$ perl -ne 'print if $n && $n--; $n=2 if /prog/' context.txt
python
kotlin
Case 3: Printing n th record after the matching record.
$ # print only the 3rd line found after the matching line
$ # $n && !--$n will be true only when --$n yields 0
$ # overlapping cases won't work as $n gets re-assigned before going to 0
$ perl -ne 'print if $n && !--$n; $n=3 if /language/' context.txt
spanish
ruby
Case 4: Printing the matching record as well as n records before it. Here, the input records
are saved in an array indexed by line number and the appropriate slice is printed upon match.
$ # this will work even if there are fewer than n records before a match
$ n=5 perl -ne '$i=$.-$ENV{n}; $i=0 if $i<0; $ip[$.]=$_;
print @ip[$i .. $.] if /toy/' context.txt
blue
toy
To prevent confusion with overlapping cases, you can add a separation line between the results.
$ n=2 perl -ne '$i=$.-$ENV{n}; $i=0 if $i<0; $ip[$.]=$_;
if(/toy|flower/){print $s, @ip[$i .. $.]; $s="---\n"}' context.txt
blue
toy
---
blue
toy
flower
---
sand stone
light blue
flower
You can also use the logic from Case 3 by applying tac twice. This avoids the need to buffer
records in an array.
$ tac context.txt | perl -ne 'print if $n && !--$n; $n=2 if /language/' | tac
sky
spanish
Records bound by distinct markers
This section will cover cases where the input file will always contain the same number of
starting and ending patterns, arranged in an alternating fashion. For example, there cannot
be two starting patterns appearing without an ending pattern between them and vice versa.
Zero or more records of text can appear inside such groups as well as in between the groups.
The sample file shown below will be used to illustrate examples in this section. For simplicity,
assume that the starting pattern is marked by start and the ending pattern by end . They
have also been given group numbers to make it easier to visualize the transformation between
input and output for the commands discussed in this section.
$ cat uniform.txt
mango
icecream
--start 1--
1234
6789
**end 1**
how are you
have a nice day
--start 2--
a
b
c
**end 2**
par,far,mar,tar
Case 1: Processing all the groups of records based on the distinct markers, including the
records matched by markers themselves. For simplicity, the below command will just print all
such records.
$ perl -ne '$f=1 if /start/; print if $f; $f=0 if /end/' uniform.txt
--start 1--
1234
6789
**end 1**
--start 2--
a
b
c
**end 2**
perl -ne 'print if /start/../end/' can be used as seen previously in Range
operator section. The state machine format is more flexible for various cases to follow.
Case 2: Processing all the groups of records but excluding the records matched by markers
themselves.
$ perl -ne '$f=0 if /end/; print "* $_" if $f; $f=1 if /start/' uniform.txt
* 1234
* 6789
* a
* b
* c
Case 3-4: Processing all the groups of records but excluding one of the markers.
$ perl -ne '$f=1 if /start/; $f=0 if /end/; print if $f' uniform.txt
--start 1--
1234
6789
--start 2--
a
b
c
The next four cases are obtained by just using if !$f instead of if $f from the cases
shown above.
Case 5: Processing all input records except the groups of records bound by the markers.
$ # same as: perl -ne 'print if !(/start/../end/)'
$ perl -ne '$f=1 if /start/; print if !$f; $f=0 if /end/' uniform.txt
mango
icecream
how are you
have a nice day
par,far,mar,tar
Case 6: Processing all input records except the groups of records between the markers.
$ perl -ne '$f=0 if /end/; print if !$f; $f=1 if /start/' uniform.txt
mango
icecream
--start 1--
**end 1**
how are you
have a nice day
--start 2--
**end 2**
par,far,mar,tar
Case 7-8: Similar to case 6, but include only one of the markers.
$ perl -ne 'print if !$f; $f=1 if /start/; $f=0 if /end/' uniform.txt
mango
icecream
--start 1--
how are you
have a nice day
--start 2--
par,far,mar,tar
Specific blocks
Instead of working with all the groups (or blocks) bound by the markers, this section will
discuss how to choose blocks based on some additional criteria.
Here’s how you can process only the first matching block. See also stackoverflow: copy pattern
between range only once and stackoverflow: extract only first range.
$ perl -ne '$f=1 if /start/; print if $f; exit if /end/' uniform.txt
--start 1--
1234
6789
**end 1**
Getting the last block alone involves a lot more work, unless you happen to know how many
blocks are present in the input file.
$ # reverse input linewise, change the order of comparison, reverse again
$ # can't be used if record separator has to be something other than newline
$ tac uniform.txt | perl -ne '$f=1 if /end/; print if $f; exit if /start/' | tac
--start 2--
a
b
c
**end 2**
$ # or, save the blocks in a buffer and print the last one alone
$ perl -ne 'if(/start/){$f=1; $buf=$_; next}
$buf .= $_ if $f;
$f=0 if /end/;
END{print $buf}' uniform.txt
--start 2--
a
b
c
**end 2**
Excluding n th block.
$ seq 30 | perl -ne 'BEGIN{$n=2; $c=0} if(/4/){$f=1; $c++}
print if $f && $c!=$n; $f=0 if /6/'
4
5
6
24
25
26
$ # print only the 2nd block
$ seq 30 | perl -ne '$c++ if /4/; if($c==2){print; exit if /6/}'
14
15
16
Broken blocks
Sometimes, you can have markers in random order and mixed in different ways. In such cases,
to work with blocks without any other marker present in between them, the buffer approach
comes in handy again.
$ cat broken.txt
qqqqqqqqqqqqqqqq
error 1
hi
error 2
1234
6789
state 1
bye
state 2
error 3
xyz
error 4
abcd
state 3
zzzzzzzzzzzzzzzz
Summary
This chapter covered various examples of working with multiple records. State machines play
an important role in deriving solutions for such cases. Knowing various corner cases is also
crucial, otherwise a solution that works for one input may fail for others.
The next chapter will discuss use cases where you need to process a file input based on the
contents of another file.
Exercises
a) For the input file sample.txt , print a matching line containing do only if the previous
line is empty and the line before that contains you .
##### add your solution here
Just do-it
Much ado about nothing
b) Print only the second matching line respectively for the search terms do and not for
the input file sample.txt . Match these terms case insensitively.
$ # for reference, here's all the matches
$ grep -i 'do' sample.txt
Just do-it
No doubt you like it too
Much ado about nothing
$ grep -i 'not' sample.txt
Not a bit funny
Much ado about nothing
c) For the input file sample.txt , print the matching lines as well as n lines around the
matching lines. The value for n is passed to the perl command as an environment variable.
$ # match a line containing 'are' or 'bit'
$ n=1 ##### add your solution here
Good day
How are you
Today is sunny
Not a bit funny
No doubt you like it too
Good day
d) For the input file broken.txt , print all lines between the markers top and bottom .
The first perl command shown below doesn’t work because it is matching till end of file if
second marker isn’t found. Assume that the input file cannot have two top markers without
a bottom marker appearing in between and vice-versa.
$ cat broken.txt
top
3.14
bottom
---
top
1234567890
bottom
top
Hi there
Have a nice day
Good bye
$ # wrong output
$ perl -ne '$f=0 if /bottom/; print if $f; $f=1 if /top/' broken.txt
3.14
1234567890
Hi there
Have a nice day
Good bye
$ # expected output
##### add your solution here
3.14
1234567890
e) For the input file concat.txt , extract contents from a line starting with %%% until but
not including the next such line. The block to be extracted is indicated by the variable n
passed as an environment variable.
$ cat concat.txt
%%% addr.txt
How are you
This game is good
Today %%% is sunny
%%% broken.txt
top %%%
1234567890
bottom
%%% sample.txt
Just %%% do-it
Believe it
%%% mixed_fs.txt
pink blue white yellow
car,mat,ball,basket
$ n=4 ##### add your solution here
pink blue white yellow
car,mat,ball,basket
f) For the input file perl.md , replace all occurrences of perl (irrespective of case) with
Perl . But, do not replace any matches between ```perl and ``` lines ( perl in these
markers shouldn’t be replaced either).
##### add your solution here, redirect the output to 'out.md'
g) Print the last two lines for each of the input files ip.txt , sample.txt and table.txt .
Also, add a separator between the results as shown below (note that the separator isn’t present
at the end of the output). Assume input files will have at least two lines.
##### add your solution here
12345
You are funny
---
Much ado about nothing
He he he
---
blue cake mug shirt -7
yellow banana window shoes 3.14
Two file processing
This chapter focuses on solving problems which depend upon contents of two or more files.
These are usually based on comparing records and fields. Sometimes, record number plays a
role too. You’ll also see some examples where entire file content is used.
Comparing records
Consider the following input files which will be compared line wise to get common lines and
unique lines.
$ cat color_list1.txt
teal
light blue
green
yellow
$ cat color_list2.txt
light blue
black
dark green
yellow
If you do not wish to use modules, you can make use of a hash to compare records between the
two files.
$ # common lines
$ # same as: grep -Fxf color_list1.txt color_list2.txt
$ # for two file input, $#ARGV will be 0 only for the first file
$ # note that 'exists' isn't strictly necessary here
$ perl -ne 'if(!$#ARGV){$h{$_}=1; next}
print if exists $h{$_}' color_list1.txt color_list2.txt
light blue
yellow
$ # using if-else instead of next
$ perl -ne 'if(!$#ARGV){ $h{$_}=1 }
else{ print if exists $h{$_} }' color_list1.txt color_list2.txt
light blue
yellow
$ # read all lines from first file passed as STDIN in BEGIN block
$ perl -ne 'BEGIN{ $h{$_}=1 while <STDIN> }
print if exists $h{$_}' <color_list1.txt color_list2.txt
light blue
yellow
You can use uniq function from List::Util module to preserve only one copy of duplicates
from one or more input files. See Dealing with duplicates chapter for field based duplicate
processing.
$ # input order of lines is preserved
$ # this is same as performing union between two sets
$ perl -MList::Util=uniq -e 'print uniq <>' color_list1.txt color_list2.txt
teal
light blue
green
yellow
black
dark green
The metacpan: List::Compare module supports set operations like union, intersection,
symmetric difference etc. See also metacpan: Array::Utils.
$ # union, input order of lines is NOT preserved
$ # note that only -e option is used and one of the files is passed as stdin
$ perl -MList::Compare -e '@a1=<STDIN>; @a2=<>;
print List::Compare->new(\@a1, \@a2)->get_union
' <color_list1.txt color_list2.txt
black
dark green
green
light blue
teal
yellow
$ # lines from color_list1.txt not present in color_list2.txt
$ perl -MList::Compare -e '@a1=<STDIN>; @a2=<>;
print List::Compare->new(\@a1, \@a2)->get_unique
' <color_list1.txt color_list2.txt
green
teal
Comparing fields
In the previous sections, you saw how to compare whole contents of records between two files.
This section will focus on comparing only specific field(s). The below sample file will be one of
the two file inputs for examples in this section. Consider whitespace as the field separator, so
-a option is enough to get the fields.
$ cat marks.txt
Dept Name Marks
ECE Raj 53
ECE Joel 72
EEE Moi 68
CSE Surya 81
EEE Tia 59
ECE Om 92
CSE Amy 67
To start with, here’s a single field comparison. The problem statement is to fetch all the records
from marks.txt if the first field matches any of the departments listed in dept.txt file.
$ cat dept.txt
CSE
ECE
For multiple field comparison, you can use comma separated values to construct the hash keys.
The special variable $; (whose default is \034 ) will be used to join these values. The \034
character is usually not present in text files. If you cannot guarantee absence of this character,
you can use some other character or use hash of hashes. See also stackoverflow: using array
as hash key.
$ cat dept_name.txt
EEE Moi
CSE Amy
ECE Raj
$ # don't use array slice as hash keys
$ perl -anE '$h{@F[0..1]}=1; say join ",", keys %h' dept_name.txt | cat -v
Moi
Moi,Amy
Moi,Raj,Amy
$ # default $; value is \034, same as SUBSEP in awk
$ perl -anE '$h{$F[0],$F[1]}=1; say join ",", keys %h' dept_name.txt | cat -v
EEEˆ\Moi
CSEˆ\Amy,EEEˆ\Moi
ECEˆ\Raj,CSEˆ\Amy,EEEˆ\Moi
Here’s an alternate method with hash of hashes. See also perldoc: REFERENCES.
$ perl -ane 'if(!$#ARGV){ $h{$F[0]}{$F[1]}=1 }
else{ print if exists $h{$F[0]}{$F[1]} }' dept_name.txt marks.txt
ECE Raj 53
EEE Moi 68
CSE Amy 67
Here’s an example of appending a new field based on a lookup. Assume a file role.txt that
maps names to roles, say Raj to class_rep, Tia to placement_rep and Amy to sports_rep.
Combining it with marks.txt would give:
Dept Name Marks Role
ECE Raj 53 class_rep
ECE Joel 72
EEE Moi 68
CSE Surya 81
EEE Tia 59 placement_rep
ECE Om 92
CSE Amy 67 sports_rep
Here’s an example that shows how you can replace the m th line of a file with the n th line
from another file.
$ # replace 3rd line of table.txt with
$ # 2nd line of greeting.txt
$ perl -pe 'BEGIN{ $m=3; $n=2; $s = <STDIN> for 1..$n }
$_ = $s if $. == $m' <greeting.txt table.txt
brown bread mat hair 42
blue cake mug shirt -7
Have a nice day
You can use file slurping for fixed string multiline search and replace requirements. The below
example is substituting complete lines. The solution will work for partial lines as well, provided
there is no newline character at the end of search.txt and repl.txt files.
$ head -n2 table.txt > search.txt
$ cat repl.txt
2$1$&3
wise ice go goa
Don’t save contents of search.txt and repl.txt in shell variables for passing
them to the perl script. Trailing newlines and ASCII NUL characters will cause issues.
See stackoverflow: pitfalls of reading file into shell variable for details.
Summary
This chapter discussed use cases where you need to process the contents of two or more files
based on entire record/file or field(s). The value of $#ARGV is handy for such cases (the
value is n-2 while the first of n input files is being processed). The next chapter discusses more
such examples, based solely on occurrences of duplicate values.
Exercises
a) Use contents of match_words.txt file to display matching lines from jumbled.txt and
sample.txt . The matching criteria is that the second word of lines from these files should
match the third word of lines from match_words.txt .
$ cat match_words.txt
%whole(Hello)--{doubt}==ado==
just,\joint*,concession<=nice
b) Interleave contents of secrets.txt with the contents of a file passed as stdin in the
format as shown below.
##### add your solution here, use 'table.txt' for stdin data
stag area row tick
brown bread mat hair 42
---
deaf chi rate tall glad
blue cake mug shirt -7
---
Bi tac toe - 42
yellow banana window shoes 3.14
c) The file search_terms.txt contains one search string per line (these have no regexp
metacharacters). Construct a solution that reads this file and displays search terms (matched
case insensitively) that were found in all of the other input file arguments. Note that these
terms should be matched with any part of the line, not just whole words.
$ cat search_terms.txt
hello
row
you
is
at
d) Replace third to fifth lines of input file ip.txt with second to fourth lines from file
para.txt
##### add your solution here
Hello World
How are you
Start working on that
project you always wanted
to, do not let it end
You are funny
e) Insert one line from jumbled.txt before every two lines of copyright.txt
##### add your solution here
overcoats;furrowing-typeface%pewter##hobby
bla bla 2015 bla
blah 2018 blah
wavering:concession/woof\retailer
bla bla bla
copyright: 2020
f) Use entire contents of match.txt to search error.txt and replace with contents of
jumbled.txt . Partial lines should NOT be matched.
$ cat match.txt
print this
but not that
$ cat error.txt
print this
but not that or this
print this
but not that
if print this
but not that
print this
but not that
Dealing with duplicates
Often, you need to eliminate duplicates from input file(s), based on entire line content, field(s),
etc. These are typically solved with the sort and uniq commands. Advantages with perl
include regexp based field separators, a record separator other than newline, no requirement of
sorted input, and in general more flexibility because it is a programming language.
You can use uniq function from List::Util module or use a hash to retain only first copy
of duplicates from one or more input files.
$ cat purchases.txt
coffee
tea
washing powder
coffee
toothpaste
tea
soap
tea
The hash based solution is easy to adapt for removing field based duplicates. Just change $_
to the required field(s) after setting the appropriate field separator.
$ cat duplicates.txt
brown,toy,bread,42
dark red,ruby,rose,111
blue,ruby,water,333
dark red,sky,rose,555
yellow,toy,flower,333
white,sky,bread,111
light red,purse,rose,333
$ # retain only the first copy of duplicates based on the last field
$ perl -F, -ane 'print if !$h{$F[-1]}++' duplicates.txt
brown,toy,bread,42
dark red,ruby,rose,111
blue,ruby,water,333
dark red,sky,rose,555
Here’s an example involving multiple fields. As seen in the Comparing fields section, you can
either use comma separated values to construct the hash key or use a hash of hashes.
$ # based on first and third field
$ # can also use: perl -F, -ane 'print if !$h{$F[0]}{$F[2]}++'
$ perl -F, -ane 'print if !$h{$F[0],$F[2]}++' duplicates.txt
brown,toy,bread,42
dark red,ruby,rose,111
blue,ruby,water,333
yellow,toy,flower,333
white,sky,bread,111
light red,purse,rose,333
Duplicate count
In this section, how many times a duplicate record is found plays a role in determining the
output. First up, printing only a specific numbered duplicate.
$ # print only the second occurrence of duplicates based on 2nd field
$ perl -F, -ane 'print if ++$h{$F[1]} == 2' duplicates.txt
blue,ruby,water,333
yellow,toy,flower,333
white,sky,bread,111
Next, printing only the last copy of a duplicate. Since the count isn’t known in advance, the
tac command comes in handy again.
$ # reverse the input line-wise, retain first copy and then reverse again
$ tac duplicates.txt | perl -F, -ane 'print if !$h{$F[-1]}++' | tac
brown,toy,bread,42
dark red,sky,rose,555
white,sky,bread,111
light red,purse,rose,333
To get all the records based on a duplicate count, you can pass the input file twice. Then use
the two file processing tricks to make decisions.
$ # all duplicates based on last column
$ perl -F, -ane '!$#ARGV ? $h{$F[-1]}++ :
$h{$F[-1]}>1 && print' duplicates.txt duplicates.txt
dark red,ruby,rose,111
blue,ruby,water,333
yellow,toy,flower,333
white,sky,bread,111
light red,purse,rose,333
Summary
This chapter showed how to work with duplicate contents, both record and field based. If you
don’t need regexp based separators and if your input is too big to fit in memory, then the
specialized command line tools sort and uniq will be better suited.
Exercises
a) Retain only first copy of a line for the input file lines.txt . Case should be ignored while
comparing lines. For example hi there and HI TheRE will be considered as duplicates.
$ cat lines.txt
Go There
come on
go there
---
2 apples and 5 mangoes
come on!
---
2 Apples
COME ON
b) Retain only first copy of a line for the input file twos.txt . Assume space as field separator
with two fields on each line. Compare the lines irrespective of order of the fields. For example,
hehe haha and haha hehe will be considered as duplicates.
$ cat twos.txt
hehe haha
door floor
haha hehe
6;8 3-4
true blue
hehe bebe
floor door
3-4 6;8
tru eblue
haha hehe
c) For the input file twos.txt , display only unique lines. Assume space as field separator
with two fields on each line. Compare the lines irrespective of order of the fields. For example,
hehe haha and haha hehe will be considered as duplicates.
##### add your solution here
true blue
hehe bebe
tru eblue