Perl and Regular Expressions


Perl INP 2005/2006 - Perl and regular expressions

Lennart Herlaar
[email protected]
http://www.cs.uu.nl/people/lennart
room A104, telephone 030-2533921
March 9, 2006

- Originally designed for processing of (textual) data.
- Written as a combination of various Unix shell commands: shell scripts, awk, sed, tr, grep.
- Therefore it has a strange syntax, but it is easier to learn if you know these programs already.
- There are many additional modules available.
- There is an extensive user community on the Internet.
- It is very portable (Unix/Windows).
- PHP was largely derived from Perl.
- Not just used for CGI programming, but also for maintenance and reporting.

Perl
- Variables do not have a fixed type.
- Lots of type juggling.
- Implicit arguments: print $_ versus print.
- Built-in: strings, lists, dictionaries (associative arrays, hashes).
- Strong support for regular expressions and other string handling.
- Some object-oriented features (but I skip those).
- Modular setup (import modules).
- Design is rather messy.
- Programs are difficult to understand and maintain.
- While programming: have a book or the manual pages handy.

Variables and types


Your basic types:

- Scalars such as $pet = 'bello'
- Arrays like @numarray = (1, 2, 3, "four")
- Hashes (associative arrays) such as %ENV
- Handles for files, such as IN
- Subroutines such as &fac

Scalar values: 'gimme a $', "Beware of \U$pet", 33, 3.3, `pwd`

- Backquotes, as in `pwd`, are used for running commands.
- Expressions obtain a %, &, $ or @, depending on what type they should have.
- Last index of an array? $#numarray
- $ENV{HOME} accesses the $HOME directory in the environment.
- Casting/coercions: print "123" + 1 gives 124

Type contexts and juggling


- Essential difference with PHP: scalars, hashes and arrays live in different namespaces. $id, @id, etc. can exist at the same time.
- Juggling exists between different scalar types (similar to PHP): everything is true except "0", "", 0 and undef.
- Between namespaces only explicit casts are well-defined.

    @numarray = (1,2,"three",5);                 # in array context
    $len = @numarray;                            # in scalar context
    $last = $numarray[$#numarray];               # Note the $ for numarray!
    $alsolast = $numarray[scalar(@numarray)-1];
    $alsolast = $numarray[@numarray-1];
    $lastagain = pop(@numarray);                 # Yield and remove
    print join(":", @numarray);                  # 1:2:three
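The context rules above can be tried out in a few lines. This is only a sketch: the variable names $first, $false_count and $true_count are illustrative additions, and the lexical (my) declarations are there to satisfy use strict, which the slide code does not use.

```perl
use strict;
use warnings;

my @numarray = (1, 2, "three", 5);

my $len     = @numarray;      # scalar context: the number of elements
my ($first) = @numarray;      # list context: the first element
my $last    = $numarray[-1];  # a negative index counts from the end

# Only a handful of scalar values are false; everything else is true.
my $false_count = grep { !$_ } (0, "0", "", undef);
my $true_count  = grep { $_ } ("0.0", "00", " ", "a");

print "len=$len first=$first last=$last\n";
print "false: $false_count, true: $true_count\n";
```

Note that strings such as "0.0" and "00" are true, although they are numerically equal to zero.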

Input and output


- Basic file functions like open and opendir.
- Files are opened in a certain mode (read, append, write).
- File handles are capitalized identifiers.
- Input and output go implicitly to standard input and output, but you may also supply a handle.

    open (INPUT, "speech");
    open (TRASHCAN, ">trash.can");   # create for writing
    print "Throwing away the garbage\n";
    $line = <INPUT>;                 # one line only
    @remainder = <INPUT>;            # read the rest
    chomp($line);                    # removes a trailing newline, if present
    chop($line);                     # removes the last character, whatever it is
    print TRASHCAN "Throw me a $line, please\n";
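The same steps can also be written with lexical file handles, three-argument open and error checks. The following is a sketch, not the style used on the slide; the scratch file created with File::Temp and its contents are made up for the illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical scratch file standing in for "trash.can".
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close($tmp_fh);

open(my $out, '>', $path) or die "Cannot write $path: $!\n";
print $out "first line\nsecond line\nthird line\n";
close($out) or die "Cannot close $path: $!\n";

open(my $in, '<', $path) or die "Cannot read $path: $!\n";
my $line      = <$in>;     # scalar context: one line only
my @remainder = <$in>;     # list context: all remaining lines
close($in);

chomp($line);              # strips the trailing newline only
chomp(@remainder);
print "got '$line' plus ", scalar(@remainder), " more lines\n";
```

The or die "...$!" pattern reports the operating system's error message when a file cannot be opened.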

perlre.pdf March 9, 2006 1

Operators
- File operators: -e $a is true if a file named $a exists, -d $a is true if $a is a directory, and many more.
- Comparison: the usual, but strings have different operators: eq, ne, lt, gt, le, ge, cmp. No ===.
- 2 ** 16 for exponentiation. 2 x 16 for repetition.
- ++ also increments "file1" to "file2".
- Simultaneous assignments: ($a, $b) = ($b, $a).
- More flexible assignments:

    ($fst, $snd, @otherwords) = split(" ", $line);
    print ++($snore = "zz");   # aaa
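A short sketch of how the numeric and string operators differ on the same data; all variable names here are illustrative.

```perl
use strict;
use warnings;

my $numeric = (10 <=> 9);                     # 1: as numbers, 10 is larger
my $textual = ("10" cmp "9");                 # -1: as strings, "1" sorts before "9"
my $equal   = ("1.0" == 1)   ? "yes" : "no";  # numeric comparison: yes
my $same    = ("1.0" eq "1") ? "yes" : "no";  # string comparison: no

my $power = 2 ** 16;      # 65536
my $rule  = "-" x 16;     # a string of sixteen dashes
my $name  = "file1";
$name++;                  # string increment: "file2"

my ($p, $q) = (1, 2);
($p, $q) = ($q, $p);      # swap without a temporary variable

print "$numeric $textual $equal $same $power $name $p $q\n";
```

Choosing the operator decides how the operands are interpreted, so sorting numbers with cmp or comparing strings with == is a common source of bugs.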

Control structures
- Use a compound statement (braces) with the normal if.
- unless is the inverse of if.
- elsif, and not elseif as in PHP.

    kissme() if showup($me);    # no braces necessary
    killme() unless $ipayup;
    while (<*.java>) {
        chmod 0711, $_;         # set access rights for all java files
    }

Loopy arrays
    foreach $cds (@collection) {
        print "<LI>$cds<\/LI>\n";
    }
    @hex = (0 .. 9, 'a' .. 'f');         # range operator
    %knor = ('a',1,'b',2,'c',10);        # squashed pairs key, value
    foreach $key (sort keys %knor) {
        print "$key has value $knor{$key}";
    }
    # Better hash notation
    %map = (red => 'afghan', blue => 'curacao', black => 'pearl');

Subroutines

- No named parameters: take parameters out of the @_ array.
- Variables are only local by explicit mention. If there is no mention, identifiers have global package scope.
- A signature can be used to indicate the types of parameters (for checking).

    sub f ($$@) {
        $a = shift(@_);   # modifies global $a
        $b = shift;       # implicit parameter
        @rest = @_;
        # Or use ($a, $b, @rest) = @_;
        return (@rest, $a, $b);
    }
    @v = (3, 4, 5);
    $a = 2;
    @v = &f (1, $a, @v);
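To avoid overwriting global variables as f does, the parameters can be unpacked into lexical variables declared with my. A minimal sketch, where rotate and $counter are hypothetical names chosen for the illustration:

```perl
use strict;
use warnings;

# rotate() mirrors the slide's f, but keeps its variables local.
sub rotate {
    my ($first, $second, @rest) = @_;   # unpack the argument list
    return (@rest, $first, $second);
}

our $counter = 2;                       # package variable, visible everywhere
my @v = rotate(1, $counter, 3, 4, 5);
print join(",", @v), "\n";              # 3,4,5,1,2
```

After the call $counter is still 2, because rotate only assigns to its own lexical variables.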

Signatures
- Use signatures!
- Optional parameters go to the right of the semicolon.
- References (& in PHP) are prefixed by a backslash.
- The first array can be changed; the second one, if available, cannot.
- Unbackslashed arrays and hashes eat everything.

    sub complex ($\@;$@) { ...... }
    sub dumbo (@$$) {
        @fst = @_;   # also includes 1 and 2
    }
    &complex (1, \@a, 2);
    &dumbo (@a, 1, 2);
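The effect of a backslashed array in a signature (the Perl documentation calls these declarations prototypes) can be seen in a small sketch. The subroutine push_twice is a hypothetical helper, not part of the slides.

```perl
use strict;
use warnings;

# The \@ makes Perl pass the array itself by reference.
sub push_twice (\@$) {
    my ($aref, $value) = @_;
    push @$aref, $value, $value;   # changes the caller's array
    return scalar @$aref;
}

my @items = (1);
my $count = push_twice(@items, 7);   # no backslash needed at the call site
print "@items ($count)\n";           # 1 7 7 (3)
```

The sub has to be declared before the call for the signature to take effect, and calling it with a leading & (as in &push_twice(...)) bypasses the check, so the reference must then be passed explicitly.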

Programming advice
- Perl can be used for CGI, but also for website maintenance.
- Use "use strict".
- Add signature information to subroutines, and put function definitions at the top of your source files.
- Turn on all warnings (-w option).
- Avoid using implicit parameters unless you know what you are doing.
- Avoid dependence on implicit casts.
- Use or die ".....$!\n" whenever you do I/O.


Closing time for Perl


- I skipped formats.
- I skipped the humongous quantity of special variables like $?, $., .....
- The CGI module does all the administration for CGI. Functions like hr, li, ol, h1, ....
- The DBI module provides the database interface.
- Several modules for cookies, HTML parsing, SMTP etc.
- Perl has many exotic facilities, some of them perverse.
- It has one big advantage: you can become a guru.

Regular expressions
- Most languages do have them, but rarely as embedded into the language as in Perl.
- I concentrate on regular expressions and on using them for
  - matching (for validation)
  - substitution (for modification)
- Regular expressions are similar to the Regular Languages and Finite Automata of Grammaticas en Ontleden (the Grammars and Parsing course).
- However, most regular expression languages can do more. In fact, in some respects they go beyond the context-free languages.
- PHP offers Perl regular expressions, with some minor differences.


INP 2003/2004 - regular expressions


Simple string functions


    $s = "Colourless Green Ideas Sleep Furiously";
    # Upper case for really getting the message across
    $u = "\U$s";
    # Using translate
    $iv = $s;
    $iv =~ tr/a-zA-Z/A-Za-z/;    # invert case
    ($e = $s) =~ tr/e//cd;       # only the e's remain
    $our = substr ($s, 3, 3);    # "our"
    $sly = substr ($s, -3);      # "sly"
    substr ($s, 4, 1) = "";      # americanize colour to color

What if we want more?


- Test that a string takes the form of a zip code (like 1234 XY).
- Given a fully qualified filename, find the name of the file without extension.
- Retrieve negative floating point numbers from a file.
- Remove (non-nested) comments from programs.
- Change e-mail references ending in mailto:....uu.nl into a reference to the corresponding homepage of the person.

Regular expressions are an important tool for system administrators and other clever people for quickly doing things. They are also supported by many editors (vi) and terminal programs (grep, sed).
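The first three tasks in the list might be approached as follows. This is a rough sketch under simplifying assumptions: the subroutine names are invented, the zip code pattern only accepts four digits, an optional space and two capital letters, and the number pattern requires digits on both sides of the decimal point.

```perl
use strict;
use warnings;

# Zip code: four digits, optional space, two capital letters.
sub is_zip_code {
    my ($text) = @_;
    return $text =~ /^\d{4} ?[A-Z]{2}$/ ? 1 : 0;
}

# File name without directories and without its last extension.
sub base_name {
    my ($path) = @_;
    (my $name = $path) =~ s{.*/}{};   # drop everything up to the last slash
    $name =~ s/\.[^.]*$//;            # drop the last extension, if any
    return $name;
}

# All negative floating point numbers in a piece of text.
sub negative_floats {
    my ($text) = @_;
    return $text =~ /(-\d+\.\d+)/g;   # list context: every captured number
}

print is_zip_code("1234 XY"), "\n";
print base_name("/usr/local/notes.txt"), "\n";
print join(",", negative_floats("a -1.5 b 2.5 c -0.25")), "\n";
```

Each pattern is deliberately simple; for instance, negative_floats would also pick -1.5 out of 3-1.5, which may or may not be what is wanted.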


First examples
Pattern matching using /.../ or m/.../

    print "Not empty" if ($str =~ /\S/);
    if ($bandname =~ /[iI]ce/) {....}
    @words = $line =~ m/\S+/g;

Substituting for a pattern: s/../../..

    $s = "jack in the box";
    ($t = $s) =~ s/\s+/-/g;   # jack-in-the-box
    ($v = $s) =~ s/\s*/-/g;   # -j-a-c-k--i-n-...
    $s =~ s/\S+/X/g;          # X X X X

Possible flags:

- i: match case-insensitively
- g: match as often as possible, not just once
- s: treat a newline like any other character (so . matches it too)

Characteristics
- Matching is done from left to right, and matches are as long as possible.
- Under the g flag, as many matches as possible are found, trying from left to right.
- Resulting matches can be put in an array (or a single scalar).
- In a boolean context, matching is true if a match was found.
- Substitution actually changes the string you match on.
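These characteristics can be observed directly. The sketch below reuses the "jack in the box" string from the earlier examples; the HTML fragment and the variable names are illustrative.

```perl
use strict;
use warnings;

my $line = "jack in the box";

my @words   = $line =~ /\S+/g;            # list context: every match
my ($first) = $line =~ /(\S+)/;           # list context: the captured group
my $found   = ($line =~ /box/) ? 1 : 0;   # boolean context: was there a match?

# Longest possible match: from the first < to the last >.
my ($greedy) = "<b>bold</b>" =~ /(<.*>)/;

# The substitution changes $copy; $line itself stays untouched.
(my $copy = $line) =~ s/\s+/-/g;

print scalar(@words), " words, first is $first, greedy match is $greedy\n";
```

Copying the string first, as in the last statement, is the usual way to keep the original when a substitution is applied.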


Character classes
- The usual special characters: \t, \n, \\ but also \s, \S, \w, \W.
- Matching any single lower case letter: [a-z]. All digits: [0-9].
- Alphanumeric: [0-9A-Za-z], or use [[:alnum:]]. The [:name:] classes are only valid inside a bracket expression.
- Combine them: [01[:alpha:]#].
- Digits also by \d, white space by \s, non-whitespace by \S.
- Complementation: [[:^space:]] is equivalent to \S.
- Every character but a, e or f: [^aef].
- Matching every name that does not contain lennart: $name !~ /lennart/.
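A brief sketch of the bracketed classes and their shorthands side by side. The sample text is assembled from the contact details at the top of these notes; the variable names are illustrative.

```perl
use strict;
use warnings;

my $text = "Room A104, tel. 030-2533921";

my @digit_runs = $text =~ /[[:digit:]]+/g;    # POSIX class inside brackets
my @same_runs  = $text =~ /\d+/g;             # \d is the shorthand
my @letters    = $text =~ /[[:alpha:]]+/g;    # runs of letters only
my @chunks     = $text =~ /[[:^space:]]+/g;   # complemented class
my @chunks_too = $text =~ /\S+/g;             # same thing with \S

print "@digit_runs | @letters | @chunks\n";
```

On this input both spellings of each class find the same matches, so the choice between them is mostly a matter of readability.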

Matching sequences
A character class matches only a single character. How can we match sequences of characters?

- /max[iy]ma/ matches both maxima and maxyma.
- /a*/ matches a sequence of zero or more a's.
- /\S+/ matches a sequence of one or more non-whitespace characters.
- Matching decimal numbers with [[:digit:]]+
- Matching identifiers: [a-zA-Z$_][a-zA-Z0-9$_]*
- Matching the word option or nothing: (option)?
- Parentheses can be used to group patterns.


Anchors and extraction


- Anchors ^ and $: ^ matches the beginning of the string, $ matches the end of the string.
- Grouped patterns (between parentheses) can be extracted via the variables $1, $2, ....
- Example: match lines with (at least) five tab-separated values, and extract the first two of these values.

    if ($line =~ /^(.*)\t(.*)\t(.*)\t(.*)\t(.*)$/) {
        $artist = $1; $title = $2;
        if ($title eq "Interview") {
            @interviews = (@interviews,$artist);
        }
    }
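One caveat with the example above: because (.*) is greedy, a line with more than five values lets the first group run across a tab. A possible variant, sketched here with made-up sample data, uses [^\t]* so that each group stays within one field.

```perl
use strict;
use warnings;

# Hypothetical line with six tab-separated values.
my $line = "Artist A\tInterview\t2001\tNL\tradio\textra";
my @interviews;

# [^\t]* cannot cross a tab, so $1 and $2 are exactly the first two fields.
if ($line =~ /^([^\t]*)\t([^\t]*)\t[^\t]*\t[^\t]*\t[^\t]*/) {
    my ($artist, $title) = ($1, $2);
    push @interviews, $artist if $title eq "Interview";
}

# With the greedy version the first group swallows the surplus field.
my ($greedy_first) = $line =~ /^(.*)\t(.*)\t(.*)\t(.*)\t(.*)$/;

print "interviews: @interviews\n";
```

Here $greedy_first holds "Artist A", a tab and "Interview", which is why the stricter character class is the safer choice when the number of fields can vary.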

Some more matchings


Choice between patterns by using the or, |.

    sub getfilename ($) {
        my $path = $_[0];
        chomp($path);           # remove end of line symbol
        $path =~ s/.*\///;      # leave only the filename
        return $path;           # s/// itself returns the number of substitutions
    }
    sub emptyString($) {
        return !($_[0] =~ /\S+/);
    }
    if ($file =~ /(\.gz|\.zip|\.tgz)$/) {
        &decompress($file);
    }


Some example substitutions


- Substitutions can use $1, $2, ... to retrieve parts of the matched string.
- g for global substitution (do them all).

    # Some HTMLizing
    $string =~ s/&/&amp;/g;
    # Removing extensions
    $name =~ s/\..*//;
    # Adding text to files in a certain place.
    s/$anchor/$anchor$inserttext/;
    $txt = "34 12 78 56";
    $txt =~ s/(\d+)\s(\d+)/$2 $1/g;   # 12 34 56 78

Take care
- Perl regular expressions are not regular: /([a-zA-Z]+)\s+\1/ matches a sequence of letters followed, after white space, by that same sequence.
- On $t = "abc ac abcdef define"; it yields def.
- Inside the pattern we must use \1 in place of $1.
- Not even context-free.
- As you can imagine, I skipped a few facilities. Read the Perl manual pages on regular expressions for the complete story.
- Study regular expressions well: they pop up everywhere.
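The def result can be reproduced with the sketch below. The second pattern adds the word-boundary anchor \b, which is not covered in these notes, to restrict the match to whole repeated words; the sentence "it is is fine" is a made-up example.

```perl
use strict;
use warnings;

my $t = "abc ac abcdef define";

# The repeated letters may be the tail of one word and the head of the next.
my ($repeated) = $t =~ /([a-zA-Z]+)\s+\1/;

# With \b on both sides only whole repeated words match.
my ($whole) = "it is is fine" =~ /\b([a-zA-Z]+)\s+\1\b/;
my $none    = ($t =~ /\b([a-zA-Z]+)\s+\1\b/) ? 1 : 0;

print "repeated=$repeated whole=$whole none=$none\n";
```

In $t the match is the def at the end of abcdef followed by the def at the start of define, so no whole word is actually repeated and the anchored pattern fails.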


Regex in PHP
- preg_grep - Return array entries that match the pattern.
- preg_match_all - Perform a global regular expression match.
- preg_match - Perform a regular expression match.
- preg_replace - Perform a regular expression search and replace.
- preg_split - Split string by a regular expression.

http://weblogtoolscollection.com/regex/regex.php could be useful.

    $fl_array = preg_grep("/^(\d+)?\.\d+$/", $array);
    $string = 'April 15, 2003';
    $pattern = '/(\w+) (\d+), (\d+)/i';
    $replacement = '${1}1,$3';
    echo preg_replace($pattern, $replacement, $string);
