Perl Regular Expressions

**ele5125** · June 12th, 2002, 08:27 AM

Disclaimer:

This tutorial is about regular expressions in Perl. Though I make
every effort to provide accurate information, some of what
I say may be wrong. This tutorial is also by
no means complete.

Assumptions made:

I assume that you have a working knowledge of the fundamentals of the Perl language.

Introduction:

We will start off our discussion by defining what a regular expression
is. A regular expression is a pattern to be matched against a string, where a string is a sequence of characters. If the pattern matches the string, true is returned; if it fails, false is returned. We can thus test for success or failure. Not only can we check for success or failure, we may also use substitution to swap the section of the string that matched with another string. As you can see, regular expressions are very powerful. Let us now embark on our journey into the world of regular expressions in Perl.

We have a definition for a regular expression; let us now examine its parts. Here is our first expression:

/abcd/

This is an expression. Notice how it is enclosed in forward slashes; this denotes a regular expression. Now let's match it against a string:

$_ = "this line contains abcd together side by side";

if (/abcd/) {
    print "It matched";
}

Wait, you say, what are we comparing it with? We are comparing /abcd/ with the contents of the $_ scalar variable. The if statement checks whether the characters abcd appear together, side by side, in $_. They do, so it prints It matched.

Single characters: any single character except the newline character is represented with a dot (.).
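Here is a small sketch of the dot in action. The sample strings are made up for illustration; the point is only that the dot accepts an ordinary character and refuses a newline.

```perl
use strict;
use warnings;

# Sample strings are invented for this example.
my $dot_matches_letter  = ("cat"  =~ /c.t/) ? 1 : 0;   # the dot stands in for "a"
my $dot_matches_newline = ("c\nt" =~ /c.t/) ? 1 : 0;   # the dot will not match "\n"

print "c.t against cat:           $dot_matches_letter\n";
print "c.t against c<newline>t:   $dot_matches_newline\n";
```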

Character Classes:

What if we wanted to determine whether any one of the letters a, b, c or d appears in the string, instead of all of them in sequence? We would enclose the letters in square brackets like so:

/[abcd]/

This is called a character class. The square brackets mean that exactly one character must match, and it may be any one of the enclosed characters. But be careful, because our earlier example string will still match. In fact, if we let it keep going, it would match every a, b, c and d in the string, one character at a time. If instead we had "a cobera abcd me", the expression matches the very first a, not the abcd later in the string. Why? Because the regex engine reports the first match it finds, and since the expression only asks for one of the characters a, b, c or d, the first "a" satisfies it.
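To see which character a class actually grabs, here is a rough sketch using the string from the paragraph above. It relies on two standard Perl variables not covered in this tutorial: $& (the text that matched) and $-[0] (the offset where the match started). The /g flag in list context, also standard Perl, collects every match.

```perl
use strict;
use warnings;

my $text = "a cobera abcd me";   # the string from the paragraph above

my ($first_char, $first_offset);
if ($text =~ /[abcd]/) {
    $first_char   = $&;      # the text that matched
    $first_offset = $-[0];   # offset where the match began
}

# In list context, /g returns every single-character match in order.
my @all = $text =~ /[abcd]/g;

print "first match: '$first_char' at offset $first_offset\n";
print "number of matches with /g: ", scalar(@all), "\n";
```

The first match is the leading a at offset 0, even though abcd appears later in the string.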

If we want more control, let us say that we want to match any one of the letters in the alphabet. It would be time consuming to write out each and every character, so instead a shortcut was created. Simply place a - between two characters; this indicates to fill in all the letters in between. Like so:

/[a-zA-Z]/

This will match any letter in the alphabet, capital or small. It can also be applied to numerical digits. There is also a negated character class, which places a caret ^ just after the opening square bracket like so:

/[^a-zA-Z]/

This indicates to match anything except a letter of the alphabet. It means exclude everything enclosed in the brackets. You can think of it as the opposite of a regular character class. Another example:

/[^123]/

This says to match anything except the digits 1, 2 or 3. But what if we want to use a ^ in our expression and do not want it to be interpreted with its special negated class meaning? In this case we escape it with a backslash like so:

/[^\^]/

This matches anything except a caret ^. The same goes for any character that has special meaning: if you want to use a special character literally, you escape it with a backslash. Now for more shortcuts. Perl has special predefined classes, because expressions like the following are so common that we need shortcuts:

/[a-zA-Z]/
/[^a-zA-Z]/

The above are common expressions, so we have shortcuts. Here they are:

\d is equivalent to [0-9]
\D is the opposite, equivalent to [^0-9]
\w is equivalent to [a-zA-Z0-9_] (note the underscore)
\W is the opposite, equivalent to [^a-zA-Z0-9_]
\s is equivalent to [ \r\t\f\n] (whitespace characters)
\S is the opposite, equivalent to [^ \r\t\f\n]

This makes things a lot easier. Now all I have to do to say match any digit is:

/\d/

Much better!
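Here is a quick sketch that counts what a few of these classes pick out of a string. The sample text is made up, and it uses the /g flag in list context (standard Perl) to collect every match.

```perl
use strict;
use warnings;

my $sample = "Room 42, floor 3";   # invented sample text

my @digits     = $sample =~ /\d/g;          # every digit: 4, 2, 3
my @non_words  = $sample =~ /\W/g;          # the three spaces and the comma
my @not_letter = $sample =~ /[^a-zA-Z]/g;   # negated class: anything but a letter

print "digits:            ", scalar(@digits), "\n";
print "non-word chars:    ", scalar(@non_words), "\n";
print "non-letter chars:  ", scalar(@not_letter), "\n";
```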

Multipliers:

Expressions can contain multipliers, which specify that a particular character must occur a certain number of times in the string. The common multipliers are *, + and ?. A * following a character means to match zero or more of that character. A + means to match one or more of the previous character, and finally ? means to match zero or one of the preceding character. Some examples:

/a*/ # matches 0 or more a's
/gr+/ # matches g followed by 1 or more r's
/\d?/ # matches 0 or 1 digits
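To see how much text each multiplier actually takes, here is a small sketch. The helper what_matched is hypothetical (written just for this example), and it uses the standard Perl variable $&, which holds the text of the last successful match.

```perl
use strict;
use warnings;

# Hypothetical helper for this example: returns the text a pattern
# matched, or undef when there is no match.
sub what_matched {
    my ($string, $pattern) = @_;
    return $string =~ $pattern ? $& : undef;
}

my $star    = what_matched("aaab",  qr/a*/);    # "aaa": as many a's as possible
my $plus    = what_matched("grrr!", qr/gr+/);   # "grrr": g plus all the r's
my $nothing = what_matched("bbb",   qr/a*/);    # "": zero a's still counts as a match
my $digit   = what_matched("7 up",  qr/\d?/);   # "7": one digit was available

print "a*  on aaab:  '$star'\n";
print "gr+ on grrr!: '$plus'\n";
print "a*  on bbb:   '$nothing'\n";
print "\\d? on 7 up:  '$digit'\n";
```

Notice that /a*/ succeeds even on a string with no a in it, because zero occurrences is allowed.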

There is another way of multiplying, lucky us. Let's say we want 6 or more of a character; then we place {6,} in curly braces after it like so:

/\w{6,}/ # matches 6 or more word characters

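What if we want 6 or fewer? Then we give a lower bound of zero:

/abcd{0,6}/ # matches abc followed by 0 to 6 d's (writing {,6} only works in Perl 5.34 and later; older versions treat it as literal text)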

Hey, we can also have a range like so:

/\d{2,7}/ # matches with anywhere from 2 to 7 digits

See how powerful regular expressions can be?

But what if we want specifically 50 of a character:

/d{50}/ # matches with exactly 50 d's
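Here is a sketch of the curly-brace counts at work on a made-up string of ten digits. As before, what_matched is a hypothetical helper written for this example, and $& is the standard Perl variable holding the matched text.

```perl
use strict;
use warnings;

# Hypothetical helper for this example: returns the text a pattern
# matched, or undef when there is no match.
sub what_matched {
    my ($string, $pattern) = @_;
    return $string =~ $pattern ? $& : undef;
}

my $digits = "1234567890";   # invented sample: ten digits

my $two_to_seven  = what_matched($digits, qr/\d{2,7}/);   # takes as many as allowed: 7
my $six_or_more   = what_matched($digits, qr/\d{6,}/);    # no upper limit: all 10
my $exactly_three = what_matched($digits, qr/\d{3}/);     # the first 3 only
my $too_short     = what_matched("12345", qr/\d{6,}/);    # only 5 digits: no match

print "{2,7} matched: $two_to_seven\n";
print "{6,}  matched: $six_or_more\n";
print "{3}   matched: $exactly_three\n";
print "{6,} on 12345: ", (defined $too_short ? $too_short : "no match"), "\n";
```

A range like {2,7} takes as many characters as it can, up to its upper limit.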

Memory:

What if your pattern matched part of the string and you want to use the matching piece later on? The answer, my friend, is parentheses. Perl offers the ability to memorize the sections of the string that matched, by enclosing part of the pattern in parentheses like so:

/(\d)abc \1/ # memorize the single digit that matched before abc

If your string is "4abc 4" this will match, because there is a single digit immediately before abc. The digit that matched is enclosed in parentheses, so we memorize it. To access it we type \1, which recalls the first memory space; \2 recalls the second, and so on. Thus we say: if a single digit is found before the characters abc, followed by a space and then the same digit again, the expression matches.

I can't explain memory very well but that was a short intro.
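Maybe a small sketch helps here. It runs the pattern above against two made-up strings, and it also reads the memory back after the match using $1. That variable is standard Perl but is not covered in this tutorial: \1 is for use inside the pattern, $1 is for use afterwards.

```perl
use strict;
use warnings;

# Same digit before and after: \1 must repeat whatever (\d) captured.
my $same_digit = ("4abc 4" =~ /(\d)abc \1/) ? 1 : 0;   # matches
my $diff_digit = ("4abc 5" =~ /(\d)abc \1/) ? 1 : 0;   # fails: 5 is not 4

# After a successful match, the memory is also available as $1.
my $remembered;
if ("4abc 4" =~ /(\d)abc \1/) {
    $remembered = $1;
}

print "4abc 4 matched: $same_digit\n";
print "4abc 5 matched: $diff_digit\n";
print "remembered digit: $remembered\n";
```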

Alternation:

If we want to match either one pattern or another we can put the two patterns separated by a |. The symbol | means or in this case. Like so:

/\d|me/ # matches a digit OR the character sequence me
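Here is a rough sketch that tries this pattern on a few made-up strings. Keep in mind that me only has to appear somewhere in the string, so a word like home also matches.

```perl
use strict;
use warnings;

# Invented sample strings; 1 means the pattern matched, 0 means it did not.
my %matched;
for my $input ("call me", "route 66", "home", "hello") {
    $matched{$input} = ($input =~ /\d|me/) ? 1 : 0;
}

for my $input (sort keys %matched) {
    print "$input: $matched{$input}\n";
}
```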

Precedence:

In math we learn BEDMAS. In regular expressions we must also be concerned about the order in which the parts of our expressions are evaluated. In math if we did something like:

12 + 4 * 6

We would multiply first, then add. If we added first and then multiplied, we would get a different answer. Similarly, in regular expressions you must be concerned with order, or precedence. Here it is in table format, from highest precedence to lowest:

Name                   | Representation
----------------------------------------------------
Parentheses            | ( ) (?: )
----------------------------------------------------
Multipliers            | ? + * {m,n} ?? +? *? {m,n}?
----------------------------------------------------
Sequence and anchoring | abc ^ $ \A \Z (?= ) (?! )
----------------------------------------------------
Alternation            | |

To sum it up, parentheses bind most tightly, followed by multipliers, then sequence and anchoring (I did not discuss anchoring in this tutorial). Last is alternation (or).
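A small sketch may make the ordering concrete. It uses the anchors ^ (start of string) and $ (end of string) from the table, plus the non-memorizing parentheses (?: ), none of which are explained in this tutorial. The helper what_matched is hypothetical and exists only for this example.

```perl
use strict;
use warnings;

# Hypothetical helper for this example: returns the text a pattern
# matched, or undef when there is no match.
sub what_matched {
    my ($string, $pattern) = @_;
    return $string =~ $pattern ? $& : undef;
}

# Multipliers bind tighter than sequence: the + applies to b alone.
my $plus_on_b  = what_matched("abab", qr/ab+/);       # "ab"
# Parentheses bind tightest: the + now applies to the whole group ab.
my $plus_on_ab = what_matched("abab", qr/(?:ab)+/);   # "abab"

# Alternation binds loosest: this reads as "starts with a" OR "ends with b".
my $loose = ("ab" =~ /^a|b$/)     ? 1 : 0;   # matches
# Grouping the alternation means "the whole string is a or b".
my $tight = ("ab" =~ /^(?:a|b)$/) ? 1 : 0;   # fails: ab is two characters

print "ab+ matched:     $plus_on_b\n";
print "(?:ab)+ matched: $plus_on_ab\n";
print "^a|b\$ on ab:     $loose\n";
print "^(?:a|b)\$ on ab: $tight\n";
```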

Misc:

Here are some topics we did not discuss that I will briefly introduce. We briefly went over how to use an expression, but let's take an example:

if (/\d{7}/) {
    print "Hello world!";
}

This compares the pattern against the string contained in $_. Why? Because if you do not specify a variable in the if statement, the match uses $_ by default. But what if we want to compare it to a different string, perhaps one called $tring:

$tring = "Hello 876 Me";
$tring =~ /\d/;

Now instead of comparing to $_ we compare to $tring. Here the return value (true in this case) is ignored; it is not assigned to anything, just forgotten. What if we want to do something if it matched:

$tring = "h3llo j03";

if ($tring =~ /h\d/)
{
    print "We found h followed by a digit in $tring";
}
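If you would rather keep the result of a match than use it right away, a minimal sketch looks like this. The variable names $has_digit and $has_z are made up for the example.

```perl
use strict;
use warnings;

my $tring = "Hello 876 Me";

# Store the true/false result of each match instead of discarding it.
my $has_digit = ($tring =~ /\d/);   # true: the string contains digits
my $has_z     = ($tring =~ /z/);    # false: there is no z anywhere

print "digit found\n" if $has_digit;
print "no z found\n"  unless $has_z;
```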

And there we have it. One last topic is substitution:

If I want to not only match but also swap the matching part of a string with something else, I can do this:

$tring = "what a beautiful world";

$tring =~ s/world/string/;

This says to look for the letter sequence world and replace it with string, so now $tring contains what a beautiful string. That is what the s means: substitute. We use / as a delimiter, but we can just as easily use # like so:

$tring =~ s#world#string#;

And this does the same thing. Optionally, if we want to ignore case, we can put an i after the entire expression like so:

$tring = "a beautiful WoRlD";
$tring =~ s#world#string#i;

In this case it would swap WoRlD because we said to ignore case; otherwise it would not match and nothing would be swapped. Also consider the following:

$tring = "this is a long long long long long string";
$tring =~ s#long#short#g;

The g means to look for all occurrences, so every long is replaced with short. Without the g, only the first long would be replaced.
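Here is a short sketch comparing a substitution with and without g, on made-up strings. It also uses the fact that s/// returns the number of replacements it made, which is standard Perl but not covered above.

```perl
use strict;
use warnings;

my $once = "long long long";   # invented sample strings
my $all  = "long long long";

# Without g only the first occurrence is replaced.
my $count_once = ($once =~ s/long/short/);
# With g every occurrence is replaced.
my $count_all  = ($all  =~ s/long/short/g);

print "without g: '$once' ($count_once replacement)\n";
print "with g:    '$all' ($count_all replacements)\n";
```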

Ta da. This tutorial is complete, but I skipped a lot of topics. Like I said, this by no means covers all of regular expressions, but it should be enough to get you started. Practice, experiment and have fun.
