Thread: Perl Regular Expressions

  1. #1
    Senior Member
    Join Date
    Jun 2002

    Perl Regular Expressions


    This tutorial is about regular expressions in Perl. Though I make
    every effort to provide accurate information, some of what I write
    may be wrong. This tutorial is also by no means complete.

    Assumptions made:

    I assume that you have a working knowledge of the fundamentals of the Perl language.


    We will start off our discussion by defining what a regular expression
    is. A regular expression is a pattern to be matched against a string, where a string is a sequence of characters. If the pattern matches the string, true is returned; if it fails, false is returned. We can thus test for success or failure. Beyond that, we can also use substitution to swap the section of the string that matched with another string. As you can see, regular expressions are very powerful. Let us now embark on our journey into the world of regular expressions in Perl.

    We have a definition for a regular expression; let us now examine its parts. Here is our first expression:

    /abcd/

    This is an expression. Notice how it is enclosed in forward slashes; this denotes a regular expression. Now let's match it against a string:

    $_ = "this line contains abcd together side by side";

    if (/abcd/){
    print "It matched";
    }

    Wait, you say, what are we comparing it with? We are comparing /abcd/ with the contents of the $_ scalar variable. The if statement checks whether abcd appears, together and in that order, in our $_ variable. It does, so it will print It matched.

    Single characters: any single character except the newline character is represented with a dot.

    Character Classes:

    What if we wanted to determine whether any one of the letters a, b, c or d appears in the string, instead of all four together? We would enclose the letters in square brackets like so:

    /[abcd]/

    This is called a character class. The square brackets denote that exactly one of the enclosed characters must match at that position. But be careful, because our example string will still match: it contains several of these letters, and the first match is the c in contains. Likewise, if we had "a cobera abcd me" it would match the very first a, not the abcd later on in the string. Why? Because the engine stops at the first match, and since the expression states that one of the characters a, b, c or d should match, "a" satisfies the expression.
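
    To see which character a class actually picks, here is a small sketch. The variable names are made up for the example, and it relies on two things the tutorial does not cover: the special variable $&, which holds the matched text, and $-[0], which holds the offset where the match began.

    ```perl
    use strict;
    use warnings;

    # Made-up variable names; the string is the one discussed above.
    my $text = "a cobera abcd me";

    if ($text =~ /[abcd]/) {
        my $matched = $&;      # the text that matched
        my $start   = $-[0];   # offset where the match began
        print "Matched '$matched' at position $start\n";   # Matched 'a' at position 0
    }
    ```

    The class matches a single character, so the later abcd is never reached.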

    If we want more control, let us say that we want to match any one letter of the alphabet. It would be time consuming to write out each and every character, so a shortcut was created. Simply place a - between two characters; this indicates a range that includes every character in between. Like so:

    /[a-zA-Z]/

    This will match any letter of the alphabet, capital or small. Ranges can also be applied to numerical digits. There is also a negated character class, which places a caret ^ just after the opening square bracket like so:

    /[^a-zA-Z]/

    This indicates to match anything except a letter of the alphabet. The caret means exclude everything enclosed in the brackets. You can think of it as the opposite of a regular character class. Another example:

    /[^123]/

    This says to match any character except the digits 1, 2 and 3. But what if we want to use a ^ in our expression without it being interpreted with its special negated class meaning? In this case we would escape it with a backslash like so:

    /[^\^]/

    This matches any character except a caret ^. The same goes for any character that has special meaning: if you want to use a special character literally, you escape it with a backslash. Now for more shortcuts. Some character classes are so common that Perl predefines abbreviations for them, for example:

    /[0-9]/
    /[a-zA-Z0-9_]/

    Here are the shortcuts:

    \d is equivalent to [0-9]
    \D is the opposite, equivalent to [^0-9]
    \w is equivalent to [a-zA-Z0-9_] (note the underscore)
    \W is the opposite, equivalent to [^a-zA-Z0-9_]
    \s is equivalent to [ \r\t\f\n] (whitespace characters)
    \S is the opposite, equivalent to [^ \r\t\f\n]

    This makes things a lot easier. Now all I have to do to say match any digit is:

    /\d/

    Much better!
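
    As a quick illustration of the shortcuts, the sketch below tests one made-up string against each of them. It uses the $string =~ /pattern/ form, which binds the match to a named variable instead of $_.

    ```perl
    use strict;
    use warnings;

    my $sample = "Room 42";   # made-up test string

    print "has a digit\n"          if $sample =~ /\d/;   # the 4
    print "has a non-digit\n"      if $sample =~ /\D/;   # the R
    print "has a word character\n" if $sample =~ /\w/;   # the R
    print "has whitespace\n"       if $sample =~ /\s/;   # the space
    print "has a non-word char\n"  if $sample =~ /\W/;   # the space
    ```

    All five lines print, because each shortcut only needs one character somewhere in the string.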


    Expressions can contain multipliers, which specify that a particular character must occur a certain number of times in the string. The common multipliers are *, + and ?. A * following a character means to match zero or more of the previous character, + means to match one or more of the previous character, and finally ? means to match zero or one of the preceding character. Some examples:

    /a*/ # matches 0 or more a's
    /gr+/ # matches g followed by 1 or more r's
    /\d?/ # matches 0 or 1 digits
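
    A minimal sketch of how the three multipliers differ, using made-up sample words. The colou?r pattern is my own example and is not from the tutorial.

    ```perl
    use strict;
    use warnings;

    # Made-up sample strings, one line of output per word.
    for my $word ("g", "gr", "grrr") {
        my $star = $word =~ /gr*/ ? "yes" : "no";   # g then zero or more r's
        my $plus = $word =~ /gr+/ ? "yes" : "no";   # g then one or more r's
        print "$word: gr* $star, gr+ $plus\n";
    }

    # ? makes the preceding character optional, so both spellings match.
    for my $spelling ("color", "colour") {
        print "$spelling matches colou?r\n" if $spelling =~ /colou?r/;
    }
    ```

    Only the bare "g" fails gr+, since + demands at least one r.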

    There is another way of multiplying, lucky us. Let's say we want 6 or more of a character; then we place {6,} in curly braces after it like so:

    /\w{6,}/ # matches 6 or more word characters

    What if we want 6 or fewer? Then we do the following:

    /abcd{0,6}/ # matches abc followed by 0 to 6 d's (before Perl 5.34, {,6} is taken literally)

    Hey, we can also have a range like so:

    /\d{2,7}/ # matches with anywhere from 2 to 7 digits

    See how powerful regular expressions can be?

    But what if we want specifically 50 of a character:

    /d{50}/ # matches with exactly 50 d's
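
    Here is a short sketch of the curly-brace multipliers against a made-up input string. The !~ operator, which is true when the pattern does not match, is not covered in this tutorial and is used only for the last check.

    ```perl
    use strict;
    use warnings;

    # Hypothetical input containing a run of seven digits.
    my $input = "call 5551234 now";

    print "7 digits in a row\n"      if $input =~ /\d{7}/;     # exactly 7
    print "at least 3 digits\n"      if $input =~ /\d{3,}/;    # 3 or more
    print "between 2 and 7 digits\n" if $input =~ /\d{2,7}/;   # 2 to 7
    print "no run of 8 digits\n"     if $input !~ /\d{8}/;     # !~ is true when there is no match
    ```

    Note that an unanchored /\d{7}/ would also match inside a longer run of digits, because the pattern only has to match somewhere in the string.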


    What if your pattern matched part of the string and you want to use this matching piece later on? The answer, my friend, is parentheses. Perl offers the capability to memorize certain sections of the string that matched by enclosing that part of the pattern in parentheses like so:

    /(\d)abc \1/ # memorize the single digit that matched before abc.
    If your string is "4abc 4" this will match. There is a single digit immediately before abc, and because that part of the pattern is enclosed in parentheses the digit is memorized. To access it we type \1, which recalls the first memory; \2 recalls the second and so on. Thus the expression matches if a single digit is found before the characters abc, followed by a space and then the same digit again.

    I can't explain memory very well but that was a short intro.
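
    Maybe an example helps. This sketch runs the pattern above against two made-up strings: one where the second digit repeats the first and one where it does not.

    ```perl
    use strict;
    use warnings;

    # Made-up test strings: the digit after the space must equal the digit before abc.
    for my $candidate ("4abc 4", "4abc 5") {
        if ($candidate =~ /(\d)abc \1/) {
            print "'$candidate' matches\n";
        } else {
            print "'$candidate' does not match\n";
        }
    }
    ```

    The second string fails because \1 stands for whatever (\d) actually matched, here 4, and not for any digit.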


    If we want to match either one pattern or another we can put the two patterns separated by a |. The symbol | means or in this case. Like so:

    /\d|me/ # matches a digit OR the character sequence me
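
    A quick sketch of the same alternation against three made-up strings, to show that either side is enough for a match.

    ```perl
    use strict;
    use warnings;

    # Made-up test strings: one with "me", one with digits, one with neither.
    for my $s ("call me", "route 66", "nothing") {
        if ($s =~ /\d|me/) {
            print "'$s' matches\n";
        } else {
            print "'$s' does not match\n";
        }
    }
    ```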


    In math we learn BEDMAS. In regular expressions we must also be concerned about the order in which our expressions are evaluated. In math if we did something like:

    12 + 4 * 6

    We would multiply first, then add. If we add first and then multiply we get a different answer. Similarly, in regular expressions you must be concerned with order, or precedence. Here it is in table format, from highest precedence to lowest:

    Name | Representation
    Parentheses | ( ) (?: )
    Multipliers | ? + * {m,n} ?? +? *? {m,n}?
    Sequence and anchoring | abc ^ $ \A \Z (?= ) (?! )
    Alternation | |

    To sum it up, parentheses are evaluated first, followed by multipliers, then sequence and anchoring (I did not discuss anchoring in this tutorial). Last is alternation (or).
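
    Here is a small sketch of why the order matters. It uses the anchors ^ (start of string) and $ (end of string), which are listed in the table but not explained in this tutorial; the test string is made up.

    ```perl
    use strict;
    use warnings;

    my $name = "frederick";   # made-up test string

    # Alternation binds loosest, so this reads as (^fred) OR (barney$).
    print "ungrouped pattern matches\n" if $name =~ /^fred|barney$/;

    # Parentheses bind tightest, so both anchors now apply to each alternative.
    print "grouped pattern matches\n"   if $name =~ /^(fred|barney)$/;
    ```

    Only the first line prints: frederick starts with fred, but it is not exactly fred or barney.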


    There are some topics we did not discuss that I will briefly introduce. We briefly went over how to use an expression, but let's take an example:

    if (/\d{7}/){
    print "Hello world!";
    }

    This compares the pattern against the string contained in $_. Why does it do this? Because if you do not specify a variable in the match, by default it uses $_. But what if we want to compare it to a different string, perhaps a string called $tring:

    $tring = "Hello 876 Me";
    $tring =~ /\d/;

    Now instead of comparing to $_ we compare to $tring, but in this case the return value (true here) is ignored. It is not assigned to $tring but instead just forgotten. What if we want to do something if it matched:

    $tring = "h3llo j03";

    if ($tring =~ /h\d/){
    print "We found h followed by a digit in $tring";
    }

    And there we have it. One last topic is substitution:

    If I want to not only match but also swap the matching part of a string with something else, I can do this:

    $tring = "what a beautiful world";

    $tring =~ s/world/string/;

    This says to look for the letter sequence world and replace it with string, so now $tring contains what a beautiful string. That is what the s means: substitute. We use / as a delimiter but we can just as easily use # like so:

    $tring =~ s#world#string#;

    And this does the same thing. Optionally, if we want to ignore case we can put an i after the entire expression like so:

    $tring = "a beautiful WoRlD";
    $tring =~ s#world#string#i;

    In this case it would swap WoRlD because we said to ignore case; otherwise it would not match and not swap. Also consider the following:

    $tring = "this is a long long long long long string";
    $tring =~ s#long#short#g;

    Now the g means to look for all occurrences, thus every long is replaced with short in one statement. Without the g only the first long would be replaced.
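
    The i and g options can be combined. The sketch below, on a made-up string, also uses the fact that a substitution returns the number of replacements it made.

    ```perl
    use strict;
    use warnings;

    my $line = "Long long LONG road";   # made-up sample text

    # s///gi replaces every match regardless of case and returns how many it replaced.
    my $count = ($line =~ s/long/short/gi);

    print "$count replacements: $line\n";   # 3 replacements: short short short road
    ```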

    Ta da. This tutorial is complete, but I skipped a lot of topics. Like I said, this by no means covers all of regular expressions but it should be enough to get you started. Practice, experiment and have fun.
    In snatches, they learn something of the wisdom
    which is of good, and more of the mere knowledge which is of evil. But must I know what must not come, for I shale become those of knowledgedome. Peace~

  2. #2
    An excellent intro to regular expressions. Thanks for the info.

  3. #3
    Senior Member
    Join Date
    Dec 2001
    Awesome Job man. Tutorials are always great.

  4. #4
    Senior Member
    Join Date
    Jul 2001
    Just as a short addendum to your "Memory Section," the readers should be aware that when you enclose anything in parentheses "()" not only can you use that memory (i.e. \1 \2) within the regular expression itself, but you can also use it after the regular expression is complete, through the variables $1, $2 and so on. So if you did something like the following:

    $string="This is a string...";
    $string =~ /(is)/; # example pattern (an assumption), captures "is" into $1
    print $1;

    The above would print "is" to standard out.
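
    With several pairs of parentheses, each one gets its own numbered variable. A minimal sketch with a made-up date string follows; checking the match inside an if matters, because $1 and friends are only meaningful after a successful match.

    ```perl
    use strict;
    use warnings;

    my $date = "2002-06-15";   # made-up sample date

    # Each pair of parentheses fills the next numbered variable.
    if ($date =~ /(\d{4})-(\d{2})-(\d{2})/) {
        print "year $1, month $2, day $3\n";   # year 2002, month 06, day 15
    }
    ```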

    Just wanted to clarify, because that is one of the big advantages and I don't think it was covered enough.

    "It's only arrogance if you can't back it up, otherwise it is confidence." - Me

  5. #5
    Senior Member
    Join Date
    Jun 2002
    Thanks. I was just reading my tutorial over again and realized there are a few other things I forgot to mention.

    If we do not like to use / as a delimiter we may use another character, but we must precede the pattern with an m like so:

    if ($tring =~ m#hello#){
    print ("hello was found in $tring");
    }

    But now to use the actual character # as a character, without it being confused with the delimiter, we must escape it like so:

    $tring =~ s#a pound sign \##a replacement string#;

    Because # now has special meaning as the delimiter, escape it to use it normally.

    There are other special characters I forgot to mention. They include \t for tab, \n for a newline character, \r for a carriage return, and \f for a form feed.

    As for substitution, we need not precede it with an m, but the delimiter must occur 3 times like so:

    $tring =~ s%a string%replace with this string%;

    Our delimiter in this case is %. Get it? Good!
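
    Alternative delimiters pay off most when the pattern itself contains slashes. A small sketch, with a hypothetical path and made-up variable names:

    ```perl
    use strict;
    use warnings;

    my $path = "/usr/local/bin";   # hypothetical path

    # With # as the delimiter the slashes need no escaping.
    print "under /usr/local\n" if $path =~ m#/usr/local#;

    # The same idea for substitution, with % as the delimiter.
    my $moved = $path;
    $moved =~ s%/usr/local%/opt%;
    print "$moved\n";   # /opt/bin
    ```

    With / as the delimiter, every slash in the pattern would need a backslash in front of it.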

    There are even more topics in regular expressions. If anyone else has anything to add, feel free to do so. There may also be errors, so if you see an error feel free to point it out. Thank you very much for your help, Wizeman.
