Friday, April 22, 2011

5 little things you should know about regexp

1. Groups
I discovered this amazing feature not long ago.
A group is a pair of parentheses used to group subpatterns. For example, h(a|i)t matches hat or hit. A group also captures the text matched within the parentheses, and groups are numbered from left to right by their opening parenthesis. For example,
(a(b*))+(c*)
regex.group(1): (a(b*))
regex.group(2): (b*)
regex.group(3): (c*)
regex.group(0): the whole match
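
Here is a small sketch of the same pattern in Python. The input string "abbabcc" is just an assumption chosen for illustration. Note that when a group sits inside a repetition, it keeps only the text from its last iteration.

```python
import re

# Same pattern as above; "abbabcc" is an illustrative input.
match = re.fullmatch(r"(a(b*))+(c*)", "abbabcc")

print(match.group(0))  # abbabcc  (the whole match)
print(match.group(1))  # ab       (last repetition of (a(b*)))
print(match.group(2))  # b        (last repetition of (b*))
print(match.group(3))  # cc       ((c*))
```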

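2. Comments - a great thing. Python and Perl rock here! Java has it as well (the Pattern.COMMENTS flag), although it is easy to overlook.
Roman numeral example in Python (compile it with re.VERBOSE so that whitespace and comments inside the pattern are ignored):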
  pattern = """
    ^                   # beginning of string
    M{0,4}              # thousands - 0 to 4 M's
    (CM|CD|D?C{0,3})    # hundreds - 900 (CM), 400 (CD), 0-300 (0 to 3 C's),
                        #            or 500-800 (D, followed by 0 to 3 C's)
    (XC|XL|L?X{0,3})    # tens - 90 (XC), 40 (XL), 0-30 (0 to 3 X's),
                        #        or 50-80 (L, followed by 0 to 3 X's)
    (IX|IV|V?I{0,3})    # ones - 9 (IX), 4 (IV), 0-3 (0 to 3 I's),
                        #        or 5-8 (V, followed by 0 to 3 I's)
    $                   # end of string
    """
3. Don't reinvent the wheel - use an existing one. There are plenty of great constructs that represent digits, words, and the beginning and end of a line; learn these idioms and use them. Perl has a great library, Regexp::Common, with a lot of useful ready-made patterns for dates, phones, etc. In Java, SimpleDateFormat does something similar, but in a very limited context.
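
A few of these built-in shortcuts, sketched in Python. The sample text is made up for illustration; \d, \w, \b, ^ and $ are standard constructs.

```python
import re

# Made-up sample text with two lines.
text = "Order 66 shipped on 2011-04-22\nOrder 7 pending"

# \d is a digit and \b a word boundary: find whole numbers.
numbers = re.findall(r"\b\d+\b", text)
# \w is a word character; with re.MULTILINE, ^ and $ anchor to each line.
first_words = re.findall(r"^\w+", text, re.MULTILINE)
last_words = re.findall(r"\w+$", text, re.MULTILINE)

print(numbers)      # ['66', '2011', '04', '22', '7']
print(first_words)  # ['Order', 'Order']
print(last_words)   # ['22', 'pending']
```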

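4. Quantifiers - use them carefully. The concepts of greedy, possessive and reluctant quantifiers are not easy to understand. A greedy quantifier takes as much as it can and gives some back if the rest of the pattern fails, a reluctant one takes as little as it can, and a possessive one takes as much as it can and never gives anything back.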
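
A quick way to see the difference between greedy and reluctant matching, sketched in Python with a made-up input string. Possessive quantifiers (such as .++) are left out of this sketch because Python's re module supports them only from version 3.11 on; Java has them.

```python
import re

# Made-up input for illustration.
html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ grabs as much as possible, so one match spans everything
# from the first "<" to the last ">".
greedy = re.findall(r"<.+>", html)
# Reluctant: .+? grabs as little as possible, so each tag matches alone.
reluctant = re.findall(r"<.+?>", html)

print(greedy)     # ['<b>bold</b> and <i>italic</i>']
print(reluctant)  # ['<b>', '</b>', '<i>', '</i>']
```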

5. Test. Play with your regexp as much as you can before going to production. Too lazy to write unit tests?
- No problem. There is an amazing website, regexplib, where you can try a pattern out without any language or IDE.
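
And if you are not too lazy, a unit test for a regexp can be very short. A minimal sketch in Python: the `DATE` pattern and the test names are hypothetical, picked only to show the shape of such a test.

```python
import re

# Hypothetical pattern under test: a date written like 2011-04-22.
DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def test_accepts_well_formed_dates():
    assert DATE.match("2011-04-22")

def test_rejects_malformed_input():
    for bad in ("22-04-2011", "2011-4-22", "2011-04-22 extra", ""):
        assert DATE.match(bad) is None

# Call the tests directly, or let a runner such as pytest collect them.
test_accepts_well_formed_dates()
test_rejects_malformed_input()
print("all regexp tests passed")
```

Checking the strings that must not match is usually the more valuable half.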

2 comments:

  1. http://gskinner.com/RegExr/
    http://myregexp.com/

  2. Wow, amazing tools, thanks!
