1. Groups
I discovered this amazing feature not long ago.
A group is a pair of parentheses used to group subpatterns. For example, h(a|i)t matches hat or hit. A group also captures the matching text within the parentheses. For example,
(a(b*))+(c*)
regex.group(1): (a(b*))
regex.group(2): (b*)
regex.group(3): (c*)
regex.group(0): the whole expression
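The numbering above can be checked with a short Python sketch. The input string "abbabcc" is only an illustrative assumption. Note that a group inside a repetition keeps just the text from its last iteration.

```python
import re

# Alternation inside a group: h(a|i)t matches "hat" or "hit".
alt = re.fullmatch(r"h(a|i)t", "hit")
vowel = alt.group(1)   # the letter that was actually matched

# Groups are numbered by the position of their opening parenthesis.
m = re.fullmatch(r"(a(b*))+(c*)", "abbabcc")

whole = m.group(0)   # the entire match
outer = m.group(1)   # (a(b*)), last repetition only
inner = m.group(2)   # (b*), last repetition only
tail = m.group(3)    # (c*)

print(vowel, whole, outer, inner, tail)
```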
2. Comments - a great thing that is unfortunately clumsy in Java (the Pattern.COMMENTS flag exists, but it is awkward without multi-line strings). Python and Perl rock!
Roman numeral example, Python:
import re

pattern = """
^ # beginning of string
M{0,4} # thousands - 0 to 4 M's
(CM|CD|D?C{0,3}) # hundreds - 900 (CM), 400 (CD), 0-300 (0 to 3 C's),
# or 500-800 (D, followed by 0 to 3 C's)
(XC|XL|L?X{0,3}) # tens - 90 (XC), 40 (XL), 0-30 (0 to 3 X's),
# or 50-80 (L, followed by 0 to 3 X's)
(IX|IV|V?I{0,3}) # ones - 9 (IX), 4 (IV), 0-3 (0 to 3 I's),
# or 5-8 (V, followed by 0 to 3 I's)
$ # end of string
"""
# re.VERBOSE makes the engine skip the whitespace and the # comments
re.search(pattern, "MCMXCIX", re.VERBOSE)
3. Don't reinvent the wheel - use an existing one. There are plenty of great constructs that represent digits, words, and the beginning and end of a line; learn and use these idioms. Perl has a great library, Regexp::Common, which has a lot of useful shortcuts for dates, phones, etc. In Java, SimpleDateFormat does something similar, but in a very limited context.
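As a small illustration of these built-in constructs, here is a Python sketch that uses \d, \w, \b, ^ and $ instead of hand-written character ranges. The sample strings are made up for the example.

```python
import re

text = "Order 66 shipped on 2024-01-15"

# \d matches a digit, so \d+ finds every run of digits.
numbers = re.findall(r"\d+", text)

# \w matches a word character and \b marks a word boundary.
words = re.findall(r"\b\w+\b", "learn and use these idioms")

# ^ and $ anchor the pattern to the beginning and end of the string.
date_re = re.compile(r"^\d{4}-\d{2}-\d{2}$")
is_date = date_re.search("2024-01-15") is not None
is_date_in_text = date_re.search(text) is not None

print(numbers, words, is_date, is_date_in_text)
```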
4. Quantifiers - use them carefully. The concepts of greedy, possessive and reluctant quantifiers are not easy to understand.
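The difference between greedy and reluctant is easiest to see on one input. This is a minimal Python sketch with a made-up HTML snippet. Possessive quantifiers are left out because Python's re module only supports them from version 3.11.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ grabs as much as possible, so the match runs
# from the first "<" to the very last ">".
greedy = re.findall(r"<.+>", html)

# Reluctant (lazy): .+? grabs as little as possible, so the match
# stops at the first ">" it can reach.
reluctant = re.findall(r"<.+?>", html)

print(greedy)
print(reluctant)
```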
5. Test. Play with your regexp as much as you can before going to prod. Too lazy to write unit tests?
- No problem. There is an amazing website,
regexplib, where you can play with it without any language or IDE.
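If you do decide to write unit tests, a handful of assertions is often enough. Here is a minimal sketch in Python, using a compact form of the Roman numeral pattern from tip 2. The names is_roman and test_is_roman and the sample values are illustrative assumptions.

```python
import re

# Compact (non-verbose) form of the Roman numeral pattern.
ROMAN = re.compile(r"^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$")

def is_roman(s):
    """Return True if s looks like a Roman numeral up to 4999."""
    # The pattern itself also matches the empty string, so reject it first.
    return bool(s) and ROMAN.search(s) is not None

def test_is_roman():
    for good in ("I", "IV", "MCMXCIX", "MMMMCMXCIX"):
        assert is_roman(good), good
    for bad in ("", "IIII", "MCMC", "abc"):
        assert not is_roman(bad), bad

test_is_roman()
```

The test function is plain enough to be picked up by a runner such as pytest, or simply called directly as shown.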