Regular Expressions explained

Author: liorean (evolt@liorean.f2o.org); 09/09/2002; found in Code, 4umi edited.

What is a regular expression?

Regular expressions are patterns of characters: character sequences that may or may not occur in a given text. Take for example the DOS wildcards ? and *, which have been in common use for file searches ever since computers became mainstream. They are a very limited subset of what RegExp offers. For instance, if you want to find all files beginning with foo, followed by 1 to 4 random characters, and ending with .txt, you can't do that with the usual DOS wildcards. RegExp, on the other hand, can handle that and much more complicated patterns.
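The file name example above could be written roughly as follows. This is only a sketch: the file names in the list are made up for illustration, and the pattern assumes that "random characters" means any characters except a newline.

```javascript
// "foo", then 1 to 4 arbitrary characters, then the literal extension ".txt".
// ^ and $ anchor the pattern to the whole file name.
var fooFile = /^foo.{1,4}\.txt$/;

// Hypothetical file names, invented for this example.
var names = [ 'foo1.txt', 'foobar.txt', 'foo.txt', 'foo12345.txt', 'xfoo1.txt' ];

var hits = [];
for ( var i = 0; i < names.length; i++ ) {
  if ( fooFile.test( names[i] ) ) {
    hits.push( names[i] );
  }
}

console.log( hits ); // [ 'foo1.txt', 'foobar.txt' ]
```

'foo.txt' is rejected because there is no character between foo and .txt, 'foo12345.txt' because there are five, and 'xfoo1.txt' because it does not begin with foo.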

Regular expressions are, in short, a way to handle textual data effectively: to search strings, to replace parts of them, and to do string handling beyond what the built-in methods offer. Often a single regular expression does what the built-in string methods and properties could only do inside a complicated function or loop.

Various programming languages implement regular expressions, with some differences between them. In JavaScript, support starts with version 1.2, which comes with Internet Explorer 4 and later, and Netscape 4 and later.

RegExp syntax

There are two ways of defining regular expressions in JavaScript: one through an object constructor and one through a literal. An object made with the constructor is built at runtime, whereas the literal is compiled when the script is loaded and provides better performance. So the literal is the best choice for regular expressions that are known in advance, while the constructor allows for dynamically constructed regular expressions such as those based on user input. In almost all cases you can use either way to define a regular expression, and the result is handled in exactly the same way no matter how you declare it.

Declaring a literal

Just as a string is written by putting some text between quotation marks (either the single (') or the double (") variant), a RegExp literal is delimited by the forward slash character / in JavaScript. This may be confusing at first, as it is a quite unusual notation. Other languages such as PHP or VBScript use other characters for the same purpose in their regular expressions. Any flags go immediately after the closing slash.

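// 'mac' or 'win', ignoring case
var re = /mac|win/i;
// three pairs of digits, separated twice by the same separator
/^(\d\d)([-:\/])(\d\d)\2(\d\d)$/.exec( '02:22:44' );
// a dollar or euro sign followed by a number (somemoneystring is assumed to be defined elsewhere)
/[\u0024\u20ac]\s*\d+/.test( somemoneystring );
// a word that is accidentally repeated, here 'the the'
'God save the the Queen'.match( /\b(\w+)\b\s+\1/g );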

Extended characters may be entered in the pattern through their Unicode number. In the example above, \u0024 is the dollar sign ($) and \u20ac is the euro sign (€).

Declaring an object

To create a regular expression at runtime, the RegExp() constructor is called with new and two parameters: the first is the pattern, the second the flags, both of type string. If no flags are required, the second parameter can be left out. Since backslashes are used to escape special characters in strings as well as in regular expressions, they must themselves be escaped with another backslash to make it into the expression.

var re = new RegExp( '\\d{' + min + ',' + max + '}' );
var re = new RegExp( '\\b' + window.prompt( 'Your word:', 'example' ) + '\\b', 'gi' );
var re = new RegExp( new Date().getDate() + ' (\\w) ' + new Date().getFullYear(), 'g' );
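The following sketch illustrates the doubled backslash and a related pitfall of building patterns from user input: any special characters in the input are interpreted as pattern syntax. The helper escapeRegExp is hypothetical (it is not a built-in function), and the variable word merely stands in for whatever the user typed.

```javascript
// A single backslash is consumed by the string: '\d+' is just 'd+'.
var wrong = new RegExp( '\d+' );
var right = new RegExp( '\\d+' );
console.log( wrong.test( '123' ) ); // false, the pattern is /d+/
console.log( right.test( '123' ) ); // true, the pattern is /\d+/

// Hypothetical helper: puts a backslash in front of every character
// that has a special meaning in a pattern.
function escapeRegExp( text ) {
  return text.replace( /[.*+?^${}()|[\]\\]/g, '\\$&' );
}

var word = 'a.b'; // stands in for user input
var unsafe = new RegExp( '\\b' + word + '\\b' );
var safe = new RegExp( '\\b' + escapeRegExp( word ) + '\\b' );

console.log( unsafe.test( 'axb' ) ); // true, the dot matches any character
console.log( safe.test( 'axb' ) );   // false, the dot is now literal
console.log( safe.test( 'a.b' ) );   // true
```

Whether input should be escaped depends on the intent: if users are meant to type patterns themselves, the input is passed through unchanged.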

Flags

There are three classic flags that you may use on a RegExp (later versions of the language added more). The multiline flag is poorly supported in older browsers, but the other two are supported in pretty much every browser that can handle RegExp.

Flag  Name             Description
g     Global Search    Makes the RegExp search for the pattern throughout the string instead of stopping at the first match; String.match() then returns an array of all occurrences it can find.
i     Ignore Case      Makes a regular expression case insensitive, possibly with the exception of extended characters such as æ.
m     Multiline Input  Makes the beginning (^) and end of input ($) metacharacters also match at the beginning and end of lines respectively.

Flags are an integral part of the expression. They cannot be changed or set after the object has been created.
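A short sketch of what each of the three flags does. The sample text is made up for the illustration; the properties global, ignoreCase and multiline are the standard read-only properties of a RegExp object.

```javascript
var text = 'Mac\nmac\nWin'; // three lines of made-up input

// No flags: case sensitive, and only the first match is reported.
var plain = text.match( /mac/ );

// g and i together: every occurrence, regardless of case.
var all = text.match( /mac/gi );

// With m, ^ matches at the start of every line, not only at the start of input.
var lineStarts = text.match( /^\w/gm );
var inputStart = text.match( /^\w/g );

console.log( plain[0] );   // 'mac'
console.log( all );        // [ 'Mac', 'mac' ]
console.log( lineStarts ); // [ 'M', 'm', 'W' ]
console.log( inputStart ); // [ 'M' ]

// The flags can be read back from the object.
var re = /mac/gi;
console.log( re.global, re.ignoreCase, re.multiline ); // true true false
```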

Pattern

The patterns used in RegExp can be very simple, or very complicated, depending on what you're trying to accomplish. To match a simple string like 'Hello World!' is no harder than actually writing the string, but if you want to match an e-mail address or HTML tag, you might end up with a very complicated pattern that will use most of the syntax presented in the table below.

Pattern Description
Escaping
\ Escapes special characters to literal and literal characters to special.

E.g: /\(s\)/ matches '(s)' while /(\s)/ matches any whitespace and captures the match.
Quantifiers
{n}, {n,}, {n,m}, *, +, ? Quantifiers match the preceding subpattern a certain number of times. The subpattern can be a single character, an escape sequence, a pattern enclosed by parentheses or a character set.

{n} matches exactly n times.
{n,} matches n or more times.
{n,m} matches n to m times.
* is short for {0,}. Matches zero or more times.
+ is short for {1,}. Matches one or more times.
? is short for {0,1}. Matches zero or one time.

E.g: /o{1,3}/ matches 'oo' in "tooth" and 'o' in "nose".
Pattern delimiters
(pattern), (?:pattern) Matches entire contained pattern.

(pattern) captures the match.
(?:pattern) doesn't capture the match.

E.g: /(d).\1/ matches 'dad' in "abcdadef" and captures 'd', while /(?:.d){2}/ matches 'cdad' but captures nothing.

Note: (?:pattern) is very badly supported in older browsers.
Lookaheads
(?=pattern), (?!pattern) A lookahead matches only if the preceding subexpression is followed by the pattern, but the pattern is not part of the match. The subexpression is the part of the regular expression which will be matched.

(?=pattern) matches only if there is a following pattern in input.
(?!pattern) matches only if there is not a following pattern in input.

E.g: /Win(?=98)/ matches 'Win' only if 'Win' is followed by '98'.

Note: Support for lookaheads is lacking in all but the newest browsers.
Alternation
| Alternation matches content on either side of the alternation character.

E.g: /(a|b)a/ matches 'aa' in "dseaas" and 'ba' in "acbab".
Character sets
[characters], [^characters] Matches any of the contained characters. A range of characters may be defined by using a hyphen.

[characters] matches any of the contained characters.
[^characters] negates the character set and matches all but the contained characters

E.g: /[abcd]/ matches any of the characters 'a', 'b', 'c', 'd' and may be abbreviated to /[a-d]/. Ranges must be in ascending order, otherwise they will throw an error. (E.g: /[d-a]/ will throw an error.)
/[^0-9]/ matches all characters but digits.

Note: Most special characters are automatically escaped to their literal meaning in character sets.
Other special characters
^, $, ., ? Special characters are characters that stand for something other than themselves.

^ matches beginning of input (or beginning of a line with m flag).
$ matches end of input (or end of line with m flag).
. matches any character except a newline.
? directly following a quantifier makes the quantifier non-greedy (makes it match the minimum instead of the maximum of the interval defined).

E.g: /(.)*?/ matches the empty string '' in all strings, since the non-greedy quantifier settles for zero repetitions.

Note: Non-greedy matches are not supported in older browsers such as Netscape Navigator 4 or Microsoft Internet Explorer 5.0.
Literal characters
All characters except those with special meaning. Mapped directly to the corresponding character.

E.g: /a/ matches 'a' in "Any ancestor".
Backreferences
\n Backreferences match the same text as a previously captured match. n is a positive nonzero integer telling the browser which captured match to refer to.

/(\S)\1(\1)+/g matches all occurrences of three or more equal non-whitespace characters following each other.
/<(\S+).*>(.*)<\/\1>/ matches a simple element: an opening tag, its content and the closing tag of the same name.

E.g: /<(\S+).*>(.*)<\/\1>/ matches '<div id="me">text</div>' in "text<div id=\"me\">text</div>text".
Character Escapes
\f, \r, \n, \t, \v, \0, [\b], \s, \S, \w, \W, \d, \D, \b, \B, \cX, \xhh, \uhhhh

\f matches form-feed.
\r matches carriage return.
\n matches linefeed.
\t matches horizontal tab.
\v matches vertical tab.
\0 matches NUL character.
[\b] matches backspace.
\s matches whitespace, including the space character (roughly short for [ \f\n\r\t\v\u00A0\u2028\u2029]).
\S matches anything but whitespace (roughly short for [^ \f\n\r\t\v\u00A0\u2028\u2029]).
\w matches any alphanumerical character (word characters) including underscore (short for [a-zA-Z0-9_]).
\W matches any non-word characters (short for [^a-zA-Z0-9_]).
\d matches any digit (short for [0-9]).
\D matches any non-digit (short for [^0-9]).
\b matches a word boundary (the position between a word character and a non-word character, or the start or end of the input next to a word character).
\B matches a non-word boundary (any position where \b does not match).
\cX matches a control character. E.g: \cm matches control-M.
\xhh matches the character with the two-digit hexadecimal code hh.
\uhhhh matches the Unicode character with the four-digit hexadecimal code hhhh.
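A few entries from the table can be tried out with the sketch below. The input strings are made up for the illustration, apart from "abcdadef", which is the example used in the table; the non-greedy and lookahead parts assume a browser that supports those features.

```javascript
var html = '<b>one</b><b>two</b>';

// Greedy: .* takes as much as it can, so the match runs to the last </b>.
var greedy = html.match( /<b>.*<\/b>/ )[0];
// Non-greedy: .*? takes as little as it can, so the match stops at the first </b>.
var lazy = html.match( /<b>.*?<\/b>/ )[0];
console.log( greedy ); // '<b>one</b><b>two</b>'
console.log( lazy );   // '<b>one</b>'

// Lookahead: '98' must follow, but is not part of the match and is not replaced.
var renamed = 'Win95 Win98'.replace( /Win(?=98)/, 'Windows ' );
console.log( renamed ); // 'Win95 Windows 98'

// Negative lookahead: 'Win' that is not followed by '98'.
var notWin98 = /Win(?!98)/.exec( 'Win98 Win95' );
console.log( notWin98.index ); // 6

// Capturing versus non-capturing parentheses.
var capturing = /(d).\1/.exec( 'abcdadef' );
var nonCapturing = /(?:.d){2}/.exec( 'abcdadef' );
console.log( capturing[0], capturing[1] );            // 'dad' 'd'
console.log( nonCapturing[0], nonCapturing.length );  // 'cdad' 1

// Word boundary and non-word boundary.
var words = 'cat concat cat';
console.log( words.match( /\bcat\b/g ).length ); // 2, the two separate words
console.log( words.match( /\Bcat/g ).length );   // 1, the 'cat' inside 'concat'
```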

Usage

Now, knowing how a RegExp is written is only half the game. To gain anything from regular expressions you have to know how to use them too. There are a number of ways to apply a RegExp, some through methods belonging to the String object, some through methods belonging to the RegExp object. Whether the regular expression is declared through an object constructor or a literal makes no difference to the usage.

RegExp.exec( string )
//Applies the RegExp to the given string, and returns the match information.

RegExp.test( string )
//Returns a boolean true if the given string matches the Regexp, false if not.

String.match( pattern )
//Matches given string with the RegExp. With 'g' flag returns an array
//  containing all matches, without 'g' flag returns just the first match or
//  if no match is found returns null.

String.search( pattern )
//Returns the index of the beginning of the match if found, -1 if not.

String.replace( pattern, string )
//Returns a string where matches have been replaced with the given string.

String.split( pattern )
//Cuts a string into an array, making cuts at the matches.
Description and example for each method

RegExp.exec(string)
Applies the pattern to the given string, and returns the match information.
var match = /s(amp)le/i.exec("Sample text")
match then contains ["Sample","amp"]

RegExp.test(string)
Tests if the given string matches the RegExp, and returns true if matching, false if not.
var match = /sample/.test("Sample text")
match then contains false

String.match(pattern)
Matches given string with the RegExp. With g flag returns an array containing the matches, without g flag returns just the first match or if no match is found returns null.
var str = "Watch out for the rock!".match(/r?or?/g)
str then contains ["o","or","ro"]

String.search(pattern)
Matches RegExp with string and returns the index of the beginning of the match if found, -1 if not.
var ndx = "Watch out for the rock!".search(/for/)
ndx then contains 10

String.replace(pattern,string)
Replaces matches with the given string, and returns the edited string.
var str = "Liorean said: My name is Liorean!".replace(/Liorean/g,'Big Fat Dork')
str then contains "Big Fat Dork said: My name is Big Fat Dork!"

String.split(pattern)
Cuts a string into an array, making cuts at matches.
var str = "I am confused".split(/\s/g)
str then contains ["I","am","confused"]
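The "match information" returned by exec() and by match() without the g flag is an array that also carries the position of the match. The sketch below reuses the rock example from above to show this, and to show one common way of walking through all matches with exec(): a RegExp with the g flag remembers where the previous match ended, so repeated calls continue from there. The variable names are chosen for this illustration only.

```javascript
var text = 'Watch out for the rock!';

// Without g: the whole match, then the captured groups, plus an index property.
var first = text.match( /(r?)o(r?)/ );
console.log( first[0], first.index ); // 'o' 6
console.log( first.length );          // 3: the match and two (empty) captures

// With g: call exec() until it returns null to visit every match in turn.
var re = /r?or?/g;
var found = [];
var m;
while ( ( m = re.exec( text ) ) !== null ) {
  found.push( m[0] + '@' + m.index );
}
console.log( found ); // [ 'o@6', 'or@11', 'ro@18' ]
```

This loop is safe here because every match contains at least the letter o; a pattern that can match the empty string would need extra care to avoid looping forever.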