Author: liorean (evolt@liorean.f2o.org); 09/09/2002; found in Code, 4umi edited.
Regular expressions are patterns of characters: character sequences that may or may not occur in
a given text. Take for example the DOS wildcards ? and
* used when searching for files, which have been common since
computers became mainstream. They are a very limited subset of what RegExp offers. For instance, if you want to find all
files beginning with foo, followed by 1 to 4 arbitrary
characters, and ending with .txt, you can't do that with
the usual DOS wildcards. RegExp, on the other hand, can handle that and much
more complicated patterns.
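
As a rough sketch of the file name example above, this is how such a pattern might look in Javascript. The file names and the variable names are made up for illustration.

```javascript
// Hypothetical file names, used only for illustration.
const fileNames = [ 'foo1.txt', 'fooABCD.txt', 'foo.txt', 'fooABCDE.txt', 'barfoo12.txt' ];

// ^foo    the name starts with "foo"
// .{1,4}  followed by 1 to 4 characters of any kind
// \.txt$  and ends with a literal ".txt"
const fooPattern = /^foo.{1,4}\.txt$/;

const matchingFiles = fileNames.filter( function ( name ) {
  return fooPattern.test( name );
} );

console.log( matchingFiles ); // [ 'foo1.txt', 'fooABCD.txt' ]
```
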
Regular expressions are, in short, a way to effectively handle data, search and replace strings, and provide extended string handling. Often a regular expression can in itself provide string handling that other functionalities such as the built-in string methods and properties can only do if you use them in a complicated function or loop.
Various programming languages implement regular expressions, with some differences between them. In Javascript, support starts with version 1.2, which comes with Internet Explorer 4 and later, and Netscape 4 and later.
There are two ways of defining regular expressions in Javascript: one through an object constructor and one through a literal. The object can be changed at runtime, whereas the literal is compiled when the script is loaded and provides better performance. The literal is therefore the best choice for regular expressions known in advance, while the constructor allows for dynamically constructed regular expressions such as those built from user input. In almost all cases you can use either way to define a regular expression, and both are handled in exactly the same way no matter how you declare them.
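
A minimal sketch of the two ways side by side, assuming a made-up pattern and sample text. Both objects hold the same pattern and give the same result.

```javascript
// The same pattern declared as a literal and through the constructor.
const literalRe = /colou?r/i;
const constructedRe = new RegExp( 'colou?r', 'i' );

// Hypothetical sample text.
const colourText = 'My favourite Colour is blue';

const literalHit = literalRe.exec( colourText )[ 0 ];         // 'Colour'
const constructedHit = constructedRe.exec( colourText )[ 0 ]; // 'Colour'

console.log( literalRe.source === constructedRe.source ); // true
console.log( literalHit === constructedHit );             // true
```
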
Just as a string is declared by putting some text between quotation marks (either the single (') or the double (") variant), RegExp literals are delimited by the forward slash character / in Javascript. This may be confusing at first, as it is quite an unusual notation. Other languages such as PHP or VBScript use other characters for the same purpose in their regular expressions. Any flags go immediately after the closing slash.
/pattern/flags
var re = /mac|win/i;
/^(\d\d)([-:\/])(\d\d)\2(\d\d)$/.exec( '02:22:44' );
/[\u0024\u20ac]\s*\d+/.test( somemoneystring );
'God save the the Queen'.match( /\b(\w+)\b\s+\1/g );
Extended characters may be entered in the pattern through their Unicode number. In the example above, \u0024 is the dollar, \u20ac is the €.
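
For reference, here is roughly what the literal examples above produce. The value of someMoneyString is a hypothetical input, since the original leaves that variable undefined.

```javascript
// A time-like string whose two separators must be identical (\2 repeats the first one).
const timeParts = /^(\d\d)([-:\/])(\d\d)\2(\d\d)$/.exec( '02:22:44' );
// timeParts[0] is '02:22:44', timeParts[1] is '02', timeParts[2] is ':',
// timeParts[3] is '22' and timeParts[4] is '44'.

// Mixed separators do not match, so exec() returns null.
const mixedSeparators = /^(\d\d)([-:\/])(\d\d)\2(\d\d)$/.exec( '02:22-44' );

// Hypothetical input; \u0024 is the dollar sign and \u20ac is the euro sign.
const someMoneyString = 'Total: \u20ac 25';
const hasAmount = /[\u0024\u20ac]\s*\d+/.test( someMoneyString ); // true

// Finds a word that is immediately repeated.
const repeatedWords = 'God save the the Queen'.match( /\b(\w+)\b\s+\1/g ); // [ 'the the' ]
```
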
To create a regular expression at runtime, the RegExp constructor is called with two parameters: first the pattern, then the flags, both as strings. If no flags are required, the second parameter can be left out. Since backslashes are used to escape special characters in strings as well as in regular expressions, they must themselves be escaped with another backslash to reach the expression.
new RegExp( 'pattern', 'flags' );
var re = new RegExp( '\\d{' + min + ',' + max + '}' );
var re = new RegExp( '\\b' + window.prompt( 'Your word:', 'example' ) + '\\b', 'gi' );
var re = new RegExp( new Date().getDate() + ' (\\w) ' + new Date().getFullYear(), 'g' );
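
The sketch below shows what the doubled backslashes turn into, and one possible way to treat user input as literal text. The values of min, max and userWord are made up, and escapeForRegExp is a hypothetical helper, not a built-in function.

```javascript
// Hypothetical bounds for the quantifier.
const min = 2;
const max = 4;

// '\\d' in the string becomes \d in the pattern.
const digitRun = new RegExp( '\\d{' + min + ',' + max + '}' );
const digitSource = digitRun.source;                   // '\d{2,4}'
const digitHit = digitRun.exec( 'order 123456' )[ 0 ]; // '1234'

// Hypothetical helper: puts a backslash in front of every character
// that has a special meaning in a pattern.
function escapeForRegExp( text ) {
  return text.replace( /[.*+?^${}()|[\]\\]/g, '\\$&' );
}

const userWord = 'a.b'; // stands in for the window.prompt() result
const unescapedRe = new RegExp( '\\b' + userWord + '\\b', 'i' );
const escapedRe = new RegExp( '\\b' + escapeForRegExp( userWord ) + '\\b', 'i' );

console.log( unescapedRe.test( 'axb' ) );        // true: the dot matched 'x'
console.log( escapedRe.test( 'axb' ) );          // false: the dot is now literal
console.log( escapedRe.test( 'see a.b here' ) ); // true
```

Without such escaping, any special character typed by the user becomes part of the pattern, which may or may not be what you want.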
There are three flags that you may use on a RegExp. The multiline flag is poorly supported in older browsers, but the other two are supported in pretty much every browser that can handle RegExp.
| Flag | Name | Description |
|---|---|---|
| g | Global Search | Makes the RegExp search for a pattern throughout the string, creating an array of all occurrences it can find matching the given pattern. |
| i | Ignore Case | Makes a regular expression case insensitive, possibly with the exception of extended characters such as æ. |
| m | Multiline Input | Makes the beginning of input (^) and end of input ($) metacharacters also match at the beginning and end of lines respectively. |
Flags are an integral part of the expression. They cannot be changed or set after the object has been created.
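
A small sketch of what each flag does, using a made-up three-line string. The last lines show one way to get different flags: build a new object from the source of the old one.

```javascript
// Hypothetical sample text with three lines.
const flagText = 'One fish\ntwo Fish\nred fish';

// No flags: only the first, case-sensitive match.
const firstOnly = flagText.match( /fish/ )[ 0 ]; // 'fish'

// g: every case-sensitive match.
const allLower = flagText.match( /fish/g );      // [ 'fish', 'fish' ]

// g and i combined: every match, regardless of case.
const anyCase = flagText.match( /fish/gi );      // [ 'fish', 'Fish', 'fish' ]

// Without m, ^ only matches at the very beginning of the input.
const inputStart = flagText.match( /^\w+/g );    // [ 'One' ]

// With m, ^ also matches at the beginning of each line.
const lineStarts = flagText.match( /^\w+/gm );   // [ 'One', 'two', 'red' ]

// The flags of an existing object can be read but not set,
// so a new object is created to add the g flag.
const caseRe = /fish/i;
const globalCaseRe = new RegExp( caseRe.source, 'gi' );
console.log( caseRe.global, globalCaseRe.global ); // false true
```
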
The patterns used in RegExp can be very simple or very complicated, depending on what you're trying to accomplish. Matching a simple string like 'Hello World!' is no harder than actually writing the string, but if you want to match an e-mail address or HTML tag, you might end up with a very complicated pattern that will use most of the syntax presented in the table below.
| Pattern | Description |
|---|---|
| Escaping | |
| \ | Escapes special characters to literal and literal characters to special. E.g.: /\(s\)/ matches '(s)' while /(\s)/ matches any whitespace and captures the match. |
| Quantifiers | |
| {n}, {n,}, {n,m}, *, +, ? | Quantifiers match the preceding subpattern a certain number of times. The subpattern can be a single character, an escape sequence, a pattern enclosed by parentheses or a character set. {n} matches exactly n times. {n,} matches n or more times. {n,m} matches n to m times. * is short for {0,}: matches zero or more times. + is short for {1,}: matches one or more times. ? is short for {0,1}: matches zero or one time. E.g.: /o{1,3}/ matches 'oo' in "tooth" and 'o' in "nose". |
| Pattern delimiters | |
| (pattern), (?:pattern) | Matches the entire contained pattern. (pattern) captures the match. (?:pattern) doesn't capture the match. E.g.: /(d).\1/ matches 'dad' in "abcdadef" and captures 'd', while /(?:.d){2}/ matches 'cdad' but captures nothing. Note: (?:pattern) is very badly supported in older browsers. |
| Lookaheads | |
| (?=pattern), (?!pattern) | A lookahead matches only if the preceding subexpression is followed by the pattern, but the pattern is not part of the match. The subexpression is the part of the regular expression which will be matched. (?=pattern) matches only if there is a following pattern in input. (?!pattern) matches only if there is not a following pattern in input. E.g.: /Win(?=98)/ matches 'Win' only if 'Win' is followed by '98'. Note: Support for lookaheads is lacking in all but the newest browsers. |
| Alternation | |
| \| | Alternation matches content on either side of the alternation character. E.g.: /(a\|b)a/ matches 'aa' in "dseaas" and 'ba' in "acbab". |
| Character sets | |
| [characters], [^characters] | A range of characters may be defined by using a hyphen. [characters] matches any of the contained characters. [^characters] negates the character set and matches all but the contained characters. E.g.: /[abcd]/ matches any of the characters 'a', 'b', 'c', 'd' and may be abbreviated to /[a-d]/. Ranges must be in ascending order, otherwise they will throw an error. (E.g.: /[d-a]/ will throw an error.) /[^0-9]/ matches all characters but digits. Note: Most special characters are automatically escaped to their literal meaning in character sets. |
| Other special characters | |
| ^, $, ., ? | Special characters are characters that match something other than what they appear as. ^ matches the beginning of input (or of a line with the m flag). $ matches the end of input (or of a line with the m flag). . matches any character except a newline. ? directly following a quantifier makes the quantifier non-greedy (makes it match the minimum instead of the maximum of the interval defined). E.g.: /(.)*?/ matches the empty string '' in every string. Note: Non-greedy matches are not supported in older browsers such as Netscape Navigator 4 or Microsoft Internet Explorer 5.0. |
| Literal characters | |
| All characters except those with special meaning. | Mapped directly to the corresponding character. E.g.: /a/ matches 'a' in "Any ancestor". |
| Backreferences | |
| \n | Backreferences refer to the same text as a previously captured match. n is a positive nonzero integer telling the browser which captured match to refer to. /(\S)\1(\1)+/g matches all occurrences of three or more equal non-whitespace characters following each other. /<(\S+).*>(.*)<\/\1>/ matches an opening tag, its content and the closing tag of the same name. E.g.: it matches '<div id="me">text</div>' in "text<div id=\"me\">text</div>text". |
| Character Escapes | |
| \f, \r, \n, \t, \v, \0, [\b], \s, \S, \w, \W, \d, \D, \b, \B, \cX, \xhh, \uhhhh | \f matches form-feed. \r matches carriage return. \n matches linefeed. \t matches horizontal tab. \v matches vertical tab. \0 matches NUL character. [\b] matches backspace. \s matches whitespace (short for [ \f\n\r\t\v\u00A0\u2028\u2029], including the space character). \S matches anything but whitespace (short for [^ \f\n\r\t\v\u00A0\u2028\u2029]). \w matches any alphanumerical character (word characters) including underscore (short for [a-zA-Z0-9_]). \W matches any non-word character (short for [^a-zA-Z0-9_]). \d matches any digit (short for [0-9]). \D matches any non-digit (short for [^0-9]). \b matches a word boundary (the position between a word character and a non-word character, or between a word character and the start or end of input). \B matches a non-word boundary (any position where \b does not match; not to be confused with [^\b], which matches any character but backspace). \cX matches a control character. E.g.: \cm matches control-M. \xhh matches the character with the two-digit hexadecimal code hh. \uhhhh matches the Unicode character with the four-digit hexadecimal code hhhh. |
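
A few entries from the table can be tried out as follows. This is only a sketch: the sample strings that do not appear in the table are made up.

```javascript
// Quantifiers: greedy versus non-greedy.
const greedyHit = '<b>one</b><b>two</b>'.match( /<b>.*<\/b>/ )[ 0 ];  // the whole string
const lazyHit = '<b>one</b><b>two</b>'.match( /<b>.*?<\/b>/ )[ 0 ];   // '<b>one</b>'
const toothHit = 'tooth'.match( /o{1,3}/ )[ 0 ];                      // 'oo'

// Capturing versus non-capturing parentheses.
const capturing = /(d).\1/.exec( 'abcdadef' );       // capturing[0] is 'dad', capturing[1] is 'd'
const nonCapturing = /(?:.d){2}/.exec( 'abcdadef' ); // nonCapturing[0] is 'cdad', nothing captured

// Lookaheads: the looked-ahead text is not part of the match.
const win98Hit = 'Win98'.match( /Win(?=98)/ )[ 0 ];  // 'Win'
const notWin98 = /Win(?!98)/.test( 'WinXP' );        // true
const isWin98 = /Win(?=98)/.test( 'WinXP' );         // false

// Backreferences: three or more equal non-whitespace characters in a row.
const runs = 'aaa bb cccc'.match( /(\S)\1(\1)+/g );  // [ 'aaa', 'cccc' ]

// \b matches at a word boundary, \B everywhere else.
const atBoundary = 'cat concat'.search( /\bcat/ );   // 0
const insideWord = 'cat concat'.search( /\Bcat/ );   // 7
```
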
Now, knowing how a RegExp is written is only half the game. To gain anything from them you have to know how to use them too. There are a number of ways to apply a RegExp, some through methods belonging to the String object, some through methods belonging to the RegExp object. Whether the regular expression is declared through an object constructor or a literal makes no difference to the usage.
RegExp.exec( string ) // Applies the RegExp to the given string, and returns the match information.
RegExp.test( string ) // Returns a boolean true if the given string matches the RegExp, false if not.
String.match( pattern ) // Matches given string with the RegExp. With 'g' flag returns an array
  // containing all matches, without 'g' flag returns just the first match or
  // if no match is found returns null.
String.search( pattern ) // Returns the index of the beginning of the match if found, -1 if not.
String.replace( pattern, string ) // Returns a string where matches have been replaced with the given string.
String.split( pattern ) // Cuts a string into an array, making cuts at the matches.
| Description | Example |
|---|---|
| RegExp.exec(string) | |
| Applies the pattern to the given string, and returns the match information. | var match = /s(amp)le/i.exec("Sample text"); match then contains ["Sample","amp"] |
| RegExp.test(string) | |
| Tests if the given string matches the RegExp, and returns true if matching, false if not. | var match = /sample/.test("Sample text"); match then contains false |
| String.match(pattern) | |
| Matches given string with the RegExp. With the g flag returns an array containing the matches, without the g flag returns just the first match or, if no match is found, returns null. | var str = "Watch out for the rock!".match(/r?or?/g); str then contains ["o","or","ro"] |
| String.search(pattern) | |
| Matches RegExp with string and returns the index of the beginning of the match if found, -1 if not. | var ndx = "Watch out for the rock!".search(/for/); ndx then contains 10 |
| String.replace(pattern,string) | |
| Replaces matches with the given string, and returns the edited string. | var str = "Liorean said: My name is Liorean!".replace(/Liorean/g,'Big Fat Dork'); str then contains "Big Fat Dork said: My name is Big Fat Dork!" |
| String.split(pattern) | |
| Cuts a string into an array, making cuts at matches. | var str = "I am confused".split(/\s/g); str then contains ["I","am","confused"] |
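
To round this off, here is a sketch that uses several of the methods on one made-up string. Two details go beyond the table above: with the g flag, exec() continues after the previous match on each call, and $1 and $2 in a replacement string stand for the captured matches.

```javascript
// Hypothetical input string.
const logLine = 'start=09:15 end=17:45';
const timeRe = /(\d\d):(\d\d)/g;

// Collect every match together with its captures and position.
const foundTimes = [];
let timeMatch;
while ( ( timeMatch = timeRe.exec( logLine ) ) !== null ) {
  foundTimes.push( { hours: timeMatch[ 1 ], minutes: timeMatch[ 2 ], index: timeMatch.index } );
}
// foundTimes holds 09:15 (found at index 6) and 17:45 (found at index 16).

// Reuse the captured matches in the replacement string.
const rewritten = logLine.replace( /(\d\d):(\d\d)/g, '$1h$2' ); // 'start=09h15 end=17h45'

// Cut at every space or equals sign.
const logParts = logLine.split( /[ =]/ ); // [ 'start', '09:15', 'end', '17:45' ]

// Position of the first digit.
const firstDigitAt = logLine.search( /\d/ ); // 6
```
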