mirror of
https://github.com/Stichting-MINIX-Research-Foundation/netbsd.git
synced 2025-09-10 07:39:25 -04:00
434 lines
14 KiB
HTML
434 lines
14 KiB
HTML
<h1>TRE Regexp Syntax</h1>
|
|
|
|
<p>
|
|
This document describes the POSIX 1003.2 extended RE (ERE) syntax and
|
|
the basic RE (BRE) syntax as implemented by TRE, and the TRE extensions
|
|
to the ERE syntax. A simple Extended Backus-Naur Form (EBNF) style
|
|
notation is used to describe the grammar.
|
|
</p>
|
|
|
|
<h2>ERE Syntax</h2>
|
|
|
|
<h3>Alternation operator</h3>
|
|
<a name="alternation"></a>
|
|
<a name="extended-regexp"></a>
|
|
|
|
<table bgcolor="#e0e0f0" cellpadding="10">
|
|
<tr><td>
|
|
<pre>
|
|
<i>extended-regexp</i> ::= <a href="#branch"><i>branch</i></a>
|
|
| <i>extended-regexp</i> <b>"|"</b> <a href="#branch"><i>branch</i></a>
|
|
</pre>
|
|
</td></tr>
|
|
</table>
|
|
<p>
|
|
An extended regexp (ERE) is one or more <i>branches</i>, separated by
|
|
<tt>|</tt>. An ERE matches anything that matches one or more of the
|
|
branches.
|
|
</p>
|
|
|
|
<h3>Catenation of REs</h3>
|
|
<a name="catenation"></a>
|
|
<a name="branch"></a>
|
|
|
|
<table bgcolor="#e0e0f0" cellpadding="10">
|
|
<tr><td>
|
|
<pre>
|
|
<i>branch</i> ::= <i>piece</i>
|
|
| <i>branch</i> <i>piece</i>
|
|
</pre>
|
|
</td></tr>
|
|
</table>
|
|
<p>
|
|
A branch is one or more <i>pieces</i> concatenated. It matches a
|
|
match for the first piece, followed by a match for the second piece,
|
|
and so on.
|
|
</p>
|
|
|
|
|
|
<table bgcolor="#e0e0f0" cellpadding="10">
|
|
<tr><td>
|
|
<pre>
|
|
<i>piece</i> ::= <i>atom</i>
|
|
| <i>atom</i> <a href="#repeat-operator"><i>repeat-operator</i></a>
|
|
| <i>atom</i> <a href="#approx-settings"><i>approx-settings</i></a>
|
|
</pre>
|
|
</td></tr>
|
|
</table>
|
|
<p>
|
|
A piece is an <i>atom</i> possibly followed by a repeat operator or an
|
|
expression controlling approximate matching parameters for the <i>atom</i>.
|
|
</p>
|
|
|
|
|
|
<table bgcolor="#e0e0f0" cellpadding="10">
|
|
<tr><td>
|
|
<pre>
|
|
<i>atom</i> ::= <b>"("</b> <i>extended-regexp</i> <b>")"</b>
|
|
| <a href="#bracket-expression"><i>bracket-expression</i></a>
|
|
| <b>"."</b>
|
|
| <a href="#assertion"><i>assertion</i></a>
|
|
| <a href="#literal"><i>literal</i></a>
|
|
| <a href="#backref"><i>back-reference</i></a>
|
|
| <b>"(?#"</b> <i>comment-text</i> <b>")"</b>
|
|
| <b>"(?"</b> <a href="#options"><i>options</i></a> <b>")"</b> <i>extended-regexp</i>
|
|
| <b>"(?"</b> <a href="#options"><i>options</i></a> <b>":"</b> <i>extended-regexp</i> <b>")"</b>
|
|
</pre>
|
|
</td></tr>
|
|
</table>
|
|
<p>
|
|
An atom is either an ERE enclosed in parenthesis, a bracket
|
|
expression, a <tt>.</tt> (period), an assertion, or a literal.
|
|
</p>
|
|
|
|
<p>
|
|
The dot (<tt>.</tt>) matches any single character.
|
|
If the <code>REG_NEWLINE</code> compilation flag (see <a
|
|
href="api.html">API manual</a>) is specified, the newline
|
|
character is not matched.
|
|
</p>
|
|
|
|
<p>
|
|
<tt>Comment-text</tt> can contain any characters except for a closing parenthesis <tt>)</tt>. The text in the comment is
|
|
completely ignored by the regex parser and it used solely for readability purposes.
|
|
</p>
|
|
|
|
<h3>Repeat operators</h3>
|
|
<a name="repeat-operator"></a>
|
|
|
|
<table bgcolor="#e0e0f0" cellpadding="10">
|
|
<tr><td>
|
|
<pre>
|
|
<i>repeat-operator</i> ::= <b>"*"</b>
|
|
| <b>"+"</b>
|
|
| <b>"?"</b>
|
|
| <i>bound</i>
|
|
| <b>"*?"</b>
|
|
| <b>"+?"</b>
|
|
| <b>"??"</b>
|
|
| <i>bound</i> <b>?</b>
|
|
</pre>
|
|
</td></tr>
|
|
</table>
|
|
|
|
<p>
|
|
An atom followed by <tt>*</tt> matches a sequence of 0 or more matches
|
|
of the atom. <tt>+</tt> is similar to <tt>*</tt>, matching a sequence
|
|
of 1 or more matches of the atom. An atom followed by <tt>?</tt>
|
|
matches a sequence of 0 or 1 matches of the atom.
|
|
</p>
|
|
|
|
<p>
|
|
A <i>bound</i> is one of the following, where <i>m</i> and <i>m</i>
|
|
are unsigned decimal integers between <tt>0</tt> and
|
|
<tt>RE_DUP_MAX</tt>:
|
|
</p>
|
|
|
|
<ol>
|
|
<li><tt>{</tt><i>m</i><tt>,</tt><i>n</i><tt>}</tt></li>
|
|
<li><tt>{</tt><i>m</i><tt>,}</tt></li>
|
|
<li><tt>{</tt><i>m</i><tt>}</tt></li>
|
|
</ol>
|
|
|
|
<p>
|
|
An atom followed by [1] matches a sequence of <i>m</i> through <i>n</i>
|
|
(inclusive) matches of the atom. An atom followed by [2]
|
|
matches a sequence of <i>m</i> or more matches of the atom. An atom
|
|
followed by [3] matches a sequence of exactly <i>m</i> matches of the
|
|
atom.
|
|
</p>
|
|
|
|
|
|
<p>
|
|
Adding a <tt>?</tt> to a repeat operator makes the subexpression minimal, or
|
|
non-greedy. Normally a repeated expression is greedy, that is, it matches as
|
|
many characters as possible. A non-greedy subexpression matches as few
|
|
characters as possible. Note that this does not (always) mean the same thing
|
|
as matching as many or few repetitions as possible. Also note
|
|
that <strong>minimal repetitions are not currently supported for approximate
|
|
matching</strong>.
|
|
</p>
|
|
|
|
<h3>Approximate matching settings</h3>
|
|
<a name="approx-settings"></a>
|
|
|
|
<table bgcolor="#e0e0f0" cellpadding="10">
|
|
<tr><td>
|
|
<pre>
|
|
<i>approx-settings</i> ::= <b>"{"</b> <i>count-limits</i>* <b>","</b>? <i>cost-equation</i>? <b>"}"</b>
|
|
|
|
<i>count-limits</i> ::= <b>"+"</b> <i>number</i>?
|
|
| <b>"-"</b> <i>number</i>?
|
|
| <b>"#"</b> <i>number</i>?
|
|
| <b>"~"</b> <i>number</i>?
|
|
|
|
<i>cost-equation</i> ::= ( <i>cost-term</i> "+"? " "? )+ <b>"<"</b> <i>number</i>
|
|
|
|
<i>cost-term</i> ::= <i>number</i> <b>"i"</b>
|
|
| <i>number</i> <b>"d"</b>
|
|
| <i>number</i> <b>"s"</b>
|
|
|
|
</pre>
|
|
</td></tr>
|
|
</table>
|
|
|
|
<p>
|
|
The approximate matching settings for a subpattern can be changed
|
|
by appending <i>approx-settings</i> to the subpattern. Limits for
|
|
the number of errors can be set and an expression for specifying and
|
|
limiting the costs can be given.
|
|
</p>
|
|
|
|
<p>
|
|
The <i>count-limits</i> can be used to set limits for the number of
|
|
insertions (<tt>+</tt>), deletions (<tt>-</tt>), substitutions
|
|
(<tt>#</tt>), and total number of errors (<tt>~</tt>). If the
|
|
<i>number</i> part is omitted, the specified error count will be
|
|
unlimited.
|
|
</p>
|
|
|
|
<p>
|
|
The <i>cost-equation</i> can be thought of as a mathematical equation,
|
|
where <tt>i</tt>, <tt>d</tt>, and <tt>s</tt> stand for the number of
|
|
insertions, deletions, and substitutions, respectively. The equation
|
|
can have a multiplier for each of <tt>i</tt>, <tt>d</tt>, and
|
|
<tt>s</tt>. The multiplier is the cost of the error, and the number
|
|
after <tt><</tt> is the maximum allowed cost of a match. Spaces
|
|
and pluses can be inserted to make the equation readable. In fact, when
|
|
specifying only a cost equation, adding a space after the opening <tt>{</tt>
|
|
is <strong>required</strong>.
|
|
</p>
|
|
|
|
<p>
|
|
Examples:
|
|
<dl>
|
|
<dt><tt>{~}</tt></dt>
|
|
<dd>Sets the maximum number of errors to unlimited.</dd>
|
|
<dt><tt>{~3}</tt></dt>
|
|
<dd>Sets the maximum number of errors to three.</dd>
|
|
<dt><tt>{+2~5}</tt></dt>
|
|
<dd>Sets the maximum number of errors to five, and the maximum number
|
|
of insertions to two.</dd>
|
|
<dt><tt>{<3}</tt></dt>
|
|
<dd>Sets the maximum cost to three.
|
|
<dt><tt>{ 2i + 1d + 2s < 5 }</tt></dt>
|
|
<dd>Sets the cost of an insertion to two, a deletion to one, a
|
|
substitution to two, and the maximum cost to five.
|
|
</dl>
|
|
|
|
|
|
<h3>Bracket expressions</h3>
|
|
<a name="bracket-expression"></a>
|
|
|
|
<table bgcolor="#e0e0f0" cellpadding="10">
|
|
<tr><td>
|
|
<pre>
|
|
<i>bracket-expression</i> ::= <b>"["</b> <i>item</i>+ <b>"]"</b>
|
|
| <b>"[^"</b> <i>item</i>+ <b>"]"</b>
|
|
</pre>
|
|
</td></tr>
|
|
</table>
|
|
|
|
<p>
|
|
A bracket expression specifies a set of characters by enclosing a
|
|
nonempty list of items in brackets. Normally anything matching any
|
|
item in the list is matched. If the list begins with <tt>^</tt> the
|
|
meaning is negated; any character matching no item in the list is
|
|
matched.
|
|
</p>
|
|
|
|
<p>
|
|
An item is any of the following:
|
|
</p>
|
|
<ul>
|
|
<li>A single character, matching that character.</li>
|
|
<li>Two characters separated by <tt>-</tt>. This is shorthand for the
|
|
full range of characters between those two (inclusive) in the
|
|
collating sequence. For example, <tt>[0-9]</tt> in ASCII matches any
|
|
decimal digit.</li>
|
|
<li>A collating element enclosed in <tt>[.</tt> and <tt>.]</tt>,
|
|
matching the collating element. This can be used to include a literal
|
|
<tt>-</tt> or a multi-character collating element in the list.</li>
|
|
<li>A collating element enclosed in <tt>[=</tt> and <tt>=]</tt> (an
|
|
equivalence class), matching all collating elements with the same
|
|
primary collation weight as that element, including the element
|
|
itself.</li>
|
|
<li>The name of a character class enclosed in <tt>[:</tt> and
|
|
<tt>:]</tt>, matching any character belonging to the class. The set
|
|
of valid names depends on the <code>LC_CTYPE</code> category of the
|
|
current locale, but the following names are valid in all locales:
|
|
<ul>
|
|
<li><tt>alnum</tt> - alphanumeric characters</li>
|
|
<li><tt>alpha</tt> - alphabetic characters</li>
|
|
<li><tt>blank</tt> - blank characters</li>
|
|
<li><tt>cntrl</tt> - control characters</li>
|
|
<li><tt>digit</tt> - decimal digits (0 through 9)</li>
|
|
<li><tt>graph</tt> - all printable characters except space</li>
|
|
<li><tt>lower</tt> - lower-case letters</li>
|
|
<li><tt>print</tt> - printable characters including space</li>
|
|
<li><tt>punct</tt> - printable characters not space or alphanumeric</li>
|
|
<li><tt>space</tt> - white-space characters</li>
|
|
<li><tt>upper</tt> - upper case letters</li>
|
|
<li><tt>xdigit</tt> - hexadecimal digits</li>
|
|
</ul>
|
|
</ul>
|
|
<p>
|
|
To include a literal <tt>-</tt> in the list, make it either the first
|
|
or last item, the second endpoint of a range, or enclose it in
|
|
<tt>[.</tt> and <tt>.]</tt> to make it a collating element. To
|
|
include a literal <tt>]</tt> in the list, make it either the first
|
|
item, the second endpoint of a range, or enclose it in <tt>[.</tt> and
|
|
<tt>.]</tt>. To use a literal <tt>-</tt> as the first
|
|
endpoint of a range, enclose it in <tt>[.</tt> and <tt>.]</tt>.
|
|
</p>
|
|
|
|
|
|
<h3>Assertions</h3>
|
|
<a name="assertion"></a>
|
|
|
|
<table bgcolor="#e0e0f0" cellpadding="10">
|
|
<tr><td>
|
|
<pre>
|
|
<i>assertion</i> ::= <b>"^"</b>
|
|
| <b>"$"</b>
|
|
| <b>"\"</b> <i>assertion-character</i>
|
|
</pre>
|
|
</td></tr>
|
|
</table>
|
|
|
|
<p>
|
|
The expressions <tt>^</tt> and <tt>$</tt> are called "left anchor" and
|
|
"right anchor", respectively. The left anchor matches the empty
|
|
string at the beginning of the string. The right anchor matches the
|
|
empty string at the end of the string. The behaviour of both anchors
|
|
can be varied by specifying certain execution and compilation flags;
|
|
see the <a href="api.html">API manual</a>.
|
|
</p>
|
|
|
|
<p>
|
|
An assertion-character can be any of the following:
|
|
</p>
|
|
|
|
<ul>
|
|
<li><tt><</tt> - Beginning of word
|
|
<li><tt>></tt> - End of word
|
|
<li><tt>b</tt> - Word boundary
|
|
<li><tt>B</tt> - Non-word boundary
|
|
<li><tt>d</tt> - Digit character (equivalent to <tt>[[:digit:]]</tt>)</li>
|
|
<li><tt>D</tt> - Non-digit character (equivalent to <tt>[^[:digit:]]</tt>)</li>
|
|
<li><tt>s</tt> - Space character (equivalent to <tt>[[:space:]]</tt>)</li>
|
|
<li><tt>S</tt> - Non-space character (equivalent to <tt>[^[:space:]]</tt>)</li>
|
|
<li><tt>w</tt> - Word character (equivalent to <tt>[[:alnum:]_]</tt>)</li>
|
|
<li><tt>W</tt> - Non-word character (equivalent to <tt>[^[:alnum:]_]</tt>)</li>
|
|
</ul>
|
|
|
|
|
|
<h3>Literals</h3>
|
|
<a name="literal"></a>
|
|
|
|
<table bgcolor="#e0e0f0" cellpadding="10">
|
|
<tr><td>
|
|
<pre>
|
|
<i>literal</i> ::= <i>ordinary-character</i>
|
|
| <b>"\x"</b> [<b>"1"</b>-<b>"9"</b> <b>"a"-<b>"f"</b> <b>"A"</b>-<b>"F"</b>]{0,2}
|
|
| <b>"\x{"</b> [<b>"1"</b>-<b>"9"</b> <b>"a"-<b>"f"</b> <b>"A"</b>-<b>"F"</b>]* <b>"}"</b>
|
|
| <b>"\"</b> <i>character</i>
|
|
</pre>
|
|
</td></tr>
|
|
</table>
|
|
<p>
|
|
A literal is either an ordinary character (a character that has no
|
|
other significance in the context), an 8 bit hexadecimal encoded
|
|
character (e.g. <tt>\x1B</tt>), a wide hexadecimal encoded character
|
|
(e.g. <tt>\x{263a}</tt>), or an escaped character. An escaped
|
|
character is a <tt>\</tt> followed by any character, and matches that
|
|
character. Escaping can be used to match characters which have a
|
|
special meaning in regexp syntax. A <tt>\</tt> cannot be the last
|
|
character of an ERE. Escaping also allows you to include a few
|
|
non-printable characters in the regular expression. These special
|
|
escape sequences include:
|
|
</p>
|
|
|
|
<ul>
|
|
<li><tt>\a</tt> - Bell character (ASCII code 7)
|
|
<li><tt>\e</tt> - Escape character (ASCII code 27)
|
|
<li><tt>\f</tt> - Form-feed character (ASCII code 12)
|
|
<li><tt>\n</tt> - New-line/line-feed character (ASCII code 10)
|
|
<li><tt>\r</tt> - Carriage return character (ASCII code 13)
|
|
<li><tt>\t</tt> - Horizontal tab character (ASCII code 9)
|
|
</ul>
|
|
|
|
<p>
|
|
An ordinary character is just a single character with no other
|
|
significance, and matches that character. A <tt>{</tt> followed by
|
|
something else than a digit is considered an ordinary character.
|
|
</p>
|
|
|
|
|
|
<h3>Back references</h3>
|
|
<a name="backref"></a>
|
|
|
|
<table bgcolor="#e0e0f0" cellpadding="10">
|
|
<tr><td>
|
|
<pre>
|
|
<i>back-reference</i> ::= <b>"\"</b> [<b>"1"</b>-<b>"9"</b>]
|
|
</pre>
|
|
</td></tr>
|
|
</table>
|
|
<p>
|
|
A back reference is a backslash followed by a single non-zero decimal
|
|
digit <i>d</i>. It matches the same sequence of characters
|
|
matched by the <i>d</i>th parenthesized subexpression.
|
|
</p>
|
|
|
|
<p>
|
|
Back references are not defined for POSIX EREs (for BREs they are),
|
|
but many matchers, including TRE, implement back references for both
|
|
EREs and BREs.
|
|
</p>
|
|
|
|
<h3>Options</h3>
|
|
<a name="options"></a>
|
|
<table bgcolor="#e0e0f0" cellpadding="10">
|
|
<tr><td>
|
|
<pre>
|
|
<i>options</i> ::= [<b>"i" "n" "r" "U"</b>]* (<b>"-"</b> [<b>"i" "n" "r" "U"</b>]*)?
|
|
</pre>
|
|
</td></tr>
|
|
</table>
|
|
|
|
Options allow compile time options to be turned on/off for particular parts of the
|
|
regular expression. The options equate to several compile time options specified to
|
|
the regcomp API function. If the option is specified in the first section, it is
|
|
turned on. If it is specified in the second section (after the <tt>-</tt>), it is
|
|
turned off.
|
|
<ul>
|
|
<li>i - Case insensitive.
|
|
<li>n - Forces special handling of the new line character. See the REG_NEWLINE flag in
|
|
the <a href="tre-api.html">API Manual</a>.
|
|
<li>r - Causes the regex to be matched in a right associative manner rather than the normal
|
|
left associative manner.
|
|
<li>U - Forces repetition operators to be non-greedy unless a <tt>?</tt> is appended.
|
|
</ul>
|
|
<h2>BRE Syntax</h2>
|
|
|
|
<p>
|
|
The obsolete basic regexp (BRE) syntax differs from the ERE syntax as
|
|
follows:
|
|
</p>
|
|
|
|
<ul>
|
|
<li><tt>|</tt> is an ordinary character, and there is no equivalent
|
|
for its functionality. <tt>+</tt>, and <tt>?</tt> are ordinary
|
|
characters.</li>
|
|
<li>The delimiters for bounds are <tt>\{</tt> and <tt>\}</tt>, with
|
|
<tt>{</tt> and <tt>}</tt> by themselves ordinary characters.</li>
|
|
<li>The parentheses for nested subexpressions are <tt>\(</tt> and
|
|
<tt>\)</tt>, with <tt>(</tt> and <tt>)</tt> by themselves ordinary
|
|
characters.</li>
|
|
<li><tt>^</tt> is an ordinary character except at the beginning of the
|
|
RE or the beginning of a parenthesized subexpression. Similarly,
|
|
<tt>$</tt> is an ordinary character except at the end of the
|
|
RE or the end of a parenthesized subexpression.</li>
|
|
</ul>
|