
  1. [/
  2. Copyright 2006-2007 John Maddock.
  3. Distributed under the Boost Software License, Version 1.0.
  4. (See accompanying file LICENSE_1_0.txt or copy at
  5. http://www.boost.org/LICENSE_1_0.txt).
  6. ]
  7. [section:basic_syntax POSIX Basic Regular Expression Syntax]
  8. [h3 Synopsis]
  9. The POSIX-Basic regular expression syntax is used by the Unix utility `sed`,
  10. and variations are used by `grep` and `emacs`. You can construct POSIX
  11. basic regular expressions in Boost.Regex by passing the flag `basic` to the
  12. regex constructor (see [syntax_option_type]), for example:
  13. // e1 is a case-sensitive POSIX-Basic expression:
  14. boost::regex e1(my_expression, boost::regex::basic);
  15. // e2 is a case-insensitive POSIX-Basic expression:
  16. boost::regex e2(my_expression, boost::regex::basic|boost::regex::icase);
  17. [#boost_regex.posix_basic][h3 POSIX Basic Syntax]
  18. In POSIX-Basic regular expressions, all characters match themselves except
  19. for the following special characters:
  20. [pre .\[\\*^$]
  21. [h4 Wildcard:]
  22. The single character '.' when used outside of a character set will match any
  23. single character except:
  24. * The NULL character when the flag `match_not_dot_null` is passed to the
  25. matching algorithms.
  26. * The newline character when the flag `match_not_dot_newline` is passed to
  27. the matching algorithms.
  28. [h4 Anchors:]
  29. A '^' character shall match the start of a line when used as the first
  30. character of an expression, or the first character of a sub-expression.
  31. A '$' character shall match the end of a line when used as the last
  32. character of an expression, or the last character of a sub-expression.
  33. [h4 Marked sub-expressions:]
  34. A section beginning `\(` and ending `\)` acts as a marked sub-expression.
  35. Whatever matched the sub-expression is split out in a separate field by the
  36. matching algorithms. Marked sub-expressions can also be repeated, or
  37. referred-to by a back-reference.
  38. [h4 Repeats:]
  39. Any atom (a single character, a marked sub-expression, or a character class)
  40. can be repeated with the \* operator.
  41. For example `a*` will match the letter 'a' repeated zero or more
  42. times (an atom repeated zero times matches an empty string), so the
  43. expression `a*b` will match any of the following:
  44. [pre
  45. b
  46. ab
  47. aaaaaaaab
  48. ]
  49. An atom can also be repeated with a bounded repeat:
  50. `a\{n\}` Matches 'a' repeated exactly n times.
  51. `a\{n,\}` Matches 'a' repeated n or more times.
  52. `a\{n, m\}` Matches 'a' repeated between n and m times inclusive.
  53. For example:
  54. [pre ^a\{2,3\}$]
  55. Will match either of:
  56. [pre
  57. aa
  58. aaa
  59. ]
  60. But neither of:
  61. [pre
  62. a
  63. aaaa
  64. ]
  65. It is an error to use a repeat operator if the preceding construct cannot be
  66. repeated, for example:
  67. [pre a\(*\)]
  68. Will raise an error, as there is nothing for the \* operator to be applied to.
  69. [h4 Back references:]
  70. An escape character followed by a digit /n/, where /n/ is in the range 1-9,
  71. matches the same string that was matched by sub-expression /n/. For example
  72. the expression:
  73. [pre ^\\(a\*\\)\[\^a\]\*\\1$]
  74. Will match the string:
  75. [pre aaabbaaa]
  76. But not the string:
  77. [pre aaabba]
  78. [h4 Character sets:]
  79. A character set is a bracket-expression starting with \[ and ending with \];
  80. it defines a set of characters, and matches any single character that is a
  81. member of that set.
  82. A bracket expression may contain any combination of the following:
  83. [h5 Single characters:]
  84. For example `[abc]`, will match any of the characters 'a', 'b', or 'c'.
  85. [h5 Character ranges:]
  86. For example `[a-c]` will match any single character in the range 'a' to 'c'.
  87. By default, for POSIX-Basic regular expressions, a character /x/ is within the
  88. range /y/ to /z/, if it collates within that range; this results in
  89. locale specific behavior. This behavior can be turned off by unsetting
  90. the `collate` option flag when constructing the regular expression
  91. - in which case whether a character appears within
  92. a range is determined by comparing the code points of the characters only.
  93. [h5 Negation:]
  94. If the bracket-expression begins with the ^ character, then it matches the
  95. complement of the characters it contains, for example `[^a-c]` matches
  96. any character that is not in the range a-c.
  97. [h5 Character classes:]
  98. An expression of the form `[[:name:]]` matches the named character class "name",
  99. for example `[[:lower:]]` matches any lower case character.
  100. See [link boost_regex.syntax.character_classes character class names].
  101. [h5 Collating Elements:]
  102. An expression of the form `[[.col.]]` matches the collating element /col/.
  103. A collating element is any single character, or any sequence of
  104. characters that collates as a single unit. Collating elements may also
  105. be used as the end point of a range, for example: `[[.ae.]-c]` matches
  106. the character sequence "ae", plus any single character in the range "ae"-c,
  107. assuming that "ae" is treated as a single collating element in the current locale.
  108. Collating elements may be used in place of escapes (which are not
  109. normally allowed inside character sets), for example `[[.^.]abc]` would
  110. match any one of the characters 'a', 'b', 'c' or '^'.
  111. As an extension, a collating element may also be specified via its
  112. symbolic name, for example:
  113. [pre \[\[\.NUL\.\]\]]
  114. matches a 'NUL' character.
  115. See [link boost_regex.syntax.collating_names collating element names].
  116. [h5 Equivalence classes:]
  117. An expression of the form `[[=col=]]`, matches any character or collating
  118. element whose primary sort key is the same as that for collating element
  119. /col/, as with collating elements the name /col/ may be a
  120. [link boost_regex.syntax.collating_names collating symbolic name].
  121. A primary sort key is one that ignores case, accentuation, or
  122. locale-specific tailorings; so for example `[[=a=]]` matches any of
  123. the characters: a, '''À''', '''Á''', '''Â''',
  124. '''Ã''', '''Ä''', '''Å''', A, '''à''', '''á''',
  125. '''â''', '''ã''', '''ä''' and '''å'''.
  126. Unfortunately implementation of this is reliant on the platform's
  127. collation and localisation support; this feature can not be relied
  128. upon to work portably across all platforms, or even all locales on one platform.
  129. [h5 Combinations:]
  130. All of the above can be combined in one character set declaration, for
  131. example: `[[:digit:]a-c[.NUL.]]`.
  132. [h4 Escapes]
  133. With the exception of the escape sequences \\{, \\}, \\(, and \\),
  134. which are documented above, an escape followed by any character matches
  135. that character. This can be used to make the special characters
  136. [pre .\[\\\*^$]
  137. "ordinary". Note that the escape character loses its special meaning
  138. inside a character set, so `[\^]` will match either a literal '\\' or a '^'.
  139. [h3 What Gets Matched]
  140. When there is more than one way to match a regular expression, the
  141. "best" possible match is obtained using the
  142. [link boost_regex.syntax.leftmost_longest_rule leftmost-longest rule].
  143. [h3 Variations]
  144. [#boost_regex.grep_syntax][h4 Grep]
  145. When an expression is compiled with the flag `grep` set, then the
  146. expression is treated as a newline separated list of
  147. [link boost_regex.posix_basic POSIX-Basic expressions];
  148. a match is found if any of the expressions in the list match, for example:
  149. boost::regex e("abc\ndef", boost::regex::grep);
  150. will match either of the [link boost_regex.posix_basic POSIX-Basic expressions]
  151. "abc" or "def".
  152. As its name suggests, this behavior is consistent with the Unix utility grep.
  153. [h4 emacs]
  154. In addition to the [link boost_regex.posix_basic POSIX-Basic features]
  155. the following characters are also special:
  156. [table
  157. [[Character][Description]]
  158. [[+][repeats the preceding atom one or more times.]]
  159. [[?][repeats the preceding atom zero or one times.]]
  160. [[*?][A non-greedy version of *.]]
  161. [[+?][A non-greedy version of +.]]
  162. [[??][A non-greedy version of ?.]]
  163. ]
  164. And the following escape sequences are also recognised:
  165. [table
  166. [[Escape][Description]]
  167. [[\\|][specifies an alternative.]]
  168. [[\\(?: ... \\)][is a non-marking grouping construct: it allows you to lexically group something without creating an extra sub-expression.]]
  169. [[\\w][matches any word character.]]
  170. [[\\W][matches any non-word character.]]
  171. [[\\sx][matches any character in the syntax group x. The following
  172. emacs groupings are supported: 's', ' ', '_', 'w', '.', ')', '(', '"', '\\'', '>' and '<'. Refer to the emacs docs for details.]]
  173. [[\\Sx][matches any character not in the syntax grouping x.]]
  174. [[\\c and \\C][These are not supported.]]
  175. [[\\`][matches zero characters only at the start of a buffer (or string being matched).]]
  176. [[\\'][matches zero characters only at the end of a buffer (or string being matched).]]
  177. [[\\b][matches zero characters at a word boundary.]]
  178. [[\\B][matches zero characters, not at a word boundary.]]
  179. [[\\<][matches zero characters only at the start of a word.]]
  180. [[\\>][matches zero characters only at the end of a word.]]
  181. ]
  182. Finally, you should note that emacs style regular expressions are matched
  183. according to the
  184. [link boost_regex.syntax.perl_syntax.what_gets_matched Perl "depth first search" rules].
  185. Emacs expressions are
  186. matched this way because they contain Perl-like extensions that do not
  187. interact well with the
  188. [link boost_regex.syntax.leftmost_longest_rule POSIX-style leftmost-longest rule].
  189. [h3 Options]
  190. There are a [link boost_regex.ref.syntax_option_type.syntax_option_type_basic variety of flags] that may be combined with the `basic` and `grep`
  191. options when constructing the regular expression, in particular note
  192. that the
  193. [link boost_regex.ref.syntax_option_type.syntax_option_type_basic `newline_alt`, `no_char_classes`, `no_intervals`, `bk_plus_qm`
  194. and `bk_vbar`] options all alter the syntax, while the
  195. [link boost_regex.ref.syntax_option_type.syntax_option_type_basic `collate` and `icase` options] modify how the case and locale sensitivity
  196. are to be applied.
  197. [h3 References]
  198. [@http://www.opengroup.org/onlinepubs/000095399/basedefs/xbd_chap09.html IEEE Std 1003.1-2001, Portable Operating System Interface (POSIX), Base Definitions and Headers, Section 9, Regular Expressions (FWD.1).]
  199. [@http://www.opengroup.org/onlinepubs/000095399/utilities/grep.html IEEE Std 1003.1-2001, Portable Operating System Interface (POSIX), Shells and Utilities, Section 4, Utilities, grep (FWD.1).]
  200. [@http://www.gnu.org/software/emacs/ Emacs Version 21.3.]
  201. [endsect]