scanner.html 23 KB

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288
  1. <html>
  2. <head>
  3. <title>The Scanner and Parsing</title>
  4. <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
  5. <link rel="stylesheet" href="theme/style.css" type="text/css">
  6. </head>
  7. <body>
  8. <table width="100%" border="0" background="theme/bkd2.gif" cellspacing="2">
  9. <tr>
  10. <td width="10">
  11. </td>
  12. <td width="85%">
  13. <font size="6" face="Verdana, Arial, Helvetica, sans-serif"><b>The Scanner and Parsing</b></font>
  14. </td>
  15. <td width="112"><a href="http://spirit.sf.net"><img src="theme/spirit.gif" width="112" height="48" align="right" border="0"></a></td>
  16. </tr>
  17. </table>
  18. <br>
  19. <table border="0">
  20. <tr>
  21. <td width="10"></td>
  22. <td width="30"><a href="../index.html"><img src="theme/u_arr.gif" border="0"></a></td>
  23. <td width="30"><a href="directives.html"><img src="theme/l_arr.gif" border="0"></a></td>
  24. <td width="30"><a href="grammar.html"><img src="theme/r_arr.gif" border="0"></a></td>
  25. </tr>
  26. </table>
  27. <p>The <b>scanner</b>'s task is to feed the sequential input data stream to the
  28. parser. The scanner extracts data from the input, parceling, potentially modifying
  29. or filtering, and then finally relegating the result to individual parser elements
  30. on demand until the input is exhausted. The scanner is composed of two STL conforming
  31. forward iterators, first and last, where first is held by reference and last,
  32. by value. The first iterator is held by reference to allow it to be re-positioned.
  33. The following diagram illustrates what's happening:</p>
  34. <table width="62%" border="0" align="center">
  35. <tr>
  36. <td><img src="theme/scanner1.png"></td>
  37. </tr>
  38. </table>
  39. <p>The scanner manages various aspects of the parsing process through a set of
  40. policies. There are three sets of policies that govern:</p>
  41. <blockquote>
  42. <p><img src="theme/bullet.gif" width="12" height="12"> Iteration and filtering<br>
  43. <img src="theme/bullet.gif" width="12" height="12"> Recognition and matching<br>
  44. <img src="theme/bullet.gif" width="12" height="12"> Handling semantic actions</p>
  45. </blockquote>
  46. <p>These policies are mostly hidden from view and users generally need not know
  47. about them. Advanced users might however provide their own policies that override
  48. the ones that are already in place to fine tune the parsing process
  49. to fit their own needs. We shall see how this can be done. This will be covered
  50. in further detail later.</p>
  51. <p>The <tt>scanner</tt> is a template class expecting two parameters: <tt>IteratorT</tt>,
  52. the iterator type and <tt>PoliciesT</tt>, its set of policies. <tt>IteratorT</tt>
  53. defaults to <tt>char const*</tt> while <tt>PoliciesT</tt> defaults to <tt>scanner_policies&lt;&gt;</tt>,
  54. a predefined set of scanner policies that we can use straight out of the box.</p>
  55. <pre><code><font color="#000000"><span class=keyword> template</span><span class=special>&lt;
  56. </span><span class=keyword>typename </span><span class=identifier>IteratorT </span><span class=special>= </span><span class=keyword>char </span><span class=keyword>const</span><span class=special>*,
  57. </span><span class=keyword>typename </span><span class=identifier>PoliciesT </span><span class=special>= </span><span class=identifier>scanner_policies</span><span class=special>&lt;&gt; </span><span class=special>&gt;
  58. </span><span class=keyword>class </span><span class=identifier>scanner</span><span class=special>;</span></font></code></pre>
  59. <p>Spirit uses the same iterator concepts and interface formally defined by the
  60. C++ Standard Template Library (STL). We can use iterators supplied by STL's
  61. containers (e.g. <tt>list</tt>, <tt>vector</tt>, <tt>string</tt>, etc.) as is,
  62. or perhaps write our own. Iterators can be as simple as a pointer (e.g. <tt>char
  63. const<span class="operators">*</span></tt>). At the other end of the spectrum,
  64. iterators can be quite complex; for instance, an iterator adapter that wraps
  65. a lexer such as LEX.</p>
  66. <h2>The Free Parse Functions</h2>
  67. <p>The framework provides a couple of free functions to make parsing a snap. These
  68. parser functions have two forms. The first form works on the <b>character level</b>.
  69. The second works on the <b>phrase level</b> and asks for a <b>skip parser</b>.</p>
  70. <p>The <b>skip parser</b> is just about any parser primitive or composite. Its
  71. purpose is to move the scanner's <tt>first</tt> iterator to valid tokens by
  72. skipping white spaces. In C for instance, the tab <tt class="quotes">'\t'</tt>,
  73. the newline <tt class="quotes">'\n'</tt>, return <tt><span class="quotes">'\r'</span></tt>,
  74. space <tt class="quotes">' '</tt> and characters inside comments <tt class="quotes">/*...*/</tt>
  75. are considered as white spaces.</p>
  76. <p><b>Character level parsing</b></p>
  77. <pre><code><font color="#000000"><span class=special> </span><span class=keyword>template </span><span class=special>&lt;</span><span class=keyword>typename </span><span class=identifier>IteratorT</span><span class=special>, </span><span class=keyword>typename </span><span class=identifier>DerivedT</span><span class=special>&gt;
  78. </span><span class=identifier>parse_info</span><span class=special>&lt;</span><span class=identifier>IteratorT</span><span class=special>&gt;
  79. </span><span class=identifier>parse
  80. </span><span class=special>(
  81. </span><span class=identifier>IteratorT </span><span class=keyword>const</span><span class=special>& </span><span class=identifier>first</span><span class=special>,
  82. </span><span class=identifier>IteratorT </span><span class=keyword>const</span><span class=special>& </span><span class=identifier>last</span><span class=special>,
  83. </span><span class=identifier>parser</span><span class=special>&lt;</span><span class=identifier>DerivedT</span><span class=special>&gt; </span><span class=keyword>const</span><span class=special>& </span><span class=identifier>p
  84. </span><span class=special>);</span></font></code></pre>
  85. <pre><code><font color="#000000"><span class=special> </span><span class=keyword>template </span><span class=special>&lt;</span><span class=keyword>typename </span><span class=identifier>CharT</span><span class=special>, </span><span class=keyword>typename </span><span class=identifier>DerivedT</span><span class=special>&gt;
  86. </span><span class=identifier>parse_info</span><span class=special>&lt;</span><span class=identifier>CharT </span><span class=keyword>const</span><span class=special>*&gt;
  87. </span><span class=identifier>parse
  88. </span><span class=special>(
  89. </span><span class=identifier>CharT </span><span class=keyword>const</span><span class=special>* </span><span class=identifier>str</span><span class=special>,
  90. </span><span class=identifier>parser</span><span class=special>&lt;</span><span class=identifier>DerivedT</span><span class=special>&gt; </span><span class=keyword>const</span><span class=special>& </span><span class=identifier>p
  91. </span><span class=special>);</span></font></code></pre>
  92. <p>There are two variants. The first variant accepts a <tt>first</tt>, <tt>last</tt>
  93. iterator pair like you do STL algorithms. The second variant accepts a null
  94. terminated string. The last argument is a parser <tt>p</tt> which will be used
  95. to parse the input.</p>
  96. <p><b>Phrase level parsing</b></p>
  97. <pre><code><font color="#000000"><span class=special> </span><span class=keyword>template </span><span class=special>&lt;</span><span class=keyword>typename </span><span class=identifier>IteratorT</span><span class=special>, </span><span class=keyword>typename </span><span class=identifier>ParserT</span><span class=special>, </span><span class=keyword>typename </span><span class=identifier>SkipT</span><span class=special>&gt;
  98. </span><span class=identifier>parse_info</span><span class=special>&lt;</span><span class=identifier>IteratorT</span><span class=special>&gt;
  99. </span><span class=identifier>parse
  100. </span><span class=special>(
  101. </span><span class=identifier>IteratorT </span><span class=keyword>const</span><span class=special>& </span><span class=identifier>first</span><span class=special>,
  102. </span><span class=identifier>IteratorT </span><span class=keyword>const</span><span class=special>& </span><span class=identifier>last</span><span class=special>,
  103. </span><span class=identifier>parser</span><span class=special>&lt;</span><span class=identifier>ParserT</span><span class=special>&gt; </span><span class=keyword>const</span><span class=special>& </span><span class=identifier>p</span><span class=special>,
  104. </span><span class=identifier>parser</span><span class=special>&lt;</span><span class=identifier>SkipT</span><span class=special>&gt; </span><span class=keyword>const</span><span class=special>& </span><span class=identifier>skip
  105. </span><span class=special>);</span></font></code></pre>
  106. <pre><code><font color="#000000"><span class=special> </span><span class=keyword>template </span><span class=special>&lt;</span><span class=keyword>typename </span><span class=identifier>CharT</span><span class=special>, </span><span class=keyword>typename </span><span class=identifier>ParserT</span><span class=special>, </span><span class=keyword>typename </span><span class=identifier>SkipT</span><span class=special>&gt;
  107. </span><span class=identifier>parse_info</span><span class=special>&lt;</span><span class=identifier>CharT </span><span class=keyword>const</span><span class=special>*&gt;
  108. </span><span class=identifier>parse
  109. </span><span class=special>(
  110. </span><span class=identifier>CharT </span><span class=keyword>const</span><span class=special>* </span><span class=identifier>str</span><span class=special>,
  111. </span><span class=identifier>parser</span><span class=special>&lt;</span><span class=identifier>ParserT</span><span class=special>&gt; </span><span class=keyword>const</span><span class=special>& </span><span class=identifier>p</span><span class=special>,
  112. </span><span class=identifier>parser</span><span class=special>&lt;</span><span class=identifier>SkipT</span><span class=special>&gt; </span><span class=keyword>const</span><span class=special>& </span><span class=identifier>skip
  113. </span><span class=special>);</span></font></code></pre>
  114. <p>Like above, there are two variants. The first variant accepts a <tt>first</tt>,
  115. <tt>last</tt> iterator pair like you do STL algorithms. The second variant accepts
  116. a null terminated string. The argument <tt>p</tt> is the parser which will be
  117. used to parse the input. The last argument <tt>skip</tt> is the skip parser.</p>
  118. <p><b>The parse_info structure</b></p>
  119. <p>The functions above return a <tt>parse_info</tt> structure parameterized by
  120. the iterator type passed in. The parse_info struct has these members:</p>
  121. <table width="90%" border="0" align="center">
  122. <tr>
  123. <td colspan="2" class="table_title"><b>parse_info</b></td>
  124. </tr>
  125. <tr>
  126. <td width="14%" class="table_cells"><b>stop</b></td>
  127. <td width="86%" class="table_cells">Points to the final parse position (i.e
  128. The parser recognized and processed the input up to this point)</td>
  129. </tr>
  130. <tr>
  131. <td width="14%" class="table_cells"><b>hit</b></td>
  132. <td width="86%" class="table_cells">True if parsing is successful. This may
  133. be full: the parser consumed all the input, or partial: the parser consumed
  134. only a portion of the input.</td>
  135. </tr>
  136. <tr>
  137. <td width="14%" class="table_cells"><b>full</b></td>
  138. <td width="86%" class="table_cells">True when we have a full match (i.e The
  139. parser consumed all the input).</td>
  140. </tr>
  141. <tr>
  142. <td width="14%" class="table_cells"><b>length</b></td>
  143. <td width="86%" class="table_cells">The number of characters consumed by the
  144. parser. This is valid only if we have a successful match (either partial
  145. or full). </td>
  146. </tr>
  147. </table>
  148. <h2><a name="phrase_scanner_t" id="phrase_scanner_t"></a><img src="theme/lens.gif" width="15" height="16">
  149. The phrase_scanner_t and wide_phrase_scanner_t</h2>
  150. <p>For convenience, Spirit declares these typedefs:</p>
  151. <pre>
  152. <span class="keyword">typedef</span> scanner<span class="special">&lt;</span><span class="keyword">char const</span><span class="special">*,</span> unspecified<span class="special">&gt;</span> phrase_scanner_t<span class="special">;</span>
  153. <span class="keyword">typedef</span> scanner<span class="special">&lt;</span><span class="keyword">wchar_t const</span><span class="special">*,</span> <span class="identifier">unspecified</span><span class="special">&gt;</span> wide_phrase_scanner_t<span class="special">;</span>
  154. </pre>
  155. <p>These are the exact scanner types used by Spirit on calls to the parse function
  156. passing in a <tt>char const*</tt> (C string) or a <tt>wchar_t const*</tt> (wide
  157. string) as the first parameter and a <tt>space_p</tt> as skip-parser (the third
  158. parameter). For instance, we can use these typedefs to declare some rules. Example:</p>
  159. <pre> rule<span class="special">&lt;</span>phrase_scanner_t<span class="special">&gt; </span><span class="identifier">my_rule</span><span class="special">;
  160. </span><span class="identifier">parse</span><span class="special">(</span><span class="string">&quot;abrakadabra&quot;</span><span class="special">, </span><span class="identifier">my_rule</span><span class="special">,</span> <span class="identifier">space_p</span><span class="special">);</span></pre>
  161. <h2><img src="theme/lens.gif" width="15" height="16"> Direct parsing with Iterators</h2>
  162. <p>The free parse functions make it easy for us. By using them, we need not bother
  163. with the scanner intricacies. The free parse functions hide the dirty details.
  164. However, sometime in the future, we will need to get under the hood. It's nice
  165. that we know what we are dealing with when that need comes. We will need to
  166. go low-level and call the parser's parse member function directly. </p>
  167. <p>If we wish to work on the <b>character level</b>, the procedure is quite simple:</p>
  168. <pre><span class=identifier> </span><span class=identifier>scanner</span><span class=special>&lt;</span><span class=identifier>IteratorT</span><span class=special>&gt; </span><span class=identifier>scan</span><span class=special>(</span><span class=identifier>first</span><span class=special>, </span><span class=identifier>last</span><span class=special>);
  169. </span><span class=keyword>if </span><span class=special>(</span><span class=identifier>p</span><span class=special>.</span><span class=identifier>parse</span><span class=special>(</span><span class=identifier>scan</span><span class=special>))
  170. </span><span class=special>{
  171. </span><span class=comment>// Parsed successfully. If first == last, then we have
  172. // a full parse, the parser recognized the input in whole.
  173. </span><span class=special>}
  174. </span><span class=keyword>else
  175. </span><span class=special>{
  176. </span><span class=comment>// Parsing failure. The parser failed to recognize the input
  177. </span><span class=special>}</span></pre>
  178. <table width="80%" border="0" align="center">
  179. <tr>
  180. <td class="note_box"><img src="theme/alert.gif" width="16" height="16"> <strong>The
  181. scanner position on an unsucessful match</strong><br> <br>
  182. On a successful match, the input is advanced accordingly. But what happens
  183. on an unsuccessful match? Be warned. It might be intuitive to think that
  184. the scanner position is reset to its initial position prior to parsing.
  185. No, the position is not reset. On an unsuccessful match, the position of
  186. the scanner is <strong>undefined</strong>! Usually, it is positioned at
  187. the farthest point where the error was found somewhere down the recursive
  188. descent. If this behavior is not desired, you may need to position the scanner
  189. yourself. The <a href="numerics.html#scanner_save">example in the numerics
  190. chapter</a> illustrates how the scanner position can be saved and later
  191. restored.</td>
  192. </tr>
  193. </table>
  194. <p>Where <tt>p</tt> is the parser we want to use, and <tt>first</tt>/<tt>last</tt>
  195. are the iterator pairs referring to the input. We just create a scanner given
  196. the iterators. The scanner type we will use here uses the default <tt>scanner_policies&lt;&gt;</tt>.</p>
  197. <p>The situation is a bit more complex when we wish to work on the <b>phrase level</b>:</p>
  198. <pre><span class=special> </span><span class=keyword>typedef </span><span class=identifier>skip_parser_iteration_policy</span><span class=special>&lt;</span><span class=identifier>SkipT</span><span class=special>&gt; </span><span class=identifier>iter_policy_t</span><span class=special>;
  199. </span><span class=keyword>typedef </span><span class=identifier>scanner_policies</span><span class=special>&lt;</span><span class=identifier>iter_policy_t</span><span class=special>&gt; </span><span class=identifier>scanner_policies_t</span><span class=special>;
  200. </span><span class=keyword>typedef </span><span class=identifier>scanner</span><span class=special>&lt;</span><span class=identifier>IteratorT</span><span class=special>, </span><span class=identifier>scanner_policies_t</span><span class=special>&gt; </span><span class=identifier>scanner_t</span><span class=special>;
  201. </span><span class=special> </span><span class=identifier>iter_policy_t </span><span class=identifier>iter_policy</span><span class=special>(</span><span class=identifier>skip</span><span class=special>);
  202. </span><span class=identifier>scanner_policies_t </span><span class=identifier>policies</span><span class=special>(</span><span class=identifier>iter_policy</span><span class=special>);
  203. </span><span class=identifier>scanner_t </span><span class=identifier>scan</span><span class=special>(</span><span class=identifier>first</span><span class=special>, </span><span class=identifier>last</span><span class=special>, </span><span class=identifier>policies</span><span class=special>);
  204. </span>
  205. <span class=keyword>if </span><span class=special>(</span><span class=identifier>p</span><span class=special>.</span><span class=identifier>parse</span><span class=special>(</span><span class=identifier>scan</span><span class=special>))
  206. </span><span class=special>{
  207. </span><span class=comment>// Parsed successfully. If first == last, then we have
  208. // a full parse, the parser recognized the input in whole.
  209. </span><span class=special>}
  210. </span><span class=keyword>else
  211. </span><span class=special>{
  212. </span><span class=comment>// Parsing failure. The parser failed to recognize the input
  213. </span><span class=special>}</span></pre>
  214. <p>Where <tt>SkipT</tt> is the type of the skip-parser, <tt>skip</tt>. Again,
  215. <tt>p</tt> is the parser we want to use, and <tt>first</tt>/<tt>last</tt> are
  216. the iterator pairs referring to the input. Given a skip-parser type <tt>SkipT</tt>,
  217. <span class=identifier><tt>skip_parser_iteration_policy</tt></span> creates
  218. a scanner iteration policy that skips over portions that are recognized by the
  219. skip-parser. This may then be used to create a scanner. The <tt>scanner_policies</tt>
  220. class wraps all scanner related policies including the iteration policies.</p>
  221. <h2><a name="lexeme_scanner"></a>lexeme_scanner</h2>
  222. <p>When switching from phrase level to character level parsing, the <tt>lexeme_d</tt>
  223. (see <a href="directives.html">directives.html</a>) does its magic by disabling
  224. the skipping of white spaces. This is done by tweaking the <a href="scanner.html">scanner</a>.
  225. However, when we do this, all parsers inside the lexeme gets a transformed scanner
  226. type. This should not be a problem in most cases. However, when rules are called
  227. inside the <tt>lexeme_d</tt>, the compiler will choke if the rule does not have
  228. the proper scanner type. If a rule must be used inside a <tt>lexeme_d</tt>,
  229. the rule's type must be:</p>
  230. <pre> <span class=identifier>rule</span><span class=special>&lt;</span><span class=identifier>lexeme_scanner</span><span class="special">&lt;</span><span class=identifier>ScannerT</span><span class=special>&gt;::</span><span class="identifier">type</span><span class=special>&gt; </span>r<span class=special>;</span></pre>
  231. <p>where <span class=identifier><tt>ScannerT</tt></span> is the actual type of
  232. the scanner used. Take note that <tt>lexeme_scanner</tt> will only work for phrase level scanners. </p>
  233. <h2><a name="as_lower_scanner"></a>as_lower_scanner</h2>
  234. <p>Similarly, the <tt>as_lower_d</tt> does its work by filtering and converting
  235. all characters received from the scanner to lower case. This is also done by
  236. tweaking the <a href="scanner.html">scanner</a>. Then again, all parsers inside
  237. the <tt>as_lower_d</tt> gets a transformed scanner type. If a rule must be used
  238. inside a <tt>as_lower_d</tt>, the rule's type must be:</p>
  239. <pre> <span class=identifier>rule</span><span class=special>&lt;</span><span class=identifier>as_lower_scanner</span><span class="special">&lt;</span><span class=identifier>ScannerT</span><span class=special>&gt;::</span><span class="identifier">type</span><span class=special>&gt; </span>r<span class=special>;</span></pre>
  240. <p>where <span class=identifier><tt>ScannerT</tt></span> is the actual type of
  241. the scanner used. </p>
  242. <table width="80%" border="0" align="center">
  243. <tr>
  244. <td class="note_box"><img src="theme/bulb.gif" width="13" height="18"> See
  245. the techniques section for an <a href="techniques.html#multiple_scanner_support">example</a>
  246. of a <a href="grammar.html">grammar</a> using a <a href="rule.html#multiple_scanner_support">multiple
  247. scanner enabled rule</a>, <a href="scanner.html#lexeme_scanner">lexeme_scanner</a>
  248. and <a href="scanner.html#as_lower_scanner">as_lower_scanner.</a></td>
  249. </tr>
  250. </table>
  251. <h3><a name="no_actions_scanner"></a>no_actions_scanner</h3>
  252. <p>Again, <tt>no_actions_d</tt> directive tweaks the scanner to disable firing
  253. semantic actions. Like before, all parsers inside the <tt>no_actions_d</tt>
  254. gets a transformed scanner type. If a rule must be used inside a <tt>no_actions_d</tt>,
  255. the rule's type must be:</p>
  256. <pre> <span class=identifier>rule</span><span class=special>&lt;</span>no_actions_scanner<span class="special">&lt;</span><span class=identifier>ScannerT</span><span class=special>&gt;::</span><span class="identifier">type</span><span class=special>&gt; </span>r<span class=special>;</span></pre>
  257. <p>where <tt>ScannerT</tt> is the actual type of the scanner used. <span class=special></span></p>
  258. <table width="80%" border="0" align="center">
  259. <tr>
  260. <td class="note_box"><img src="theme/note.gif" width="16" height="16"> Be
  261. sure to add &quot;<tt>typename</tt>&quot; before <tt><span class=identifier><tt>lexeme_scanner</tt>,
  262. <tt>as_lower_scanner</tt></span></tt> and <tt>no_actions_scanner</tt> when
  263. these are used inside a template class or function.</td>
  264. </tr>
  265. </table>
  266. <p><img src="theme/lens.gif" width="15" height="16"> See <a href="../example/fundamental/no_actions.cpp">no_actions.cpp</a>. This is part of the Spirit distribution.</p>
  267. <table border="0">
  268. <tr>
  269. <td width="10"></td>
  270. <td width="30"><a href="../index.html"><img src="theme/u_arr.gif" border="0"></a></td>
  271. <td width="30"><a href="directives.html"><img src="theme/l_arr.gif" border="0"></a></td>
  272. <td width="30"><a href="grammar.html"><img src="theme/r_arr.gif" border="0"></a></td>
  273. </tr>
  274. </table>
  275. <br>
  276. <hr size="1">
  277. <p class="copyright">Copyright &copy; 1998-2003 Joel de Guzman<br>
  278. <br>
  279. <font size="2">Use, modification and distribution is subject to the Boost Software
  280. License, Version 1.0. (See accompanying file LICENSE_1_0.txt or copy at
  281. http://www.boost.org/LICENSE_1_0.txt)</font></p>
  282. <p>&nbsp;</p>
  283. <p>&nbsp;</p>
  284. </body>
  285. </html>