123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213 |
- <?xml version="1.0" standalone="yes"?>
- <!DOCTYPE library PUBLIC "-//Boost//DTD BoostBook XML V1.0//EN"
- "http://www.boost.org/tools/boostbook/dtd/boostbook.dtd"
- [
- <!ENTITY % entities SYSTEM "program_options.ent" >
- %entities;
- ]>
- <section id="program_options.design">
- <title>Design Discussion</title>
- <para>This section focuses on some of the design questions.
- </para>
- <section id="program_options.design.unicode">
- <title>Unicode Support</title>
- <para>Unicode support was one of the features specifically requested
- during the formal review. Throughout this document "Unicode support" is
- a synonym for "wchar_t" support, assuming that "wchar_t" always uses
- Unicode encoding. Also, when talking about "ascii" (in lowercase) we'll
- not mean strict 7-bit ASCII encoding, but rather "char" strings in local
- 8-bit encoding.
- </para>
- <para>
- Generally, "Unicode support" can mean
- many things, but for the program_options library it means that:
- <itemizedlist>
- <listitem>
- <para>Each parser should accept either <code>char*</code>
- or <code>wchar_t*</code>, correctly split the input into option
- names and option values and return the data.
- </para>
- </listitem>
- <listitem>
- <para>For each option, it should be possible to specify whether the conversion
- from string to value uses ascii or Unicode.
- </para>
- </listitem>
- <listitem>
- <para>The library guarantees that:
- <itemizedlist>
- <listitem>
- <para>ascii input is passed to an ascii value without change
- </para>
- </listitem>
- <listitem>
- <para>Unicode input is passed to a Unicode value without change</para>
- </listitem>
- <listitem>
- <para>ascii input passed to a Unicode value, and Unicode input
- passed to an ascii value will be converted using a codecvt
- facet (which may be specified by the user).
- </para>
- </listitem>
- </itemizedlist>
- </para>
- </listitem>
- </itemizedlist>
- </para>
- <para>The important point is that it's possible to have some "ascii
- options" together with "Unicode options". There are two reasons for
- this. First, for a given type you might not have the code to extract the
- value from Unicode string and it's not good to require that such code be written.
- Second, imagine a reusable library which has some options and exposes
- options description in its interface. If <emphasis>all</emphasis>
- options are either ascii or Unicode, and the library does not use any
- Unicode strings, then the author is likely to use ascii options, making
- the library unusable inside Unicode
- applications. Essentially, it would be necessary to provide two versions
- of the library -- ascii and Unicode.
- </para>
- <para>Another important point is that ascii strings are passed though
- without modification. In other words, it's not possible to just convert
- ascii to Unicode and process the Unicode further. The problem is that the
- default conversion mechanism -- the <code>codecvt</code> facet -- might
- not work with 8-bit input without additional setup.
- </para>
- <para>The Unicode support outlined above is not complete. For example, we
- don't support Unicode option names. Unicode support is hard and
- requires a Boost-wide solution. Even comparing two arbitrary Unicode
- strings is non-trivial. Finally, using Unicode in option names is
- related to internationalization, which has it's own
- complexities. E.g. if option names depend on current locale, then all
- program parts and other parts which use the name must be
- internationalized too.
- </para>
- <para>The primary question in implementing the Unicode support is whether
- to use templates and <code>std::basic_string</code> or to use some
- internal encoding and convert between internal and external encodings on
- the interface boundaries.
- </para>
- <para>The choice, mostly, is between code size and execution
- speed. A templated solution would either link library code into every
- application that uses the library (thereby making shared library
- impossible), or provide explicit instantiations in the shared library
- (increasing its size). The solution based on internal encoding would
- necessarily make conversions in a number of places and will be somewhat slower.
- Since speed is generally not an issue for this library, the second
- solution looks more attractive, but we'll take a closer look at
- individual components.
- </para>
- <para>For the parsers component, we have three choices:
- <itemizedlist>
- <listitem>
- <para>Use a fully templated implementation: given a string of a
- certain type, a parser will return a &parsed_options; instance
- with strings of the same type (i.e. the &parsed_options; class
- will be templated).</para>
- </listitem>
- <listitem>
- <para>Use internal encoding: same as above, but strings will be converted to and
- from the internal encoding.</para>
- </listitem>
- <listitem>
- <para>Use and partly expose the internal encoding: same as above,
- but the strings in the &parsed_options; instance will be in the
- internal encoding. This might avoid a conversion if
- &parsed_options; instance is passed directly to other components,
- but can be also dangerous or confusing for a user.
- </para>
- </listitem>
- </itemizedlist>
- </para>
- <para>The second solution appears to be the best -- it does not increase
- the code size much and is cleaner than the third. To avoid extra
- conversions, the Unicode version of &parsed_options; can also store
- strings in internal encoding.
- </para>
- <para>For the options descriptions component, we don't have much
- choice. Since it's not desirable to have either all options use ascii or all
- of them use Unicode, but rather have some ascii and some Unicode options, the
- interface of the &value_semantic; must work with both. The only way is
- to pass an additional flag telling if strings use ascii or internal encoding.
- The instance of &value_semantic; can then convert into some
- other encoding if needed.
- </para>
- <para>For the storage component, the only affected function is &store;.
- For Unicode input, the &store; function should convert the value to the
- internal encoding. It should also inform the &value_semantic; class
- about the used encoding.
- </para>
- <para>Finally, what internal encoding should we use? The
- alternatives are:
- <code>std::wstring</code> (using UCS-4 encoding) and
- <code>std::string</code> (using UTF-8 encoding). The difference between
- alternatives is:
- <itemizedlist>
- <listitem>
- <para>Speed: UTF-8 is a bit slower</para>
- </listitem>
- <listitem>
- <para>Space: UTF-8 takes less space when input is ascii</para>
- </listitem>
- <listitem>
- <para>Code size: UTF-8 requires additional conversion code. However,
- it allows one to use existing parsers without converting them to
- <code>std::wstring</code> and such conversion is likely to create a
- number of new instantiations.
- </para>
- </listitem>
- </itemizedlist>
- There's no clear leader, but the last point seems important, so UTF-8
- will be used.
- </para>
- <para>Choosing the UTF-8 encoding allows the use of existing parsers,
- because 7-bit ascii characters retain their values in UTF-8,
- so searching for 7-bit strings is simple. However, there are
- two subtle issues:
- <itemizedlist>
- <listitem>
- <para>We need to assume the character literals use ascii encoding
- and that inputs use Unicode encoding.</para>
- </listitem>
- <listitem>
- <para>A Unicode character (say '=') can be followed by 'composing
- character' and the combination is not the same as just '=', so a
- simple search for '=' might find the wrong character.
- </para>
- </listitem>
- </itemizedlist>
- Neither of these issues appear to be critical in practice, since ascii is
- almost universal encoding and since composing characters following '=' (and
- other characters with special meaning to the library) are not likely to appear.
- </para>
- </section>
- </section>
- <!--
- Local Variables:
- mode: xml
- sgml-indent-data: t
- sgml-parent-document: ("program_options.xml" "section")
- sgml-set-face: t
- End:
- -->
|