Re: [unicode] UTF-c

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Feb 21 2011 - 14:44:39 CST

    Don't use the proposed C code as a reference: it contains obvious
    buffer overflow problems (the necessary buffer length is computed
    incorrectly) and handles large files inefficiently (files don't
    need to be read fully into a single buffer), in addition to an
    unusual command-line syntax (not recommended because of its
    ambiguity). Anyway, this list is not the place for posting
    implementation code.

    Finally, the proposal is ill-conceived in the way it implements its
    BOM (one for each supported page): why not stick with U+FEFF, or use
    one of the unused scalar values (e.g. encoding each page number as a
    scalar >= 0x110000)? Remember that ASCII controls are reserved for
    something else and have many restrictions for use in portable
    plain text (notably MIME compatibility: don't use FS, GS, RS, US for
    that).

    And why do you in fact need an enumeration of codepage numbers, when
    all that is needed is to encode the page base as the scalar value of
    the first character of the selected page? For example, if all pages
    are aligned on the proposal's 64-character page boundary, the 21
    bits of a scalar reduce to 15 bits of page base. You can easily map
    any of these 15-bit values onto a base page selector taken from the
    many unused scalar values (>= 0x110000, which leaves plenty of
    available space, even with this encoding), without requiring any
    "magic" lookup table (and in this case you could even have a data
    converter select the best page automatically).
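
    To make that concrete, here is a minimal sketch in C (my own
    illustration, not code from the proposal) of how a 64-character-
    aligned page base could be mapped to and from a selector taken from
    the unused scalar space above 0x10FFFF; the names and the selector
    base value are purely hypothetical:

        #include <stdint.h>

        #define SELECTOR_BASE 0x110000u  /* first value above the valid scalars */

        /* Map a page base (rounded down to a 64-character boundary) to a
           selector outside the valid scalar range 0..0x10FFFF. */
        static uint32_t page_to_selector(uint32_t page_base)
        {
            return SELECTOR_BASE + (page_base >> 6);  /* 15-bit page index */
        }

        /* Recover the page base from a selector. */
        static uint32_t selector_to_page(uint32_t selector)
        {
            return (selector - SELECTOR_BASE) << 6;
        }

    No lookup table or registry of codepage numbers is involved: the
    page base itself is the identifier.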

    But note that non-ASCII bytes still require tests to see whether
    they are leading or trailing bytes of a multibyte sequence: ASCII
    bytes are self-synchronizing, and middle bytes are easy to process
    in the backward or forward direction to find the leading or trailing
    byte with a small, bounded number of tests, BUT a long run of
    characters encoded as 2-byte sequences requires an unbounded number
    of tests, until you find either a middle byte or an ASCII byte, just
    to know whether a given byte with the binary 10xxxxxx pattern is
    leading or trailing.
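
    For illustration only (this deliberately simplifies the proposal's
    actual byte layout and only assumes that 10xxxxxx bytes appear
    solely as the lead and trail bytes of 2-byte sequences), the
    classification needs a backward scan whose length is not bounded:

        #include <stdbool.h>
        #include <stddef.h>
        #include <stdint.h>

        static bool is_twobyte(uint8_t b) { return (b & 0xC0) == 0x80; }

        /* Returns true if buf[i] (assumed to match 10xxxxxx) is the LEAD
           byte of a 2-byte sequence.  In the worst case the loop walks
           back to the start of the buffer: this is the unbounded cost
           described above. */
        static bool twobyte_is_lead(const uint8_t *buf, size_t i)
        {
            size_t run = 0;  /* 10xxxxxx bytes immediately before buf[i] */
            while (i > 0 && is_twobyte(buf[i - 1])) {
                i--;
                run++;
            }
            /* After a synchronization point the 10xxxxxx bytes pair up as
               lead+trail, so an even run means buf[i] starts a new pair. */
            return (run % 2) == 0;
        }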

    But since the characters encoded as 2-byte sequences are limited to
    the U+0080..U+1080 range (in this proposal, when the selected page
    encodable as one byte does not fall within this range), this will
    not be a problem for texts written in the Latin script (due to the
    high frequency of ASCII letters), but it will be for other
    alphabetic scripts in this range (as long as there is no space,
    punctuation or control) if they do not use a page selection for the
    most frequent letters of their alphabet.

    Even if all scripts in this range use spaces and standard ASCII
    punctuation, this is still a problem for implementing fast searching
    algorithms (but the additional compression of the data, compared to
    UTF-8, would still improve data locality a bit for memory caches and
    I/O when handling large amounts of text to compare, and the extra
    tests would then certainly have a lower impact).

    But if I had one more suggestion to make, I would even have allowed
    splitting the 64-character page into two separate 32-character
    subpages (mappable individually), in order to support the range of
    1-byte codes used by C1 controls or by Windows codepages for very
    small scripts: here again, this only requires a few special codes
    mappable onto the many unused scalar values (0xD800..0xDFFF,
    0x110000..).
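
    As a purely hypothetical extension of the selector sketch above
    (none of this is in the proposal), each 32-character subpage could
    get its own selector, again taken from the unused values above
    0x10FFFF:

        #include <stdint.h>

        #define SUBPAGE_SELECTOR_BASE 0x120000u  /* hypothetical choice */

        /* Encode which half of the 64-code 1-byte area is being remapped
           (high_half = 0 or 1) together with the 32-character-aligned
           subpage base. */
        static uint32_t subpage_selector(uint32_t subpage_base, int high_half)
        {
            return SUBPAGE_SELECTOR_BASE
                 + ((uint32_t)(high_half & 1) << 16)
                 + (subpage_base >> 5);
        }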

    One of its points of interest is that (without using any base page
    selector) it can support the C0 controls, the US-ASCII printable
    characters, and the full alphabetic area of ISO-8859-1 as single
    bytes (instead of 2 bytes with UTF-8). And it can never be longer
    than UTF-8, and it also allows bidirectional parsing of the encoded
    stream (including from a random position, with very few tests to
    resynchronize in either direction).

    But I admit that this proposal has its merits: once corrected for
    the above deficiencies, it can correctly map ***ALL*** Unicode
    scalar values (not "code points", because UTFs are not supposed to
    support all code points, only those that have a scalar value, i.e.
    excluding all "surrogate" code points U+D800..U+DFFF for strict
    compatibility with UTF-16, which cannot represent them, but still
    including the code points assigned to non-characters such as
    U+FFFF).
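
    In code terms, the set of values that such a corrected UTF has to
    cover is just the standard Unicode definition of a scalar value
    (nothing specific to the proposal here):

        #include <stdbool.h>
        #include <stdint.h>

        /* A Unicode scalar value is any code point in 0..0x10FFFF except
           the surrogates U+D800..U+DFFF; non-characters such as U+FFFF
           are still scalar values and must round-trip. */
        static bool is_scalar_value(uint32_t cp)
        {
            return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
        }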

    For this to work effectively, you must absolutely DROP the special
    non-standard handling of FS/GS/RS/US (it is not even needed!), and
    replace it either with the existing standard BOM U+FEFF or, even
    better, with the encoding of a single leading page selector whose
    encoded scalar value does not fall within the standard Unicode
    scalar range (0..0xD7FF, 0xE000..0x10FFFF).

    And anyway it is also much simpler to understand and easier to
    implement correctly (unlike the sample code given here) than SCSU,
    and it is still very highly compressible with standard compression
    algorithms while allowing very fast in-memory processing in its
    decompressed encoded form:
    - a bit faster than UTF-8, as seen in my early benchmarks, for a
    small number of large texts such as pages in a Wiki database,
    - but a bit slower for a large number of small strings such as
    tabular data, because of the higher number of conditional branches
    when using a CPU with a 1-way instruction pipeline (not a problem
    with today's processors, which include a dozen parallel pipelines
    even in a single core, provided the compiled assembly code is
    correctly optimized and scheduled to make use of them when branch
    prediction cannot help much).

    Philippe.

    2011/2/20 suzuki toshiya <mpsuzuki@hiroshima-u.ac.jp>:
    > Doug Ewell wrote:
    >> <mpsuzuki at hiroshima dash u dot ac dot jp> wrote:
    >>
    >>> In your proposal, the maximum length of the coded character
    >>> is 4, it is less than UTF-8's max length. It's interesting
    >>> idea.
    >>
    >> What code sequences in UTF-8 that represent the modern coding space
    >> (ending at 0x10FFFF, not 0x7FFFFFFF) are more than 4 code units in length?
    >
    > Oh, I'm sorry. I forgot that the shrinking of the ISO/IEC 10646
    > codespace reduced the max length of UTF-8.
    >
    > Regards,
    > mpsuzuki
    >
    >


