Re: UTF-c

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Feb 22 2011 - 10:26:16 CST

    2011/2/22 Doug Ewell <doug@ewellic.org>:
    > Now if Cropley's algorithm is being presented as a replacement or
    > alternative to UTF-8, then it does need to be evaluated on criteria like
    > these, and Suzuki-san's observations become very relevant.

    I had posted the same two observations in a prior message.
    But I also explained that the BOM-like system was ill-conceived and
    unnecessary. Code switching can be implemented perfectly well without
    this hack, and without breaking the UTF requirements.

    Cropley has not seen that his scheme leaves more separate codes
    available to do that. It is safe to reuse the surrogates range to
    encode such a special function as a 3-byte sequence, signalling that a
    code page switch has occurred at the previously encoded character, and
    specifying where the code page starts, or how long it is, and which
    base page was altered (if several pages smaller than 64 characters
    can be remapped). This requires only a few data bits, and the 15 bits
    in the unused surrogates range are ample to specify it in a single
    3-byte function code, without needing any "magic" table, while
    supporting all evolutions of the standard. And if more bits are
    needed, there are still a lot of unused scalar values starting at
    0x110000, encodable with 4-byte sequences.
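
    A minimal sketch of such a function code payload, taking the 15-bit
    budget above at face value. The 64-aligned page start and the names
    pack_page_switch / unpack_page_switch are assumptions made for
    illustration only; they are not part of Cropley's scheme.

```python
PAGE_SIZE = 64            # size of the remappable 1-byte page
PAYLOAD_BITS = 15         # bit budget quoted above for one function code
UNICODE_LIMIT = 0x110000  # first value past the Unicode code space


def pack_page_switch(page_start: int) -> int:
    """Return a hypothetical payload naming a 64-aligned page start."""
    if not 0 <= page_start < UNICODE_LIMIT:
        raise ValueError("page start outside the Unicode code space")
    if page_start % PAGE_SIZE:
        raise ValueError("page start must be 64-aligned in this sketch")
    payload = page_start // PAGE_SIZE
    # 0x110000 / 64 = 17408 possible pages, which is below 2**15 = 32768.
    assert payload < (1 << PAYLOAD_BITS)
    return payload


def unpack_page_switch(payload: int) -> int:
    """Recover the page start from the hypothetical payload."""
    return payload * PAGE_SIZE
```

    With this layout a page start anywhere in the code space uses the
    whole budget. Restricting pages to the BMP would need only 10 bits
    (0x10000 / 64 = 1024 pages), leaving 5 bits for a length or a
    base-page selector.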

    (However, inserting code switching functions may expose
    implementations to problems such as correctly sizing the target
    buffer for the worst case to avoid buffer overflows, something that
    should not occur if code switching is used properly, that is, to
    effectively reduce the encoded size.)
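
    A rough sketch of the sizing concern, assuming at most 4 bytes per
    character and 3 bytes per switching function as discussed above. The
    function names and the one-byte-saved-per-character figure are
    illustrative assumptions.

```python
def worst_case_size(num_chars: int, max_switches: int,
                    max_char_bytes: int = 4, switch_bytes: int = 3) -> int:
    """Upper bound on output bytes if every character takes the longest
    form and every permitted switch is actually emitted."""
    return num_chars * max_char_bytes + max_switches * switch_bytes


def switch_is_worthwhile(run_length: int, bytes_saved_per_char: int = 1,
                         switch_bytes: int = 3) -> bool:
    """True if switching pages for a run of characters saves more bytes
    than the switching function itself costs."""
    return run_length * bytes_saved_per_char > switch_bytes
```

    An encoder that only emits a switch when switch_is_worthwhile() holds
    never produces more bytes than it would without switching, so the
    plain bound of max_char_bytes per character remains sufficient.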

    Yes, there is currently a sync problem with 2-byte encoded characters
    (if one byte gets deleted), but they lie in a Unicode range
    (0x80..0x407F) whose characters extremely rarely occur in very long
    uninterrupted runs (this range is used by scripts that also make
    abundant use of spaces and ASCII punctuation, in addition to controls
    and line breaks), so the need to resynchronize on newlines is already
    satisfied.
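
    One way to check this claim on real text is to measure the longest
    uninterrupted run of characters from the 2-byte range. The helper
    below is only a sketch working on code points; it does not depend on
    the byte layout, and its name is made up for illustration.

```python
def longest_two_byte_run(text: str) -> int:
    """Length of the longest run of consecutive characters whose code
    points fall in 0x80..0x407F; any other character ends the run."""
    longest = current = 0
    for ch in text:
        if 0x80 <= ord(ch) <= 0x407F:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest
```

    For ordinary prose in these scripts the result is typically about the
    length of a word, since spaces and ASCII punctuation break the runs.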

    Note also that if the selected 1-byte encoded range (of 64 characters)
    falls within 0x80..0x4080, then a part of this range is also encodable
    as 2 bytes (but Cropley wanted to exclude this case by forcing the
    shortest code). This means that the 2-byte encodable range may extend
    to 0x80..0x40BF, if the selected page falls anywhere in this range,
    so the 3-byte encoded sequences could start at 0x40C0 instead of
    0x4080 (not much of an improvement).
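
    The arithmetic can be sketched as follows, assuming the 2-byte form
    carries 14 payload bits (0x80..0x407F holds exactly 0x4000 values)
    and that the 64 codes duplicated by the selected page are skipped.
    The function name and the index numbering are illustrative
    assumptions, not Cropley's definition.

```python
TWO_BYTE_FIRST = 0x80
TWO_BYTE_SLOTS = 1 << 14   # 0x80..0x407F holds 0x4000 values
PAGE_SIZE = 64


def two_byte_index(cp, page_start=None):
    """Return the 14-bit index of cp in the 2-byte range, or None if cp
    is not encoded as 2 bytes (1-byte page member, or out of range)."""
    last = TWO_BYTE_FIRST + TWO_BYTE_SLOTS - 1            # 0x407F
    page_inside = (page_start is not None
                   and TWO_BYTE_FIRST <= page_start
                   and page_start + PAGE_SIZE - 1 <= last + PAGE_SIZE)
    if page_inside:
        last += PAGE_SIZE                                  # 0x40BF
        if page_start <= cp < page_start + PAGE_SIZE:
            return None        # shortest form is the 1-byte page code
    if not TWO_BYTE_FIRST <= cp <= last:
        return None
    index = cp - TWO_BYTE_FIRST
    if page_inside and cp >= page_start + PAGE_SIZE:
        index -= PAGE_SIZE     # close the gap left by the page
    return index
```

    With the Cyrillic page at 0x400 selected, two_byte_index(0x40BF, 0x400)
    gives 0x3FFF, the last available slot, which is where the 64 extra
    code points come from.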

    An alternative could instead use this conditionally unused range of 64
    codes (depending on the selected code page) for some extra code
    switching functions, or for no-op resync codes (inserted in very long
    runs of 2-byte encoded characters).

    Another variant could also use the 2-byte encoded range to encode
    larger scripts (of up to 4096 characters), using code switching as
    well (in that case, there would still be 192 characters encoded as 1
    byte, including the ASCII page and the selectable 64-character page).
    It could be used for syllabaries or large alphabets (including Nko, or
    basic CJK ideographs, or Hiragana+Katakana, or Hangul in decomposed
    jamo form, but also extended Latin, Cyrillic, Arabic), all other
    characters still requiring 3-byte or 4-byte sequences in this case.
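
    A small sketch of this variant, assuming one selectable 64-character
    page and one selectable 4096-character window. The function name and
    both parameters are hypothetical, and the boundary between 3-byte and
    4-byte forms is left open because it is not specified here.

```python
ASCII_LIMIT = 0x80     # 128 ASCII characters, always 1 byte
PAGE_SIZE = 64         # selectable 1-byte page
WINDOW_SIZE = 4096     # selectable 2-byte window

# 128 ASCII codes plus the 64-character page give the 192 one-byte codes.
ONE_BYTE_CODES = ASCII_LIMIT + PAGE_SIZE


def short_form_length(cp: int, page_start: int, window_start: int):
    """Return 1 or 2 for characters with a short form, or None when the
    character needs a 3-byte or 4-byte sequence."""
    if cp < ASCII_LIMIT or page_start <= cp < page_start + PAGE_SIZE:
        return 1
    if window_start <= cp < window_start + WINDOW_SIZE:
        return 2
    return None
```

    For example, with the page set to 0x3040 (start of Hiragana) and the
    window set to 0x4E00 (start of the basic CJK ideographs), U+3042 takes
    1 byte and U+4E2D takes 2 bytes.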


