Re: [OT] UTF-81920 was RE: Unicode forms for internal storage - BOCU-1 speed

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Jan 23 2004 - 08:12:22 EST

    From: "Jon Hanna" <jon@hackcraft.net>
    > Quoting Marco Cimarosti <marco.cimarosti@essetre.it>:
    >
    > > Jon Hanna wrote:
    > > > I refuse to rename my UTF-81920!
    > >
    > > Doug, Shlomi, there's a new one out there!
    > > Jon, would you mind describing it?
    >
    > There are two different UTF-81920s (the resultant ambiguity is very much in the
    > spirit of UTF-81920).

    I can't find any reference document about "UTF-81920" in Google.

    All I can find are documents describing "UTF-8", which encodes 128 characters
    in 1 byte and 1920 characters in 2 bytes.

    Does it mean that UTF-81920 is a restriction of UTF-8 to the range
    [U+0000..U+07FF], which can be encoded in at most 2 bytes with UTF-8?
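
    As a rough check of those figures, here is a minimal Python sketch. The
    helper name utf8_length is only illustrative and does not come from any
    UTF-81920 document. It counts how many code points below U+0800 take one
    byte and how many take two bytes in standard UTF-8.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes standard UTF-8 uses for one code point.

    Illustrative helper; surrogates (U+D800..U+DFFF) are not valid input.
    """
    return len(chr(code_point).encode("utf-8"))


# Scan the candidate range U+0000..U+07FF (0x0800 is the exclusive upper bound).
one_byte = sum(1 for cp in range(0x0000, 0x0800) if utf8_length(cp) == 1)
two_byte = sum(1 for cp in range(0x0000, 0x0800) if utf8_length(cp) == 2)

print(one_byte, two_byte)    # prints: 128 1920
print(utf8_length(0x0800))   # prints: 3 (first code point needing 3 bytes)
```

    The two counts add up to 2048 code points, which matches the guess that
    the name simply concatenates "UTF-8" with "1920".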

    UTF-81920 would then effectively not be a Unicode-compatible encoding scheme,
    as it would be restricted to Latin, Greek, Coptic, Cyrillic, Armenian,
    Hebrew and Arabic with their diacritics. It would exclude all Asian scripts,
    surrogates, compatibility characters, the Arabic/Hebrew extensions and
    presentation forms, common ligatures like "fi", currency signs (such as the
    Euro symbol coded at U+20AC), technical symbols, and even the BOM U+FEFF.
    Such an encoding does not even seem suitable for representing the legacy
    DOS/OEM codepages, or the legacy PostScript and Mac charsets.
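
    To illustrate which side of the two-byte limit a few of these characters
    fall on, here is a small Python sketch. It assumes the reading above, i.e.
    that UTF-81920 means "UTF-8 limited to sequences of at most 2 bytes"; the
    function name and the sample list are hypothetical.

```python
import unicodedata


def fits_in_two_bytes(ch: str) -> bool:
    """True if standard UTF-8 encodes this single character in 1 or 2 bytes."""
    return len(ch.encode("utf-8")) <= 2


# Sample characters: the first five lie at or below U+07FF, the last three above it.
samples = [
    "\u00E9",  # Latin small letter e with acute
    "\u03A9",  # Greek capital omega
    "\u0416",  # Cyrillic capital zhe
    "\u05D0",  # Hebrew alef
    "\u0627",  # Arabic alef
    "\u20AC",  # Euro sign
    "\uFB01",  # Latin small ligature fi
    "\uFEFF",  # BOM (zero width no-break space)
]

for ch in samples:
    name = unicodedata.name(ch, "<unnamed>")
    verdict = "fits" if fits_in_two_bytes(ch) else "excluded"
    print(f"U+{ord(ch):04X} {name}: {verdict}")
```

    Under that assumption the Euro sign, the "fi" ligature and the BOM all need
    three bytes, so a two-byte-only subset cannot carry them.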


