Re: [OT] UTF-81920 was RE: Unicode forms for internal storage - BOCU-1 speed

From: Jon Hanna (jon@hackcraft.net)
Date: Fri Jan 23 2004 - 08:45:14 EST


    Quoting Philippe Verdy <verdy_p@wanadoo.fr>:

    > From: "Jon Hanna" <jon@hackcraft.net>
    > > Quoting Marco Cimarosti <marco.cimarosti@essetre.it>:
    > >
    > > > Jon Hanna wrote:
    > > > > I refuse to rename my UTF-81920!
    > > >
    > > > Doug, Shlomi, there's a new one out there!
    > > > Jon, would you mind describing it?
    > >
    > > There are two different UTF-81920s (the resultant ambiguity is very much
    > > in the spirit of UTF-81920).
    >
    > I can't find any reference document about "UTF-81920" in Google.

    That's because there are no documents about UTF-81920. It barely qualifies as
    the starting point of a gedankenexperiment, never mind as a spec. That's why
    this thread is marked as OT. The closest thing to a spec is the email I just
    sent to this list.

    > All I can find is documents describing "UTF-8" which encodes 128 characters
    > on 1 byte, and 1920 characters on 2 bytes.

    Excellent, the inclusion of "1920" in the name is then wonderfully
    serendipitous.
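
    The 128 and 1920 figures quoted above can be checked with a quick count.
    This is only a sketch of that arithmetic, not anything resembling a
    definition of UTF-81920; the variable names are made up for illustration.

```python
# Count the code points that UTF-8 encodes in exactly one byte and in
# exactly two bytes. Everything below U+0800 fits in at most two bytes,
# so scanning that range is enough.
one_byte = sum(1 for cp in range(0x800) if len(chr(cp).encode("utf-8")) == 1)
two_byte = sum(1 for cp in range(0x800) if len(chr(cp).encode("utf-8")) == 2)

print(one_byte, two_byte)  # 128 1920
```

    The one-byte forms cover U+0000..U+007F and the two-byte forms cover
    U+0080..U+07FF, which is 0x800 - 0x80 = 1920 code points.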

    > Does it mean that UTF-81920 is a restriction of UTF-8 to the range
    > [U+0000..U+07FF] which can be encoded with at most 2 bytes with UTF-8?

    No, it is as explained in the email.

    > UTF-81920 would then effectively not be a Unicode-compatible encoding scheme
    > as it would be restricted to only Latin, Greek, Coptic, Cyrillic, Armenian,
    > Hebrew and Arabic with their diacritics, excluding all Asian scripts,
    > surrogates, and compatibility characters, Arabic/Hebrew extension, common
    > ligatures like "fi" and presentation forms, as well as currency signs (such
    > as the Euro symbol coded at U+20AC), technical symbols, and even the BOM
    > U+FEFF? This encoding does not seem suitable to even represent successfully
    > the legacy DOS/OEM codepages, or the legacy PostScript and Mac charsets.

    Yes, daydream concepts mentioned in jest do often have technical
    shortcomings.

    -- 
    Jon Hanna
    <http://www.hackcraft.net/>
    *Thought provoking quote goes here*
    

