Re: [OT] UTF-81920 was RE: Unicode forms for internal storage - BOCU-1 speed

From: Jon Hanna (jon@hackcraft.net)
Date: Fri Jan 23 2004 - 08:45:14 EST

Next message: Markus Scherer: "Re: Unicode forms for internal storage - BOCU-1 speed"

Previous message: Philippe Verdy: "Re: [OT] UTF-81920 was RE: Unicode forms for internal storage - BOCU-1 speed"
In reply to: Philippe Verdy: "Re: [OT] UTF-81920 was RE: Unicode forms for internal storage - BOCU-1 speed"

Quoting Philippe Verdy <verdy_p@wanadoo.fr>:

> From: "Jon Hanna" <jon@hackcraft.net>
> > Quoting Marco Cimarosti <marco.cimarosti@essetre.it>:
> >
> > > Jon Hanna wrote:
> > > > I refuse to rename my UTF-81920!
> > >
> > > Doug, Shlomi, there's a new one out there!
> > > Jon, would you mind describing it?
> >
> > There are two different UTF-81920s (the resultant ambiguity is very much in
> > the spirit of UTF-81920).
>
> I can't find any reference document about "UTF-81920" in Google.

That's because there are no documents about UTF-81920. It barely qualifies as
the starting point of a gedankenexperiment, never mind as a spec. That's why
this thread is marked as OT. The closest thing to a spec is the email I just
sent to this list.

> All I can find is documents describing "UTF-8" which encodes 128 characters
> on 1 byte, and 1920 characters on 2 bytes.

Excellent, the inclusion of "1920" in the name is then wonderfully
serendipitous.
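
For anyone who wants to check the 128 and 1920 figures quoted above, here is a minimal sketch in Python. It simply counts how many code points standard UTF-8 encodes in exactly one and exactly two bytes; the variable names are illustrative only, and nothing here describes UTF-81920 itself.

```python
# Count the code points that UTF-8 encodes in exactly 1 and exactly 2 bytes.
# range(0x800) covers U+0000..U+07FF, the range reachable with at most 2 bytes.
one_byte = sum(1 for cp in range(0x800) if len(chr(cp).encode("utf-8")) == 1)
two_byte = sum(1 for cp in range(0x800) if len(chr(cp).encode("utf-8")) == 2)

# The first code point past that range needs a third byte.
first_three_byte = len(chr(0x800).encode("utf-8"))

print(one_byte, two_byte, first_three_byte)  # prints: 128 1920 3
```

The two-byte count is just 0x800 - 0x80 = 2048 - 128 = 1920.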

> Does it mean that UTF-81920 is a restriction of UTF-8 to the range
> [U+0000..U+07FF] which can be encoded with at most 2 bytes with UTF-8?

No, it is as explained in the email.

> UTF-81920 would then effectively not be a Unicode-compatible encoding scheme
> as it would be restricted to only Latin, Greek, Coptic, Cyrillic, Armenian,
> Hebrew and Arabic with their diacritics, excluding all Asian scripts,
> surrogates, and compatibility characters, Arabic/Hebrew extension, common
> ligatures like "fi" and presentation forms, as well as currency signs (such
> as the Euro symbol coded at U+20AC), technical symbols, and even the BOM
> U+FEFF? This encoding does not seem suitable to even represent successfully
> the legacy DOS/OEM codepages, or the legacy PostScript and Mac charsets.

Yes, daydream concepts mentioned in jest do often have technical
shortcomings.
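
Purely as an illustration of the quoted objection, and assuming the hypothetical reading in which an encoding stops at two UTF-8 bytes (which is not what UTF-81920 was described as), the sketch below checks how many bytes standard UTF-8 needs for a few of the characters mentioned. The helper name utf8_len and the choice of sample characters are mine.

```python
import unicodedata

def utf8_len(ch: str) -> int:
    """Return the number of bytes standard UTF-8 uses for one character."""
    return len(ch.encode("utf-8"))

# Sample characters picked for illustration: the Euro sign, the BOM, the "fi"
# ligature, plus one Hebrew and one Arabic letter for comparison.
samples = ["\u20ac", "\ufeff", "\ufb01", "\u05d0", "\u0627"]

for ch in samples:
    name = unicodedata.name(ch)
    fits = "fits" if utf8_len(ch) <= 2 else "does not fit"
    print(f"U+{ord(ch):04X} {name}: {utf8_len(ch)} byte(s), {fits} in 2 bytes")
```

Under that assumption, U+20AC, U+FEFF and U+FB01 each need three bytes and would be excluded, while the basic Hebrew and Arabic letters still fit in two.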

-- 
Jon Hanna
<http://www.hackcraft.net/>
*Thought provoking quote goes here*


This archive was generated by hypermail 2.1.5 : Fri Jan 23 2004 - 10:34:55 EST