Re: What's the BMP being saved for?

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Fri Mar 19 2004 - 17:26:58 EST

Next message: Peter Kirk: "Re: Irish dotless I"

Previous message: Peter Constable: "RE: Sociolinguistics and orthography"
In reply to: Marion Gunn: "Re: What's the BMP being saved for?"

At 07:13 AM 3/19/2004, Marion Gunn wrote:
>Ar 15:33 +0000 2004/03/18, scríobh Arcane Jill:
> >This probably is going to sound like a really dumb question, but ... Is
> >the BMP being saved for something?
> >...
> >Arcane Jill
>
>There are never any dumb questions, Jill, only dumb answers.

And some of the latter deserve to be straightened out a bit.

>BMP is part of 10646-speak, and probably part of pre-Unicode terminology.

It used to be, but now this term is mentioned on page 1 of Unicode 4.0.
Michael's reply to this had it partially right:

At 10:57 AM 3/19/2004, Michael Everson wrote:
>This is incorrect. "BMP" means "Basic Multilingual Plane" and is the name
>given to the plane designated by the code positions 00000-0FFFF. It is not
>"10646-speak". It is part of the architectural nomenclature of the
>Universal Character Set.

Yes, and all that is 10646-speak, in the sense that BMP, UCS etc. are terms
from 10646. While it's correct to call The Unicode Standard a universal
character set, the title Universal Character Set is that of 10646.

>To summarize (telescoping time) so as to get this msg off before returning
>to paid work.:-)
>
>The decision to create the BMP dates back to a time when certain software
>suppliers were complaining that anything approaching a full implementation
>of ISO 10646 (later transmuted, so to speak, into Unicode) would be too big
>for them to handle, and too costly.

The fact that there is a 2-byte form (UCS-2) of 10646 is due to the merger
with Unicode, which was conceived of as a 16-bit standard. Before the
merger with Unicode, there were 1-byte and 3-byte forms as well, and the
2-byte form was quite a bit different from today's BMP in its basic layout
and behavior. For example, vast sections of it could be 'swapped' out to
effectively create a C, J, K and U version of the 2-byte form.

The Unicode camp felt that asking the world to support a 32-bit standard to
replace the hodgepodge of 8-bit character sets, etc., was asking too much.
Their initial belief that one could actually contain a universal character
set in 16 bits had begun to crack around that time, as can be witnessed by
the creation of UTFs (first UTF-8 and precursors, then UTF-16).

10646 was simplified to a static 2-byte and 4-byte form. Later, UTF-8 and
the surrogates needed for UTF-16 were added, leaving both standards with 3
equivalent encoding forms, plus the fixed-width 16-bit UCS-2, which exists
in 10646 only and is not so useful.
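
To make the three equivalent encoding forms concrete, here is a minimal
sketch in Python. It assumes the three forms are UTF-8, UTF-16 and the
4-byte form (called UTF-32 in Unicode terms); the helper name
surrogate_pair and the sample character U+20000 are chosen purely for
illustration.

```python
def surrogate_pair(cp):
    """Split a supplementary code point (U+10000..U+10FFFF) into the
    high and low surrogate code units used by UTF-16."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = cp - 0x10000              # 20 bits remain after the subtraction
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


ch = "\U00020000"                # a Plane 2 ideograph, used only as an example

utf8 = ch.encode("utf-8")        # 4 bytes
utf16 = ch.encode("utf-16-be")   # 2 code units = 4 bytes (a surrogate pair)
utf32 = ch.encode("utf-32-be")   # 1 code unit = 4 bytes

# All three forms decode back to the same single character.
for name, data in (("utf-8", utf8), ("utf-16-be", utf16), ("utf-32-be", utf32)):
    assert data.decode(name) == ch
```

A fixed-width 16-bit form has no way to represent this character at all,
since it has no surrogate mechanism.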

>Small local groups, such as ours, were then working rapidly and painlessly
>mostly on national and international character sets on far smaller scales.
>
>I recall chairing some discussion at a CEN workshop, possibly in Slovenia,
>in re something related, at the height of the debate. In any case, by that
>time, CEN had already emerged as a big player in this work (I think Unicode
>had yet to make much of a mark, but I don't mind if someone corrects me
>about that, if wrong, because it really doesn't matter now, in the least).
>
>Anyway, it was agreed to divide ISO 10646 into sections, such as BMP (Basic
>Multilingual Plane) and the MES (Minimum European Subset), and my own
>company, among others, was very pleased to be hired by CEN to do the
>necessary (a truly exciting and rewarding period, when we actually got
>_paid_, generously, if belatedly, for such Standards work!)

BMP and MES are certainly both sub-sets, but they are not on equal footing.
One is a contiguous sub-set of the code space, lined up with an even power
of 2, the other is a discontiguous sub-set of the *characters* in 10646
determined by rather unclear principles to be of use to Europeans.

>Is the BMP a reality, actually referenced in software, or scheduled to be
>so referenced in future? I doubt it, although I think that would be a very
>good thing (just as I believe the 8859 series and the like more practically
>useful, even today, as clean-cutting tools, than the full complement of
>10646, which remains a rather blunt instrument which creates obstacles in
>unflagged text).

The BMP is a handy concept and a practical tool for organizing code point
allocations; that's why the term made it from 10646 into Unicode. It
approximates the collection of frequently used characters from living
scripts; there are some exceptions to this, viz. the Hong Kong ideographs in
Plane 2 and Runic and Ogham on Plane 0.
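
Since each plane is a contiguous block of 65,536 code points, the plane
number is just the code point shifted right by 16 bits. A small sketch in
Python (the function name plane and the sample code points are my own
choices for illustration):

```python
def plane(cp):
    """Return the plane number (0..16) of a code point; plane 0 is the BMP."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("outside the code space")
    return cp >> 16


samples = {
    "OGHAM LETTER BEITH": 0x1681,   # historic script, yet on Plane 0
    "RUNIC LETTER FEHU": 0x16A0,    # historic script, yet on Plane 0
    "Plane 2 ideograph": 0x20000,   # first code point of Plane 2
}

for name, cp in samples.items():
    print(f"U+{cp:04X} {name}: plane {plane(cp)}")
```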

What's not so useful is UCS-2. The best use I've found for that term is as
a descriptive label on software that does not (yet) support supplementary
characters; so I'm hoping that use of the term will gradually expire.
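
One rough way to tell whether such software can handle a given piece of
text is to check that every character lies in the BMP. The sketch below is
in Python, where a str is a sequence of code points; the function name
fits_in_ucs2 is hypothetical.

```python
def fits_in_ucs2(text):
    """True if no character in text is a supplementary character,
    i.e. every code point is at most U+FFFF."""
    return all(ord(c) <= 0xFFFF for c in text)


print(fits_in_ucs2("Baile an Bh\u00f3thair"))    # BMP-only text
print(fits_in_ucs2("ideograph \U00020000"))      # contains a Plane 2 character
```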

>Justification for saving the BMP for the purposes originally intended is
>probably something the Unicode Consortium would be happy to clarify for
>you.

There've been some nice answers by others on the list who took the time to
put them together.

>Perhaps that has already been done in some of today's e-mails, which are
>too numerous for me to read right now, under pressure of urgent work. (I do
>promise to try to read them all.) If you want more info on the purpose and
>genesis of the BMP,

A little known fact is that representatives of Unicode participated in the
work on, and review of, 10646 long before the merger. We have no need to
study anyone else's archives ;-).

A./

>I suggest that you ask NSAI to let you study the
>archives of NSAI/AGITS/WG6 (later transmuted into NSAI/ICTSCC/SC4), or that you
>send a simple query directly to CEN (on whose live agenda such matters
>remain, I believe).
>
>Hope this helps,
>mg
>
>ps.
>Would someone just hit reply to this msg, to time our comms here? There
>seems to be a long timelag between sending and delivery of Unicode list
>msgs, sometimes.
>mg
>
>
>--
>Marion Gunn * EGTeo (Estab.1991)
>27 Páirc an Fhéithlinn, Baile an
>Bhóthair, Co. Átha Cliath, Éire.
>* mgunn@egt.ie * eamonn@egt.ie *

This archive was generated by hypermail 2.1.5 : Fri Mar 19 2004 - 18:04:04 EST