Re: What's the BMP being saved for?

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Fri Mar 19 2004 - 17:26:58 EST


    At 07:13 AM 3/19/2004, Marion Gunn wrote:
    >Ar 15:33 +0000 2004/03/18, scríobh Arcane Jill:
    > >This probably is going to sound like a really dumb question, but ... Is
    > >the BMP being saved for something?
    > >...
    > >Arcane Jill
    >
    >There are never any dumb questions, Jill, only dumb answers.

    And some of the latter deserve to be straightened out a bit.

    >BMP is part of 10646-speak, and probably part of pre-Unicode terminology.

    It used to be, but now this term is mentioned on page 1 of Unicode 4.0.
    Michael's reply to this had it partially right:

    At 10:57 AM 3/19/2004, Michael Everson wrote:
    >This is incorrect. "BMP" means "Basic Multilingual Plane" and is the name
    >given to the plane designated by the code positions 00000-0FFFF. It is not
    >"10646-speak". It is part of the architectural nomenclature of the
    >Universal Character Set.

    Yes, and all that is 10646-speak, in the sense that BMP, UCS etc. are terms
    from 10646. While it's correct to call The Unicode Standard a universal
    character set, the title Universal Character Set is that of 10646.

    >To summarize (telescoping time) so as to get this msg off before returning
    >to paid work.:-)
    >
    >The decision to create the BMP dates back to a time when certain software
    >suppliers were complaining that anything approaching a full implementation
    >of ISO 10646 (later transmuted, so to speak, into Unicode) would be too big
    >for them to handle, and too costly.

    The fact that there is a 2-byte form (UCS-2) of 10646 is due to the merger
    with Unicode, which was conceived of as a 16-bit standard. Before the
    merger with Unicode, there were 1-byte and 3-byte forms as well, and the
    2-byte form was quite a bit different from today's BMP in its basic layout
    and behavior. For example, vast sections of it could be 'swapped' out to
    effectively create a C, J, K and U version of the 2-byte form.

    The Unicode camp felt that asking the world to support a 32-bit standard to
    replace the hodgepodge of 8-bit character sets, etc., was asking too much.
    Their initial belief that one could actually contain a universal character
    set in 16 bits had begun to crack around that time, as can be witnessed by
    the creation of UTFs (first UTF-8 and precursors, then UTF-16).

    10646 was simplified to a static 2-byte and 4-byte form. Later, UTF-8 and
    the surrogates needed for UTF-16 were added, leaving both standards with 3
    equivalent encoding forms, plus the fixed-width 16-bit UCS-2, which exists
    in 10646 only and is not so useful.
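
A minimal sketch of what "equivalent encoding forms" means in practice, using Python's built-in codecs. The helper names `code_units` and `surrogate_pair` are made up for this illustration. The same character round-trips through all three forms, and only the number of code units differs; a supplementary character needs two 16-bit units (a surrogate pair) in UTF-16.

```python
def code_units(ch: str) -> dict:
    """Count the code units one character needs in each encoding form."""
    return {
        "utf-8": len(ch.encode("utf-8")),            # 8-bit code units
        "utf-16": len(ch.encode("utf-16-be")) // 2,  # 16-bit code units
        "utf-32": len(ch.encode("utf-32-be")) // 4,  # 32-bit code units
    }


def surrogate_pair(cp: int) -> tuple:
    """Split a supplementary code point into its UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only supplementary code points have surrogate pairs")
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


bmp_char = "\u1680"           # a BMP character (in the Ogham block)
supp_char = "\U00020000"      # first code point of Plane 2

print(code_units(bmp_char))
print(code_units(supp_char))
print([hex(u) for u in surrogate_pair(ord(supp_char))])
```

The surrogate arithmetic is the standard UTF-16 rule: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves.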

    >Small local groups, such as ours, were then working rapidly and painlessly
    >mostly on national and international character sets on far smaller scales.
    >
    >I recall chairing some discussion at a CEN workshop, possibly in Slovenia,
    >in re something related, at the height of the debate. In any case, by that
    >time, CEN had already emerged as a big player in this work (I think Unicode
    >had yet to make much of a mark, but I don't mind if someone corrects me
    >about that, if wrong, because it really doesn't matter now, in the least).
    >
    >Anyway, it was agreed to divide ISO 10646 into sections, such as BMP (Basic
    >Multilingual Plane) and the MES (Minimum European Subset), and my own
    >company, among others, was very pleased to be hired by CEN to do the
    >necessary (a truly exciting and rewarding period, when we actually got
    >_paid_, generously, if belatedly, for such Standards work!)

    BMP and MES are certainly both sub-sets, but they are not on equal footing.
    One is a contiguous sub-set of the code space, lined up with an even power
    of 2, the other is a discontiguous sub-set of the *characters* in 10646
    determined by rather unclear principles to be of use to Europeans.

    >Is the BMP a reality, actually referenced in software, or scheduled to be
    >so referenced in future? I doubt it, although I think that would be a very
    >good thing (just as I believe the 8859 series and the like more practically
    >useful, even today, as clean-cutting tools, than the full complement of
    >10646, which remains a rather blunt instrument which creates obstacles in
    >unflagged text).

    The BMP is a handy concept, and a practical tool to organize code point
    allocations; that's why the term made it from 10646 into Unicode. It
    approximates the collection of frequently used characters from living
    scripts; there are some exceptions to this, viz. the Hong Kong ideographs
    in Plane 2 and Runic and Ogham on Plane 0.
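
To make the plane arithmetic concrete, here is a small sketch (the function names `plane` and `in_bmp` are hypothetical, chosen for this example). Each plane holds 0x10000 code points, so the plane number is the code point shifted right by 16 bits, and the BMP is simply plane 0.

```python
def plane(cp: int) -> int:
    """Return the plane number (0..16) of a code point."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of range")
    return cp >> 16  # each plane spans 0x10000 code points


def in_bmp(cp: int) -> bool:
    """True if the code point lies in the Basic Multilingual Plane."""
    return plane(cp) == 0


# U+1680 is in the Ogham block, U+16A0 in the Runic block,
# and U+20000 is the first code point of Plane 2.
for cp in (0x1680, 0x16A0, 0xFFFF, 0x10000, 0x20000):
    print(f"U+{cp:04X}: plane {plane(cp)}, BMP={in_bmp(cp)}")
```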

    What's not so useful is UCS-2. The best use I've found for that term is as
    a descriptive label on software that does not (yet) support supplementary
    characters; so I'm hoping that use of the term will gradually expire.
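
As a rough illustration of what "UCS-2 only" software amounts to, the sketch below (with hypothetical helpers `fits_ucs2` and `encode_ucs2`) accepts text only when every character is in the BMP. It assumes big-endian byte order and relies on the fact that, for BMP-only text, UCS-2 and UTF-16 produce the same bytes.

```python
def fits_ucs2(text: str) -> bool:
    """True if every character can be stored in one 16-bit unit."""
    return all(ord(ch) <= 0xFFFF for ch in text)


def encode_ucs2(text: str) -> bytes:
    """Encode BMP-only text as big-endian UCS-2; reject anything else."""
    if not fits_ucs2(text):
        raise ValueError("text contains supplementary characters")
    # For BMP-only text the UTF-16 bytes are identical to UCS-2.
    return text.encode("utf-16-be")


print(fits_ucs2("Ogham: \u1680"))     # BMP only
print(fits_ucs2("CJK: \U00020000"))   # contains a Plane 2 character
print(encode_ucs2("AB").hex())
```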

    >Justification for saving the BMP for the purposes originally intended is
    >probably something the Unicode Consortium would be happy to clarify for
    >you.

    There've been some nice answers by others on the list who took the time to
    put them together.

    >Perhaps that has already been done in some of today's e-mails, which are
    >too numerous for me to read right now, under pressure of urgent work. (I do
    >promise to try to read them all.) If you want more info on the purpose and
    >genesis of the BMP,

    A little-known fact is that representatives of Unicode participated in the
    work on, and review of, 10646 long before the merger. We have no need to
    study anyone else's archives ;-).

    A./

    >I suggest that you ask NSAI to let you study the
    >archives of NSAI/AGITS/WG6 (later transmuted into NSAI/ICTSCC/SC4), or that you
    >send a simple query directly to CEN (on whose live agenda such matters
    >remain, I believe).
    >
    >Hope this helps,
    >mg
    >
    >ps.
    >Would someone just hit reply to this msg, to time our comms here? There
    >seems to be a long timelag between sending and delivery of Unicode list
    >msgs, sometimes.
    >mg
    >
    >
    >--
    >Marion Gunn * EGTeo (Estab.1991)
    >27 Páirc an Fhéithlinn, Baile an
    >Bhóthair, Co. Átha Cliath, Éire.
    >* mgunn@egt.ie * eamonn@egt.ie *


