Doing Markup in Plain Text: A Modest Proposal for Planes 4-B of Unicode

From: jcowan@reutershealth.com
Date: Wed Mar 31 2004 - 16:49:06 EST

    XML has become the de facto standard for fancy text. It is therefore
    useful to explore ways and means of bringing XML into plain text,
    since obviously plain text is simpler than, and superior to, fancy text.
    The current method involving & and < and > and / and who knows what else
    is obviously much too complicated, and cannot interoperate with even the
    simplest plain text. Fortunately, the characters in planes 4 through
    B can come to our rescue.

    Plane 4 will be divided into mini-blocks of 32 (or perhaps 64) characters.
    The Unicode Consortium will allocate these on the usual basis (first come
    first served, once and for all, and free) to users for the representation
    of start-tags. For example, supposing that block 40000 was allocated to
    the W3C HTML WG, we might represent <html> as U+40000, <head> as U+40001,
    <body> as U+40002, and so on. In this way, the start-tag (exclusive
    of attributes and attribute values) is reduced to a single character.
    The last block will not be allocated; U+4FFFC will be used to indicate
    the beginning of a comment, and U+4FFFD the beginning of a processing
    instruction.
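
    As a minimal sketch of the start-tag scheme, assuming the hypothetical
    allocation of mini-block U+40000 to the HTML WG and a mini-block size
    of 32: only the three tag assignments named above come from the
    proposal; the function and constant names are illustrative.

```python
# Hypothetical allocation: the mini-block at U+40000 belongs to the HTML WG.
HTML_BLOCK_BASE = 0x40000
MINI_BLOCK_SIZE = 32  # the proposal says 32, or perhaps 64

# Only these three assignments appear in the proposal text.
HTML_TAGS = ["html", "head", "body"]

# Reserved characters in the last (unallocated) mini-block of Plane 4.
COMMENT_START = chr(0x4FFFC)
PI_START = chr(0x4FFFD)


def start_tag(name: str) -> str:
    """Return the single Plane 4 character standing for <name>."""
    offset = HTML_TAGS.index(name)  # raises ValueError for unknown tags
    if offset >= MINI_BLOCK_SIZE:
        raise ValueError("mini-block exhausted")
    return chr(HTML_BLOCK_BASE + offset)


def mini_block_of(ch: str) -> int:
    """Return the base code point of the mini-block containing ch."""
    cp = ord(ch)
    return cp - (cp % MINI_BLOCK_SIZE)
```

    With 32 characters per mini-block, Plane 4 holds 2048 mini-blocks,
    the last of which stays unallocated.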

    Plane 5 will be automatically assigned in parallel to plane 4 for the
    representation of end-tags: thus, U+50000 would be </html>. U+5FFFC and
    U+5FFFD will have the obvious meanings.
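
    Because Plane 5 parallels Plane 4, an end-tag can be derived from its
    start-tag by adding one plane (0x10000) to the code point. A small
    sketch, with an illustrative function name:

```python
PLANE_SIZE = 0x10000  # number of code points in one Unicode plane


def end_tag_for(start: str) -> str:
    """Map a Plane 4 start-tag character to its Plane 5 end-tag."""
    cp = ord(start)
    if not 0x40000 <= cp <= 0x4FFFF:
        raise ValueError("not a Plane 4 character")
    return chr(cp + PLANE_SIZE)
```

    The same arithmetic carries U+4FFFC and U+4FFFD over to U+5FFFC and
    U+5FFFD.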

    Plane 6 will also be allocated as mini-blocks and used for the
    representation of attribute names. If a Plane 4 character is followed
    by a Plane 6 character, then the start-tag has at least one attribute.
    The last mini-block will not be allocated; U+6FFFD will be used to indicate
    that the current tag has no more attributes.
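
    The attribute rule can be checked by looking at the plane of the
    character that follows a start-tag. The sketch below assumes that
    U+6FFFD appears only as a terminator and never counts as an attribute
    name; the proposal does not spell this out, and the helper names are
    illustrative.

```python
NO_MORE_ATTRIBUTES = chr(0x6FFFD)


def plane(ch: str) -> int:
    """Return the Unicode plane number of a single character."""
    return ord(ch) >> 16


def has_attributes(text: str, i: int) -> bool:
    """True if the Plane 4 start-tag at index i is followed by a
    Plane 6 attribute name (assumption: the terminator is not a name)."""
    if plane(text[i]) != 4:
        raise ValueError("no start-tag at this index")
    if i + 1 >= len(text):
        return False
    nxt = text[i + 1]
    return plane(nxt) == 6 and nxt != NO_MORE_ATTRIBUTES
```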

    Plane 7 is reserved for future use.

    Planes 8 through B are clones of planes 0 through 3 respectively,
    and are used to represent attribute value, comment, and processing
    instruction text. In this way, only character content is encoded using
    traditional Unicode characters.
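
    The clone planes amount to a fixed offset of eight planes (0x80000),
    assuming plane 0 maps to plane 8 and so on upward. A sketch of the
    round trip for attribute value, comment, or processing instruction
    text, with illustrative function names:

```python
CLONE_OFFSET = 0x80000  # plane 0 -> plane 8, plane 1 -> plane 9, ...


def to_clone(text: str) -> str:
    """Shift ordinary text (planes 0-3) into the clone planes."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp > 0x3FFFF:
            raise ValueError("only planes 0 through 3 have clones")
        out.append(chr(cp + CLONE_OFFSET))
    return "".join(out)


def from_clone(text: str) -> str:
    """Shift clone-plane text back to ordinary characters."""
    out = []
    for ch in text:
        cp = ord(ch)
        if not 0x80000 <= cp <= 0xBFFFF:
            raise ValueError("not a clone-plane character")
        out.append(chr(cp - CLONE_OFFSET))
    return "".join(out)
```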

    It is expected that a secondary market in mini-blocks would eventually
    arise.

    -- 
    "But I am the real Strider, fortunately,"       John Cowan
    he said, looking down at them with his face     jcowan@reutershealth.com
    softened by a sudden smile.  "I am Aragorn son  http://www.ccil.org/~/cowan
    of Arathorn, and if by life or death I can      http://www.reutershealth.com
    save you, I will."  --LotR Book I Chapter 10
    

