Re: [REPOST, LONG] XML and tags (LONG)

From: Doug Ewell (
Date: Fri Feb 21 2003 - 12:10:36 EST

    Marco Cimarosti <marco dot cimarosti at essetre dot it> wrote:

    > (Warning: I have probably succeeded in the impossible task of being
    > more verbose than Mr. Overington. Please start reading only if you
    > have some free time... :-)

    There's a difference between "verbose," which implies a high ratio of
    words per idea, and "long." Marco's post was definitely long, but of

    > I will be pretending that William is "Overington Inc.", one of the key
    > customers of the company I work with, and that they are asking me to
    > implement a protocol to send text over the famous "Overington
    > Multimedia Broadcasting (OMB)", with the following requirements:

    William took exception to being "reduced" to a company in this way, but
    I think it makes the scenario a bit more realistic. In the software
    business, our customers are usually companies rather than individuals.
    The net result of this is that more than one person is responsible for
    the customer requirements, and trying to get clarifications or
    modifications to them takes more than a simple one-on-one chat.

    > 1. The text MUST be transmitted in UTF-8 (because the CEO of
    > Overington Inc. thinks that UTF-8 is cute).

    That's a perfectly legitimate requirement. BTW, I think UTF-8 is cute
    too. :-)

    > I convert the sample text file to XML (see <wo.xml> in the attached
    > ZIP file), and here comes the first surprise: while the Plane-14
    > tagged file <wo.txt> was 445 bytes long, the XML file is only 322
    > bytes long!
    > This seems strange, at first: because of the "/" each pair of my XML
    > language tags is one character longer than the corresponding pair of
    > Plane-14 tags. Moreover, the syntactical overhead in X.1 above cannot
    > be less than 30 characters. Of course, the reason for the 123-byte
    > saving is that, in UTF-8, the characters composing XML tags only take
    > one byte each, while Plane-14 tag characters take four bytes each.
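
    The byte arithmetic is easy to check. Here is a minimal Python sketch,
    not Marco's actual test files: the helper names are made up for this
    example, and the tag is built the usual way (U+E0001 LANGUAGE TAG, then
    each ASCII character shifted up by 0xE0000).

```python
def to_plane14_tag(lang):
    """Build a Plane 14 language tag: U+E0001 followed by tag characters.

    Each tag character is the ASCII character's code point plus 0xE0000.
    (Hypothetical helper, for illustration only.)
    """
    return "\U000E0001" + "".join(chr(0xE0000 + ord(ch)) for ch in lang)


def utf8_len(text):
    """Number of bytes `text` occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


plane14 = to_plane14_tag("en")   # 3 code points: LANGUAGE TAG, tag e, tag n
ascii_text = "en"                # the same letters as ordinary ASCII

print(utf8_len(plane14))     # 12 (four bytes per tag character)
print(utf8_len(ascii_text))  # 2 (one byte per ASCII character)
```

    Every character in the Plane 14 tag costs four bytes in UTF-8, so even
    with XML's extra punctuation the ASCII markup comes out smaller.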

    Too bad the customer in this scenario didn't think SCSU was cute.

    > a. An XML file is human readable and may be edited with any text
    > editor; although the Plane-14 file claims to be "plain text", each
    > language tag character appears as three black boxes in any UTF-8
    > editor (and as twelve random "accented" characters in a non-UTF-8
    > editor).

    While I'm no longer in the business of defending Plane 14 tags, it
    should be mentioned that rendering engines are *not* supposed to display
    tag characters as black boxes (although they all do). From UAX #27,
    Section 13.7: "... the tag characters themselves have no display and do
    not affect line breaking, character shaping or joining, or any other
    format or layout properties."
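
    To show what "no display" means for the visible text, here is a rough
    Python sketch. It assumes the Tags block is U+E0000..U+E007F, and it
    simply drops those characters; a real rendering engine would keep them
    in the backing store and merely not draw them. The function name is
    hypothetical.

```python
TAG_BLOCK_START = 0xE0000
TAG_BLOCK_END = 0xE007F


def visible_text(text):
    """Return `text` without Plane 14 tag characters (illustration only)."""
    return "".join(
        ch for ch in text
        if not TAG_BLOCK_START <= ord(ch) <= TAG_BLOCK_END
    )


# LANGUAGE TAG + tag "e" + tag "n", followed by ordinary text.
tagged = "\U000E0001\U000E0065\U000E006Ehello"
print(visible_text(tagged))  # hello
```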

    As for the non-UTF-8 editor, well, UTF-8 was a customer requirement, so
    not only will the tags display badly, so will every other character
    outside the Basic Latin range.
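
    A quick Python sketch of that effect, assuming the "non-UTF-8 editor"
    treats each byte as one Latin-1 character (an assumption for the sake of
    the example; other legacy code pages give different garbage):

```python
def view_as_latin1(text):
    """Encode as UTF-8, then misread every byte as a Latin-1 character."""
    return text.encode("utf-8").decode("latin-1")


print(view_as_latin1("A"))                # A (Basic Latin survives)
print(view_as_latin1("é"))                # Ã© (two bytes, two wrong characters)
print(len(view_as_latin1("\U000E0065")))  # 4 (one tag character, four bytes)
```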

    But the rest of Marco's arguments for XML are certainly sound. In
    particular, XML information and support is everywhere, and as soon as
    the functional requirements expand beyond language tagging, Plane 14
    tags are no longer adequate.

    -Doug Ewell
     Fullerton, California
