XML and tags (LONG) (derives from Re: Plane 14 Tag Deprecation Issue)

From: William Overington (WOverington@ngo.globalnet.co.uk)
Date: Mon Feb 17 2003 - 05:28:55 EST


    Two posts in the Unicode list in the last few days advocate using XML rather
    than using plane 14 tags.

    I knew very little XML so I started to learn some more so as to assess the
    matter of whether there is any good reason for using XML rather than plane
    14 tags. Certainly no reasons were stated in the posts.

    I searched for XML at http://www.ask.co.uk and found, amongst other results,
    the following.


    This is the first page of a set of seven pages introducing XML. I found
    this set of documents very useful.

    I also found the following.


    I have read both the above.

    I also found the following FAQ.

    The XML FAQ


    I have had a look through that document.

    I also found the following.

    Extensible Markup Language (XML) 1.0 (Second Edition)


    I have glanced at that but not yet in any depth.

    The more I read about XML the less reason there seems to be to use XML
    instead of tags!

    I can understand that there are many possible applications for XML, but any
    particular set of named elements defined for an XML document seems by its
    nature to be at the same level of standardization and interoperability from
    one person to another as a Private Use Area collection of Unicode codes.
    That is, it might well work and be beneficial amongst a group of people,
    yet any particular chosen format (that is, any end user designed language
    produced using the XML metalanguage) simply does not have the rigorous
    definition and formal standardization needed to be useful generally
    throughout computing as a standard for information exchange. Large
    companies are, in my view, unlikely to accept as a standard a language
    which is produced other than by a formal standards body.

    Also, as a scientific applications programmer, I do not understand how
    writing a Java program to act in response to a file of XML text could be
    anything other than much harder than writing one for a file containing
    plane 14 tags, or individual codes such as in eutocode graphics. Am I
    missing something? In particular, for the DVB-MHP (Digital Video
    Broadcasting - Multimedia Home Platform) there is a need to keep the
    programs as small as possible and to keep text files as small as possible.
    Plane 14 tags and a eutocode graphics system offer ease of decoding,
    compactness of documents and ease of preparation of documents starting from
    the main plain text body of the document.

    I accept that generating information source documents might be easier using
    XML for the more complex uses of markup systems, but a Java program can
    then be used locally to convert them into the tag format for interchange or
    storage, thereby cutting down on storage space and complexity of decoding
    at the receiving end.
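
    As a rough illustration of such a local conversion step, here is a minimal
    sketch in Python (chosen for brevity; the same logic would carry over to
    Java). The element name doc and the attributes lang and ddc are invented
    for this example, and U+E0002 is the classification introducer suggested
    in this thread, not an assigned Unicode character.

```python
import xml.etree.ElementTree as ET

TAG_OFFSET = 0xE0000                # tag characters U+E0020..U+E007E mirror ASCII 0x20..0x7E
LANGUAGE_TAG = "\U000E0001"         # LANGUAGE TAG introducer in plane 14
CLASSIFICATION_TAG = "\U000E0002"   # hypothetical: proposed in this thread, not assigned


def to_tag_chars(ascii_text):
    """Map printable ASCII characters onto the matching plane 14 tag characters."""
    for ch in ascii_text:
        if not 0x20 <= ord(ch) <= 0x7E:
            raise ValueError("only printable ASCII can be expressed as tag characters")
    return "".join(chr(TAG_OFFSET + ord(ch)) for ch in ascii_text)


def xml_to_tagged(xml_source):
    """Convert a one-element XML document (assumed layout) into tagged plain text."""
    root = ET.fromstring(xml_source)
    header = ""
    lang = root.get("lang")          # assumed attribute name
    if lang:
        header += LANGUAGE_TAG + to_tag_chars(lang)
    ddc = root.get("ddc")            # assumed attribute name for the Dewey class
    if ddc:
        header += CLASSIFICATION_TAG + to_tag_chars(ddc)
    return header + (root.text or "")


source = '<doc lang="ru" ddc="608">Привет</doc>'
tagged = xml_to_tagged(source)
# Compare the encoded sizes of the two forms.
print(len(source.encode("utf-8")), len(tagged.encode("utf-8")))
```

    Note that each plane 14 character occupies four bytes in both UTF-8 and
    UTF-16, so any saving in size comes from dropping element names and
    delimiters rather than from the tag characters themselves.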

    Please know that I am not in any way criticising XML or suggesting that it
    is anything other than useful in many situations. The matter under
    discussion is why it is being claimed that XML is better than
    tags for specific applications. My feeling is that there is room for both
    tags and XML as facilities for people to use. Which is used in any
    particular application depends upon the application. There is an overlap of
    areas of application where both will do, yet there are areas of application
    where each has its own particular advantages. I read recently of how XML is
    being used to produce a format for marshalling content from content
    originators to broadcasters of interactive television services. That seems
    a good use of XML, as it gives an English-like layout of information for
    use within a particular user community. Yet there is a qualitative
    difference between
    that and having a tag system where everybody uses the same encoding for all
    sorts of applications, such as finding all documents which have been tagged
    with a particular Dewey Decimal Classification indicating the nature of the
    subject area of the document.

    If the Cyrillic document example and the haiku example from my posting,
    together with eutocode graphics, are all considered, is using XML instead
    of my encoding methods an improvement in any way whatsoever? I am
    genuinely puzzled over this. Am I missing something, or are the suggestions
    to use XML rather than to use tags and my suggested new tag types and plane
    14 vector graphics codes unfounded?

    1. Previously I wrote as follows.


    Suppose that there is a plain text document written in Cyrillic script. If
    at the start of that document there is a U+E0001 character then some tag
    characters indicating the language and then a U+E0002 character and then the
    characters U+E0036 U+E0030 U+E0038 then someone could look at the document
    using a suitable computer system and find out from the few plane 14
    characters at the start of the document in which particular language the
    document is written and also that the general topic area of the document is
    inventions and patents, since 608 is the Dewey Decimal Classification for
    inventions and patents. However, in an ordinary document
    viewing package, the tags would not be displayed, so they would not get in
    the way.

    end quote
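
    The scheme in the quoted paragraph can be sketched as follows. This is
    only an illustration in Python under stated assumptions: U+E0001 is the
    existing LANGUAGE TAG character, whereas the use of U+E0002 as a
    classification introducer is the suggestion made in the posting and is
    not an assigned Unicode character. The function names are invented here.

```python
LANGUAGE_TAG = 0xE0001         # LANGUAGE TAG introducer in plane 14
CLASSIFICATION_TAG = 0xE0002   # hypothetical: suggested in the posting, not assigned
TAG_OFFSET = 0xE0000           # tag characters U+E0020..U+E007E mirror ASCII 0x20..0x7E


def to_tag_chars(ascii_text):
    """Map printable ASCII characters onto the matching plane 14 tag characters."""
    for ch in ascii_text:
        if not 0x20 <= ord(ch) <= 0x7E:
            raise ValueError("only printable ASCII can be expressed as tag characters")
    return "".join(chr(TAG_OFFSET + ord(ch)) for ch in ascii_text)


def add_header(body, language, dewey):
    """Prefix the plain text body with a language tag and a Dewey class tag."""
    return (chr(LANGUAGE_TAG) + to_tag_chars(language)
            + chr(CLASSIFICATION_TAG) + to_tag_chars(dewey)
            + body)


def read_header(text):
    """Return (fields, visible_text).

    fields maps each introducer code point found at the start of the text to
    the ASCII string spelled out by the tag characters that follow it.
    """
    fields = {}
    current = None
    pos = 0
    while pos < len(text):
        cp = ord(text[pos])
        if cp in (LANGUAGE_TAG, CLASSIFICATION_TAG):
            current = cp
            fields[current] = ""
        elif 0xE0020 <= cp <= 0xE007E and current is not None:
            fields[current] += chr(cp - TAG_OFFSET)
        else:
            break  # first ordinary character: the header is over
        pos += 1
    return fields, text[pos:]


document = add_header("Изобретения и патенты", "ru", "608")
fields, visible = read_header(document)
print(fields.get(LANGUAGE_TAG), fields.get(CLASSIFICATION_TAG), visible)
```

    A viewer that knows nothing of the tags would simply display the text
    returned as visible, which is the behaviour the quoted paragraph relies
    upon.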

    How would that be done using XML? Would it be done better using XML than
    using tags? Why, or why not?

    2. Previously I wrote as follows.


    My suggestion for U+E0004 could be very useful. Suppose that the haiku
    which I included at the end of the document had an International Literary
    Work Number, if such a system of International Literary Work Numbers comes
    into existence in the future. I could produce a plain text file which
    starts with U+E0004 and a number of tag characters and then the text of the
    haiku. I could place that file somewhere on the web. Search engines might
    log it. If then someone is writing an article about the topic of poetry and
    Unicode, then he or she might refer to that haiku and include a tag encoded
    reference to it, using its International Literary Work Number. A reader of
    that document could decide to have a look at the text and could then search
    the internet for the text of the haiku, knowing that the search is made
    easier due to the fact that the International Literary Work Number is unique
    to that haiku, whereas searching for Phaistos Disc might not find it at all,
    or might find it as but one of many search engine matches for the term
    Phaistos Disc.

    end quote

    Please suppose, for the purposes of this discussion, that an International
    Literary Work Number is expressed as 15 digits followed by a full stop
    followed by 5 digits. (A real world implementation might add a space and a
    check digit in the manner of International Standard Book Numbers, but the 21
    character model will be adequate for this discussion, the idea here being
    that anyone may obtain an ILWN 15 digit code from a web site which has a
    database facility by choosing any 15 digit number not starting with a 0
    character which number has not already been chosen by someone else, then
    that person may allocate the 5 digits after the full stop as he or she

    How would that be done using XML? Would it be done better using XML than
    using tags? Why, or why not?
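
    For concreteness, the 21 character model described above (15 digits not
    starting with 0, a full stop, then 5 digits) could be checked and encoded
    as tag characters along the following lines. This is a sketch in Python
    under assumptions: U+E0004 as the introducer is the suggestion from the
    posting and is not an assigned Unicode character, and the function names
    and the sample number are invented for illustration.

```python
import re

ILWN_TAG = "\U000E0004"   # hypothetical: suggested in the posting, not assigned
TAG_OFFSET = 0xE0000      # tag characters U+E0020..U+E007E mirror ASCII 0x20..0x7E

# 15 digits with no leading zero, a full stop, then 5 digits: 21 characters.
ILWN_PATTERN = re.compile(r"[1-9][0-9]{14}\.[0-9]{5}")


def is_valid_ilwn(number):
    """Check a candidate against the 21 character model."""
    return ILWN_PATTERN.fullmatch(number) is not None


def encode_ilwn(number):
    """Return the introducer followed by the number spelled in tag characters."""
    if not is_valid_ilwn(number):
        raise ValueError("not a valid ILWN under the 21 character model")
    return ILWN_TAG + "".join(chr(TAG_OFFSET + ord(ch)) for ch in number)


def extract_ilwn(text):
    """Return the ILWN found at the start of the text, or None."""
    if not text.startswith(ILWN_TAG):
        return None
    candidate = "".join(
        chr(ord(ch) - TAG_OFFSET)
        for ch in text[1:22]
        if 0xE0020 <= ord(ch) <= 0xE007E
    )
    return candidate if is_valid_ilwn(candidate) else None


sample = "123456789012345.00001"   # invented number, for illustration only
tagged_file = encode_ilwn(sample) + "text of the haiku goes here"
print(extract_ilwn(tagged_file))
```

    A search tool could then match on the extracted number alone, without
    depending on any words that happen to occur in the work itself.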

    3. Previously I wrote as follows.


    Looking further at the matter of plane 14, I am wondering whether there is
    scope for the eventual production of a vector graphics system to be encoded
    in plane 14. I have had some good success with my eutocode graphics system
    which is produced using codes from the Private Use Area.



    Eutocode graphics uses 10 bit data input. If a system in plane 14 were
    produced, then 12 bit data input could be used, perhaps using all of the
    codes U+E2000 through to U+E2FFF for data input. Some of the codes in the
    range U+E1000 through to U+E1FFF could be used for control codes for the
    system, though not that many of them. At its present stage of development
    eutocode graphics uses only a few codes for control, all of them within the
    range U+EB00 through to U+EBFF of the Private Use Area.

    end quote
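
    The 12 bit data input suggested in the quoted paragraph amounts to a
    simple offset mapping, sketched here in Python. The range U+E2000 through
    to U+E2FFF is the one suggested in the posting and is not an actual
    allocation; what each data value means to the graphics system is not
    modelled here, and the function names are invented for illustration.

```python
DATA_BASE = 0xE2000   # suggested start of the data range (assumption from the posting)
DATA_BITS = 12        # 12 bit data input gives values 0..4095
DATA_LIMIT = 1 << DATA_BITS


def encode_values(values):
    """Turn a sequence of 12 bit integers into a string of data characters."""
    out = []
    for value in values:
        if not 0 <= value < DATA_LIMIT:
            raise ValueError("value does not fit in 12 bits: %r" % (value,))
        out.append(chr(DATA_BASE + value))
    return "".join(out)


def decode_values(text):
    """Recover the 12 bit integers, skipping any character outside the data range."""
    return [
        ord(ch) - DATA_BASE
        for ch in text
        if DATA_BASE <= ord(ch) < DATA_BASE + DATA_LIMIT
    ]


encoded = encode_values([0, 1023, 4095])
print([hex(ord(ch)) for ch in encoded], decode_values(encoded))
```

    Because characters outside the data range are skipped on decoding, such
    data could sit inside ordinary text, as described for embedded graphics.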

    Please consider the graphic Winter Night in the second of the above named
    web pages. This is a vector graphic. The Winter Night graphic can be a
    stand alone graphic in a file or it can be embedded within a text file if
    one is using a Java program in an interactive television system to process
    text files to produce displays.

    How would that be done using XML? Would it be done better using XML than
    using tags? Why, or why not?

    Your comments would be appreciated. I recognize that this posting is
    lengthy and that discussing the matter thoroughly may take considerable
    time and effort. However, as the Unicode Technical Committee is heading
    towards a decision which may have long-lasting and widespread effects
    upon the way in which computing develops, I feel that it is important that
    the matter is discussed fully and thoroughly. Plane 14 could become a
    formally standardized area of futuristic development for the 21st Century
    and beyond. I feel that that opportunity for progress should not be blocked
    off now by a committee making a decision which prevents opportunities for
    technological progress.

    William Overington

    Monday 17 February 2003
