Re: Plane 14 Tag Deprecation Issue

From: William Overington (WOverington@ngo.globalnet.co.uk)
Date: Sat Feb 15 2003 - 07:00:49 EST

    I was interested to read the comment by Rick McGowan.

    Thank you for your note. I found the MARC system described at the following
    place on the web.

    http://www.loc.gov/marc/

    It is very interesting and I have started to read about it.

    I looked back at what I had written and found the following.

    quote

    Books in libraries are often classified with a code consisting of digits and
    a full stop character. For example, the number 515.53 is on a label which
    is still on the spine of a book which I bought in a sale of withdrawn books
    from a library. So, if U+E0002 were used to introduce a tag for the library
    book classification code, then a sequence starting with U+E0002 and using
    some other tag characters could be used to classify the subject matter of
    any document which is stored in computerized form.

    end quote

    I also found the following about the Dewey Decimal Classification system.

    http://www.oclc.org/dewey/about/

    I realize, on rereading what I wrote in the light of the comment by Rick,
    that I may well not have expressed my meaning clearly.

    My intention was to convey the meaning of the type of use as in the
    following example.

    Suppose that there is a plain text document written in Cyrillic script.
    Suppose also that the document starts with a U+E0001 character, then some
    tag characters indicating the language, then a U+E0002 character, and then
    the characters U+E0036 U+E0030 U+E0038. Someone could then look at the
    document using a suitable computer system and find out, from the few plane
    14 characters at the start, in which particular language the document is
    written and also that its general topic area is inventions and patents,
    because 608 is the Dewey Decimal Classification for inventions and patents.
    However, in an ordinary document viewing package the tags would not be
    displayed, so they would not get in the way.
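
    The example above can be sketched in Python. This is only an illustration
    under stated assumptions: U+E0001 is the existing LANGUAGE TAG character and
    the tag characters U+E0020 to U+E007E mirror printable ASCII, but the use of
    U+E0002 as a classification introducer is the suggestion made in this
    message and is not defined by Unicode. The function names are invented for
    the sketch.

```python
# Sketch only. U+E0002 as a "classification" introducer is the proposal
# discussed in this message; it is NOT a defined Unicode tag type.
TAG_BASE = 0xE0000
LANGUAGE_TAG = "\U000E0001"        # existing LANGUAGE TAG character
CLASSIFICATION_TAG = "\U000E0002"  # hypothetical introducer (assumption)

INTRODUCERS = {
    LANGUAGE_TAG: "language",
    CLASSIFICATION_TAG: "classification",
}


def to_tag_chars(ascii_text):
    """Map printable ASCII (U+0020..U+007E) onto U+E0020..U+E007E."""
    for ch in ascii_text:
        if not 0x20 <= ord(ch) <= 0x7E:
            raise ValueError(f"not printable ASCII: {ch!r}")
    return "".join(chr(TAG_BASE + ord(ch)) for ch in ascii_text)


def add_prefix(text, language, dewey):
    """Prepend a language tag and a (hypothetical) classification tag."""
    return (LANGUAGE_TAG + to_tag_chars(language)
            + CLASSIFICATION_TAG + to_tag_chars(dewey)
            + text)


def read_prefix(document):
    """Split leading tag sequences from the visible text.

    Returns (fields, visible_text), where fields maps an introducer name
    to the ASCII payload that followed it.
    """
    fields = {}
    current = None
    i = 0
    while i < len(document):
        ch = document[i]
        if ch in INTRODUCERS:
            current = INTRODUCERS[ch]
            fields[current] = ""
        elif current is not None and 0xE0020 <= ord(ch) <= 0xE007E:
            fields[current] += chr(ord(ch) - TAG_BASE)
        else:
            break  # first ordinary character: the prefix has ended
        i += 1
    return fields, document[i:]


if __name__ == "__main__":
    doc = add_prefix("Привет", "ru", "608")
    fields, visible = read_prefix(doc)
    print(fields, visible)
```

    A viewer that ignores plane 14 characters would show only the visible
    text, while a cataloguing tool could read the language and the Dewey code
    from the prefix. A code such as 515.53 would work in the same way, since
    the full stop has a tag counterpart at U+E002E.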

    My suggestion about using International Standard Book Numbers with a tag
    type code, which could perhaps be U+E0003, perhaps needs looking at further.
    Does the tag code mean "This is the start of the text of the book with the
    following ISBN" or does it mean "Here is a reference to an ISBN for a
    book to which I am referring"? Could the two meanings be distinguished,
    perhaps by putting a tag R after the U+E0003 and in front of the tag digits
    for a reference to a book, and by not using a tag R if the tag is at the
    start of the text of the electronic book itself? Or in some other way?
    There are possibilities for progress here, provided that tags are
    continued, on the basis of being reserved for use in particular protocols,
    and provided that the Unicode Technical Committee is willing to consider
    the defining of additional tag types at some time in the future.
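
    One way of distinguishing the two meanings with a tag R can be sketched as
    follows. Everything here is an assumption made for illustration: U+E0003 is
    not a defined tag type, the convention of a tag R is only the possibility
    raised above, and the ISBN value used in the example is a made-up sample.

```python
# Sketch only. U+E0003 and the "tag R means reference" convention are
# hypothetical; neither is defined by Unicode.
TAG_BASE = 0xE0000
ISBN_TAG = "\U000E0003"   # hypothetical introducer (assumption)
REFERENCE_MARK = "R"      # encoded as tag R, U+E0052


def to_tag_chars(ascii_text):
    """Map printable ASCII onto the tag characters U+E0020..U+E007E."""
    return "".join(chr(TAG_BASE + ord(ch)) for ch in ascii_text)


def make_isbn_tag(isbn, is_reference):
    """Build the tag sequence, with a tag R only for references."""
    payload = (REFERENCE_MARK if is_reference else "") + isbn
    return ISBN_TAG + to_tag_chars(payload)


def parse_isbn_tag(sequence):
    """Return ("reference", isbn) or ("identity", isbn)."""
    if not sequence.startswith(ISBN_TAG):
        raise ValueError("sequence does not start with the ISBN introducer")
    payload = "".join(chr(ord(ch) - TAG_BASE) for ch in sequence[1:])
    if payload.startswith(REFERENCE_MARK):
        return ("reference", payload[1:])
    return ("identity", payload)


if __name__ == "__main__":
    sample = "0123456789"  # made-up sample value
    print(parse_isbn_tag(make_isbn_tag(sample, is_reference=True)))
    print(parse_isbn_tag(make_isbn_tag(sample, is_reference=False)))
```

    Because an ISBN consists of digits, with at most an X as a check
    character, a leading tag R cannot be confused with the number itself.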

    My suggestion for U+E0004 could be very useful. Suppose that the haiku
    which I included at the end of the document had an International Literary
    Work Number, if such a system of International Literary Work Numbers comes
    into existence in the future. I could produce a plain text file which
    starts with U+E0004 and a number of tag characters and then the text of the
    haiku. I could place that file somewhere on the web. Search engines might
    log it. If then someone is writing an article about the topic of poetry and
    Unicode, then he or she might refer to that haiku and include a tag encoded
    reference to it, using its International Literary Work Number. A reader of
    that document could decide to have a look at the text and could then search
    the internet for the text of the haiku, knowing that the search is made
    easier due to the fact that the International Literary Work Number is unique
    to that haiku, whereas searching for Phaistos Disc might not find it at all,
    or might find it as but one of many search engine matches for the term
    Phaistos Disc.
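
    A tool that collects such tag encoded references from the text of an
    article might look like the sketch below. It assumes that U+E0004 introduces
    the number and that the number is written in tag characters; no system of
    International Literary Work Numbers exists, so the value "12345" is purely
    a placeholder.

```python
import re

# Sketch only. U+E0004 as a "literary work number" introducer is the
# suggestion made in this message, not a defined Unicode tag type.
TAG_BASE = 0xE0000
WORK_NUMBER_TAG = "\U000E0004"  # hypothetical introducer (assumption)

# The introducer followed by one or more tag characters U+E0020..U+E007E.
_WORK_NUMBER = re.compile("\U000E0004([\U000E0020-\U000E007E]+)")


def to_tag_chars(ascii_text):
    """Map printable ASCII onto the tag characters U+E0020..U+E007E."""
    return "".join(chr(TAG_BASE + ord(ch)) for ch in ascii_text)


def find_work_numbers(text):
    """Return every tag encoded work number in text, decoded to ASCII."""
    return [
        "".join(chr(ord(ch) - TAG_BASE) for ch in match)
        for match in _WORK_NUMBER.findall(text)
    ]


if __name__ == "__main__":
    article = ("A haiku about the Phaistos Disc"
               + WORK_NUMBER_TAG + to_tag_chars("12345")
               + " shows what plain text can carry.")
    print(find_work_numbers(article))
```

    The decoded number could then be given to a search engine as an exact
    term, rather than relying on the words Phaistos Disc alone.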

    All of these things and maybe many more will be possible if tag characters
    are not fully deprecated and the possibility of defining more tag character
    types exists.

    In my posting I wrote the following.

    quote

    Perhaps all of plane 14 needs to be declared an area considered as
    deprecated in general terms, yet where codes for use with particular
    protocols can be defined by the Unicode Technical Committee, so that the
    potential for using such futuristic developments and encoding them within
    the Unicode framework is preserved?

    end quote

    I feel that that is the way forward. In some ways it would be a compromise,
    yet it is more than a compromise: it is a far-reaching, forward-looking
    policy option which would protect the present mainstream use of Unicode
    whilst also providing for futuristic possibilities within the context of
    conveying information in Unicode compatible files in a precise,
    formally-defined manner. At present, characters are either regular Unicode
    codes or Private Use Area codes. This could be changed so that characters
    are regular Unicode codes, reserved Unicode codes or Private Use Area
    codes, with the reserved Unicode codes all being in plane 14.

    William Overington

    15 February 2003


