Re: In defense of Plane 14 language tags (long)

From: Otto Stolz (Otto.Stolz@uni-konstanz.de)
Date: Mon Nov 04 2002 - 12:21:17 EST

    Doug Ewell wrote:

    > 1. Language tags may be useful for display issues.
    ...

    > For example, it is often said that Japanese
    > users prefer “Japanese-style” glyphs universally, even for Chinese text.
    >
    > The Plane 14 tagging approach is not perfect, but it is sufficient to
    > solve this problem. Japanese users who prefer “Japanese-style” glyphs
    > universally can tag all Han text as “ja”, which may be linguistically
    > wrong but achieves the desired effect. Users who want Chinese glyphs
    > for Chinese-language text and Japanese glyphs for Japanese-language text
    > can tag the former as “zh” and the latter as “ja” as they see fit.

    The "user" viewing the text (and preferring "Japanese-style" glyphs)
    may be a different person from the "user" authoring the text (and
    inserting the Plane 14 tags); in fact, the user viewing the text may
    not be able to modify the Plane 14 tags, or may not even be aware of
    them.

    I think this argument should be reworded on the basis of a clear
    distinction between the various "users".

    > Other scripts besides Han can benefit from plain-text language tagging
    > as well. A common Latin-script example

    ...

    A common Cyrillic example is the difference between the italic forms
    used for, e.g., Russian and Serbian, cf. "Rendering Serbian italics"
    (used to be at <http://www.tiro.com/transfer/Serbian_Rendering.pdf> --
    John, can we have it back?).

    Other examples include differing cursive (handwriting) forms: e.g.,
    a handwritten UK "I" is perceived as a "T" by most Germans; the
    Russian-Serbian contrast mentioned above also exists in cursive.

    > 2. Language tags may be useful for non-display issues.
    ...

    > 3. Conflict with HTML/XML tags need not be a problem.
    ...

    > The potential disruption caused by this scenario is probably overstated.
    > Almost every HTML file ever created contains at least one plain-text
    > line separator (CR and/or LF) and at least one HTML-style line separator
    > (<p> and/or <br>). Which to follow? The HTML specification very
    > clearly states that the higher-level protocol takes precedence in this
    > case (unless <pre>preformatted text</pre> is explicitly indicated). The
    > same could be said for the interaction between Plane 14 language tags
    > and HTML language tags.

    Another possibility is to define a clear rule for their mutual
    interaction.

    Precedents to follow are

    - interaction between Unicode formatting characters, such as U+200E,
       U+200F, and U+202A through U+202E, and HTML tagging, such as
       the Dir attribute and the Bdo element (cf.
       <http://www.w3.org/TR/html401/struct/dirlang.html#h-8.2>),

    - interaction between HTTP header fields and the HTML Meta element,
       e.g., the HTTP Content-Type field, including its charset parameter,
       cf. <http://www.w3.org/TR/html401/charset.html#h-5.2.2>.
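
    To illustrate the second precedent: section 5.2.2 of HTML 4.01 ranks
    the possible sources of a document's character encoding, with the HTTP
    charset parameter first, a META declaration second, and a charset
    attribute on an element designating an external resource third. A
    minimal Python sketch of such a "first declared source wins" rule
    follows; the function and parameter names are hypothetical, and
    returning None when nothing is declared is an assumption of this
    sketch, not something the specification prescribes.

```python
def effective_charset(http_charset=None, meta_charset=None, link_charset=None):
    """Pick the charset from the highest-priority source that declares one.

    Priority order, highest first (as summarized in HTML 4.01, 5.2.2):
      1. charset parameter of the HTTP Content-Type field
      2. META declaration with http-equiv="Content-Type"
      3. charset attribute on an element designating an external resource
    Returns None if no source declares a charset (assumption of this sketch).
    """
    for candidate in (http_charset, meta_charset, link_charset):
        if candidate:
            # Charset names are case-insensitive; normalize for comparison.
            return candidate.strip().lower()
    return None
```

    A comparable rule for language information would rank the HTML lang
    attribute against Plane 14 tags in the same way; which of the two
    should win is exactly what such a rule would have to state.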

    > 4. The original need for language tags has not disappeared.

    ...

    > 5. “Statefulness” disadvantage is exaggerated.
    ...

    > 6. Plane 14 tags are easy to filter out, and harmless if not
    > interpreted.

    ...

    > Tags [...] do not affect searching,

    There are indeed situations where language tags would affect searching
    if not handled properly.
    Example: In my German WWW pages, I take pains to tag all English terms
    in the hope of helping speech synthesizers, or other clients that depend
    on correct identification of the language. Now, German attaches prefixes
    and suffixes to word stems, and also tends to form compounds.
    Of course, I have to confine my LANG=EN span to the English word proper.
    This leads to monsters such as
       <span lang="en">E-Mail</span>-Adresse
       <span lang="en">Mailing</span>listen
       ... aus den <span lang="en">Received-Header</span>n ...

    A search engine should remove these tags before comparing a search
    argument to this sort of text. For perfect results, this normalization
    should be applied to HTML tags and Unicode tags alike. (I fear that
    Google is not that smart, but I haven't tested it.)
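
    The normalization described above could be sketched in Python roughly
    as follows. This is only an illustration under stated assumptions: the
    function names are made up for this example, the HTML matching is a
    naive regular expression (it ignores comments, scripts, and ">" inside
    attribute values), and the Plane 14 range removed is the whole tag
    block U+E0000 through U+E007F.

```python
import re

# Plane 14 tag block: U+E0001 LANGUAGE TAG, the tag characters
# U+E0020..U+E007E (which mirror ASCII), and U+E007F CANCEL TAG.
# For simplicity the whole block U+E0000..U+E007F is stripped.
PLANE14_TAGS = re.compile("[\U000E0000-\U000E007F]")

# Naive HTML tag matcher (assumption: no ">" inside attribute values,
# no comments or script content to worry about).
HTML_TAGS = re.compile(r"<[^>]*>")


def strip_language_markup(text):
    """Remove HTML tags and Plane 14 tag characters, keeping visible text."""
    return PLANE14_TAGS.sub("", HTML_TAGS.sub("", text))


def plane14_language_tag(lang):
    """Encode an ASCII language code as U+E0001 followed by tag characters."""
    return "\U000E0001" + "".join(chr(0xE0000 + ord(c)) for c in lang)


def matches(query, text):
    """Case-insensitive substring search on the normalized text."""
    return query.casefold() in strip_language_markup(text).casefold()
```

    With this, the search argument "Mailinglisten" is found both in
    '<span lang="en">Mailing</span>listen' and in the same word carrying a
    Plane 14 "en" tag before "Mailing" and a CANCEL TAG after it.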

    So the correct argument for Doug's issue #6 is probably:
    Plane 14 tags do not affect searching any more than high-level tags do.

    > 7. Rapid deprecation creates an image of instability.
    ...

    > 8. Other, as yet uninvented tags would be implicitly deprecated.
    ...

    Best wishes,
       Otto Stolz


