From: Otto Stolz (Otto.Stolz@uni-konstanz.de)
Date: Mon Nov 04 2002 - 12:21:17 EST
Doug Ewell wrote:
> 1. Language tags may be useful for display issues.
> For example, it is often said that Japanese
> users prefer “Japanese-style” glyphs universally, even for Chinese text.
> The Plane 14 tagging approach is not perfect, but it is sufficient to
> solve this problem. Japanese users who prefer “Japanese-style” glyphs
> universally can tag all Han text as “ja”, which may be linguistically
> wrong but achieves the desired effect. Users who want Chinese glyphs
> for Chinese-language text and Japanese glyphs for Japanese-language text
> can tag the former as “zh” and the latter as “ja” as they see fit.
The "user" viewing the text (and preferring 'Japanese-style' glyphs)
may be a different person from the "user" authoring the text (and inserting
the Plane 14 tags); in fact, the user viewing the text may not be able
to modify the Plane 14 tags, or may not even be aware of them.
I think this argument should be reworded, based on a clear distinction
between the various "users".
> Other scripts besides Han can benefit from plain-text language tagging
> as well. A common Latin-script example
A common Cyrillic example is the difference in the italic forms for,
e. g., Russian and Serbian, cf. "Rendering Serbian italics" (used to
be at <http://www.tiro.com/transfer/Serbian_Rendering.pdf> -- John,
can we have it back?).
Other examples include the differing cursive (handwriting) forms,
e. g., a UK "I" is perceived as a "T" by most Germans; the
Russian-Serbian contrast mentioned above also exists in cursive.
> 2. Language tags may be useful for non-display issues.
> 3. Conflict with HTML/XML tags need not be a problem.
> The potential disruption caused by this scenario is probably overstated.
> Almost every HTML file ever created contains at least one plain-text
> line separator (CR and/or LF) and at least one HTML-style line separator
> (<p> and/or <br>). Which to follow? The HTML specification very
> clearly states that the higher-level protocol takes precedence in this
> case (unless <pre>preformatted text</pre> is explicitly indicated). The
> same could be said for the interaction between Plane 14 language tags
> and HTML language tags.
Other possibilities include a clear rule about their mutual interaction.
Paradigms to follow are
- interaction between Unicode formatting characters, such as U+200E,
U+200F, and U+202A through U+202E, and HTML tagging, such as
the Dir attribute and the Bdo element (cf.
- interaction between HTTP arguments and the HTML Meta tag, e. g.,
the HTTP Content-Type, including its charset attribute,
> 4. The original need for language tags has not disappeared.
> 5. “Statefulness” disadvantage is exaggerated.
> 6. Plane 14 tags are easy to filter out, and harmless if not
> Tags [...] do not affect searching,
There are indeed situations where language tags would affect searching,
if not handled properly.
Example: In my German WWW pages, I take pains to tag all English terms
in the hope of helping speech synthesizers, or other clients that depend
on the correct identification of the language. Now, German attaches
prefixes and suffixes to word stems, and also tends to form compounds.
Of course, I have to confine my LANG=EN span to the English word proper.
This leads to monsters such as
... aus den <span lang="en">Received-Header</span>n ...
A search engine should remove these tags before comparing a search argument
to this sort of text. For perfect results, this normalization should be
applied to HTML tags and Unicode tags alike. (I fear that Google is not
that smart, but I haven't tested it.)
So the correct argument for Doug's issue #6 is:
Plane 14 tags do not affect searching any more than high-level tags do.
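
To make the normalization step concrete, here is a minimal sketch in Python.
It is only an illustration under stated assumptions, not a description of how
any real search engine works: the function names are made up for this example,
the HTML stripping is a deliberately naive regular expression rather than a
real parser, and the plain-text variant assumes that the author switches back
to "de" with a second language tag after the English word.

```python
import re

LANGUAGE_TAG = "\U000E0001"  # U+E0001 LANGUAGE TAG

# The whole Tags block, U+E0000 through U+E007F.
PLANE14_TAGS = re.compile("[\U000E0000-\U000E007F]")

# Naive assumption: anything between '<' and '>' is an HTML tag.
HTML_TAGS = re.compile(r"<[^>]*>")


def plane14_language_tag(lang: str) -> str:
    """Encode an ASCII language code such as 'en' as a Plane 14 tag sequence.

    Each ASCII character maps to the tag character at U+E0000 plus its code.
    (Hypothetical helper, used here only to build test input.)
    """
    return LANGUAGE_TAG + "".join(chr(0xE0000 + ord(c)) for c in lang)


def normalize_for_search(text: str) -> str:
    """Remove HTML tags and Plane 14 tag characters, keeping the visible text."""
    return PLANE14_TAGS.sub("", HTML_TAGS.sub("", text))


# The same German fragment, tagged once with HTML and once with Plane 14 tags.
html_version = 'aus den <span lang="en">Received-Header</span>n'
plain_version = (
    "aus den "
    + plane14_language_tag("en")
    + "Received-Header"
    + plane14_language_tag("de")
    + "n"
)

print(normalize_for_search(html_version))
print(normalize_for_search(plain_version) == normalize_for_search(html_version))
```

With both kinds of tags stripped, the compound is matched as one word,
whichever tagging scheme the author used.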
> 7. Rapid deprecation creates an image of instability.
> 8. Other, as yet uninvented tags would be implicitly deprecated.