RE: In defense of Plane 14 language tags (long)

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Tue Nov 05 2002 - 06:14:59 EST

    Doug Ewell wrote:
    > [...]
    > Readers are asked to consider the following arguments individually, so
    > that any particular argument that seems untenable or contrary to
    > consensus does not affect the validity of other arguments.
    > [...]

    Here are my three pence *pro* the deprecation:

    > 1. Language tags may be useful for display issues.
    >
    > The most commonly suggested use, and the original impetus,
    > for Plane 14 language tags is to suggest to the display
    > subsystem that “Chinese-style” or “Japanese-style” glyphs
    > are preferred for unified Han characters. [...]

    IMHO, there has never been any practical need to consider these glyphic
    differences in plain text. It is a non-issue raised to the rank of issue
    because of obscure political reasons.

    It is false that Japanese is unreadable if displayed with Chinese-style
    glyphs, or that Polish is unreadable if displayed with Spanish-style acute
    accents.

    It is true that any language looks odd if displayed with an improper font,
    and that these esthetic issues must be properly addressed in "rich text" and
    in decent typography.

    But such a level of graphical correctness does not apply to plain text: if
    it did, we would also have to rule out many other typographic
    simplifications which are in current use, such as fixed-width fonts for
    Western scripts, fixed-height fonts for the Arabic script, horizontal display
    of Japanese, etc.

    > 2. Language tags may be useful for non-display issues.
    >
    > Although not frequently mentioned, plain-text language tagging could
    > also be useful for applications such as speech synthesis,
    > spell-checking, and grammar checking. [...]

    These kinds of applications cannot rely on the presence of any kind of
    language tagging because, in most real-world cases, it will not be present.

    { As a side note, the idea that a language may use "foreign" words seems
    terribly naive to me. It is true that, in Italian, we use loanwords such as
    "hardware", "punk", or "footing", but it would be silly to consider or tag
    them as "English words". They are genuinely Italian words, as demonstrated
    by the fact that their pronunciation is very different from the English
    (['ɑrdwer(e)], ['pɑŋk(e)] and ['futiŋg(e)], respectively), that their
    morphology is different (e.g., plural is invariable), and that their meaning
    is slightly different ("hardware" only refers to computers, "punk" only
    refers to music and fashion), or even totally different from the English
    original ("footing" means "jogging"). }

    > 3. Conflict with HTML/XML tags need not be a problem.
    >
    > A common criticism of the Plane 14 language tags is that higher-level
    > protocols such as HTML and XML already provide a mechanism
    > for language tagging. There is a concern that the language specified
    > by the “lang” attribute in HTML or “xml:lang” attribute in XML could
    > conflict with the one specified in a Plane 14 language tag, [...]

    As I see it, the problem is not merely that the two kinds of tags may
    specify different languages. That would not be a real conflict. It is
    perfectly legitimate to embed language tags into each other: the rule is
    that the inner language tag wins. This general rule can be extended to
    accommodate plain text tags: they will always take precedence, as they
    clearly are the innermost specification.
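
    The "inner tag wins" rule can be sketched as follows. This is only an
    illustration using Python's standard xml.etree.ElementTree; the function
    name text_runs_with_language and the sample document are made up for the
    example, and plain text tags are not handled here.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def text_runs_with_language(element, inherited=None):
    """Yield (language, text) pairs; the innermost xml:lang in scope wins."""
    lang = element.get(XML_LANG, inherited)
    if element.text and element.text.strip():
        yield lang, element.text.strip()
    for child in element:
        yield from text_runs_with_language(child, lang)
        # Text after a child's end tag belongs to the enclosing element.
        if child.tail and child.tail.strip():
            yield lang, child.tail.strip()

doc = ET.fromstring('<p xml:lang="en">hello <q xml:lang="fr">bonjour</q> world</p>')
runs = list(text_runs_with_language(doc))
print(runs)
```

    The text inside the inner element is reported as French, while the text
    before and after it falls back to the English of the enclosing element.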

    The real problem is with *overlapping* and *unpaired* tags. XML parsers have
    built-in validation of the tree structure of a document, which ensures that
    all tags are properly opened, closed and nested within each other. E.g.,
    overlapping spans like:

            <x lang="en"> ABC <y lang="fr"> DEF </x> GHI </y>

    would not pass validation because the English and French spans overlap
    irregularly (as do tags <x> and <y>).

    But that built-in validation cannot properly detect a situation like:

            <x lang="en"> ABC \uE0001 \uE0066 \uE0072 DEF </x> GHI \uE007F

    where the English span (specified in tag <x>) overlaps with the French span
    (specified with plain text tags).
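
    The difference between the two cases can be checked with any XML parser.
    The sketch below uses Python's standard xml.etree.ElementTree; the helper
    is_well_formed and the wrapping <doc> element are assumptions made for the
    example, and the tag characters are written without the spaces used above
    for readability.

```python
import xml.etree.ElementTree as ET

# Plane 14 characters: LANGUAGE TAG, TAG LATIN SMALL LETTER F and R, CANCEL TAG.
LANG_TAG, TAG_F, TAG_R, CANCEL_TAG = "\U000E0001", "\U000E0066", "\U000E0072", "\U000E007F"

def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment, wrapped in a root element, parses as XML."""
    try:
        ET.fromstring("<doc>" + fragment + "</doc>")
        return True
    except ET.ParseError:
        return False

markup_overlap = '<x lang="en"> ABC <y lang="fr"> DEF </x> GHI </y>'
plane14_overlap = (
    '<x lang="en"> ABC ' + LANG_TAG + TAG_F + TAG_R + " DEF </x> GHI " + CANCEL_TAG
)

print(is_well_formed(markup_overlap))   # rejected: <x> and <y> are mismatched
print(is_well_formed(plane14_overlap))  # accepted: tag characters are ordinary text
```

    To the parser, the Plane 14 characters are just character data, so the
    overlap between the two language spans goes unnoticed.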

    Simply ignoring plain text tags is no solution, because this would
    waste part of the information (and the author's effort to provide this
    information).
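
    To keep that information, a process would have to decode the tags rather
    than drop them. A minimal sketch, assuming a simplified reading of the tag
    syntax (LANGUAGE TAG starts a tag, CANCEL TAG returns to the default, and
    tag characters mirror ASCII at offset 0xE0000); the function name
    split_language_runs and the default parameter are hypothetical.

```python
TAG_BASE = 0xE0000      # tag characters mirror ASCII at this offset
LANGUAGE_TAG = 0xE0001
CANCEL_TAG = 0xE007F

def split_language_runs(text, default=None):
    """Return (language, visible_text) runs from text carrying Plane 14 tags."""
    runs, visible = [], []
    lang, pending = default, None   # pending: tag being read, or None
    for ch in text:
        cp = ord(ch)
        if TAG_BASE <= cp <= CANCEL_TAG:        # any character of the tag block
            if visible:                          # close the current visible run
                runs.append((lang, "".join(visible)))
                visible = []
            if cp == LANGUAGE_TAG:
                pending = ""
            elif cp == CANCEL_TAG:
                lang, pending = default, None
            elif pending is not None:
                pending += chr(cp - TAG_BASE)    # map back to ASCII
            continue
        if pending is not None:                  # first visible char ends the tag
            lang, pending = pending, None
        visible.append(ch)
    if visible:
        runs.append((lang, "".join(visible)))
    return runs

sample = "ABC \U000E0001\U000E0066\U000E0072 DEF GHI \U000E007F JKL"
print(split_language_runs(sample, default="en"))
```

    Even with the runs recovered this way, reconciling them with the element
    structure of an enclosing XML document remains an open problem.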

    > 6. Plane 14 tags are easy to filter out, and harmless if not
    > interpreted.

    If they are not processed correctly or filtered out, they are by no means
    harmless.

    If they are rendered as visible glyphs (such as [LNG][f][r]) or with
    "missing glyph" boxes, they clutter the text, making it less readable --
    i.e., they worsen the very problem that they were supposed to solve.

    If they are rendered as invisible glyphs, they make the text more difficult
    to edit and to navigate with the cursor, because the user will have no way of
    understanding why the cursor stops twice in apparently random positions.
    This also exposes the information contained in language tags to
    accidental corruption by subsequent editing.
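
    A small sketch of both sides of this point: filtering the tag block is
    indeed a one-line operation, but an unfiltered string carries extra code
    points that a naive editor may treat as cursor positions. The names
    TAG_CHARS, tagged and visible are made up for the example.

```python
import re

# Every code point in the Plane 14 tag block, U+E0000..U+E007F.
TAG_CHARS = re.compile("[\U000E0000-\U000E007F]")

# "ABC", then LANGUAGE TAG + tag letters "f" and "r", then "DEF".
tagged = "ABC\U000E0001\U000E0066\U000E0072DEF"
visible = TAG_CHARS.sub("", tagged)

print(visible)       # ABCDEF
print(len(visible))  # 6
print(len(tagged))   # 9
```

    The three extra code points are invisible on screen, yet deleting any one
    of them during editing silently changes or breaks the language tag.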

    _ Marco


