Re: Embedded language ID pr

From: Mark Leisher (mleisher@crl.nmsu.edu)
Date: Sat Sep 09 1995 - 15:31:15 EDT


    Mark> Subject: RE>>Embedded language ID proposal Time: 5:42 PM
    Mark> Date: 9/8/95

    Mark> I am still unconvinced of the need to have language
    Mark> information in plain text; there are legitimate needs for
    Mark> that information, but there are needs for other particular
    Mark> attributes that go along with rich text, and it is hard to
    Mark> see why this one should be singled out.

For the most part I personally agree that language identifiers would
seem to belong most logically in markup.

But from a multilingual natural language processing perspective (and
perhaps others), having a single codeset with embedded language
identifier capability would provide an attractive reference text
representation.

Had the proposal not provided any utility for areas other than ours, I
doubt we would have bothered to present it other than as an
idiosyncrasy of our particular Unicode support implementation.

    Mark> In terms of commenting on these particular suggested private
    Mark> use implementations, the string scheme (LANG_ID_START text
    Mark> LANG_ID_END) has the very considerable drawback of
    Mark> introducing fr_FRgarbageen_US into data streams that don't
    Mark> recognize LANG_ID_START, LANG_ID_END. Using independent
    Mark> private use characters exclusively at least allows other
    Mark> implementations to filter them out without knowing
    Mark> bracketing semantics.

Telling point. I hadn't thought of that.
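
A minimal Python sketch of that difference, for anyone who wants to see
it concretely. Every code point below (the LANG_ID_START and LANG_ID_END
values and the two per-language characters) is a hypothetical assignment
picked for illustration, not something taken from the proposal. The
filter stands in for an implementation that knows nothing about either
scheme and simply drops private use characters.

```python
# Hypothetical private use assignments; none of these come from the proposal.
LANG_ID_START = "\uE000"
LANG_ID_END = "\uE001"
LANG_FR = "\uE100"   # one independent character meaning "French follows"
LANG_EN = "\uE101"   # one independent character meaning "English follows"


def is_private_use(ch):
    """True if ch lies in the BMP private use area (U+E000..U+F8FF)."""
    return 0xE000 <= ord(ch) <= 0xF8FF


def strip_private_use(text):
    """What a receiver unaware of either scheme might do: drop private use characters."""
    return "".join(ch for ch in text if not is_private_use(ch))


# String scheme: the language tag is spelled in ordinary letters between delimiters.
bracketed = (LANG_ID_START + "fr_FR" + LANG_ID_END + "bonjour"
             + LANG_ID_START + "en_US" + LANG_ID_END + "hello")

# Independent scheme: each language identifier is a single private use character.
independent = LANG_FR + "bonjour" + LANG_EN + "hello"

print(strip_private_use(bracketed))    # fr_FRbonjouren_UShello
print(strip_private_use(independent))  # bonjourhello
```

With the string scheme the tag text survives the filter and pollutes the
data; with independent characters nothing is left behind.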

    Mark> As far as terminology goes, these are not combining
    Mark> characters: they are not positioned relative to a preceding
    Mark> base character; they are not positioned at all! They are
    Mark> more akin to the formatting characters such as RLM or ZWJ.

Our initial conclusion as well.
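
As a rough illustration of the distinction, the sketch below uses
Python's standard unicodedata module to compare RLM and ZWJ with a true
combining mark. The private use sample (U+E000) is an arbitrary choice,
not a code point from the proposal. Note that the character database can
only report a private use character as category Co; any formatting-like
behavior would exist purely by private agreement.

```python
import unicodedata

samples = {
    "RLM": "\u200F",                     # RIGHT-TO-LEFT MARK
    "ZWJ": "\u200D",                     # ZERO WIDTH JOINER
    "COMBINING ACUTE ACCENT": "\u0301",  # a genuine combining character
    "private use U+E000": "\uE000",      # arbitrary private use sample
}


def describe(ch):
    """Return (general category, canonical combining class) for ch."""
    return unicodedata.category(ch), unicodedata.combining(ch)


for label, ch in samples.items():
    print(label, describe(ch))

# RLM ('Cf', 0)
# ZWJ ('Cf', 0)
# COMBINING ACUTE ACCENT ('Mn', 230)
# private use U+E000 ('Co', 0)
```

RLM and ZWJ are format characters (Cf) with combining class 0, whereas
the accent is a nonspacing mark (Mn) with a nonzero combining class.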
-----------------------------------------------------------------------------
mleisher@crl.nmsu.edu
Mark Leisher
Computing Research Lab
New Mexico State University
Box 30001, Dept. 3CRL
Las Cruces, NM 88003

"The trick is not gaining the knowledge, but surviving the lessons."
    -- "Svaha," Charles de Lint


