Re: Embedded language ID pr

From: Mark Davis
Date: Sat Sep 09 1995 - 20:17:44 EDT

Subject: RE>>Embedded language ID pr Time: 4:13 PM Date: 9/9/95

I guess a clearer question would be: what do you want to use language ids for,
and why is it that you don't use rich text in that context?

Date: 9/9/95 12:37 PM
To: Mark Davis
From: Mark Leisher
    Mark> Subject: RE>>Embedded language ID proposal Time: 5:42 PM
    Mark> Date: 9/8/95

    Mark> I am still unconvinced of the need to have language
    Mark> information in plain text; there are legitimate needs for
    Mark> that information, but there are needs for other particular
    Mark> attributes that go along with rich text, and it is hard to
    Mark> see why this one should be singled out.

For the most part I personally agree that language identifiers would
seem to belong most logically in markup.

But from a multilingual natural language processing perspective (and
perhaps others), having a single codeset with embedded language
identifier capability would provide an attractive reference text

Had the proposal not provided any utility for areas other than ours, I
doubt we would have bothered to present it other than as an
idiosyncrasy of our particular Unicode support implementation.

    Mark> In terms of commenting on these particular suggested private
    Mark> use implementations, the string scheme (LANG_ID_START text
    Mark> LANG_ID_END) has the very considerable drawback of
    Mark> introducing fr_FRgarbageen_US into data streams that don't
    Mark> recognize LANG_ID_START, LANG_ID_END. Using independent
    Mark> private use characters exclusively at least allows other
    Mark> implementations to filter them out without knowing
    Mark> bracketing semantics.

Telling point. I hadn't thought of that.
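
To make the filtering argument above concrete, here is a minimal sketch in
Python. The code point values and the names LANG_FR_FR and LANG_EN_US are
hypothetical assignments chosen only for illustration; the thread does not
say which private use values the proposal uses. The sketch assumes a
receiving implementation that knows nothing about either scheme and simply
drops every private use character (general category "Co").

```python
import unicodedata

# Hypothetical private use assignments, for illustration only.
LANG_ID_START = "\uE000"   # opens a spelled-out language tag (string scheme)
LANG_ID_END = "\uE001"     # closes a spelled-out language tag (string scheme)
LANG_FR_FR = "\uE010"      # one character per language (independent scheme)
LANG_EN_US = "\uE011"


def strip_private_use(text):
    """Drop every private use character without interpreting it."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Co")


# String scheme: the tag is spelled in ordinary characters between markers.
string_scheme = (
    LANG_ID_START + "fr_FR" + LANG_ID_END
    + "garbage"
    + LANG_ID_START + "en_US" + LANG_ID_END
)

# Independent scheme: each language tag is a single private use character.
independent = LANG_FR_FR + "garbage" + LANG_EN_US

print(strip_private_use(string_scheme))  # fr_FRgarbageen_US
print(strip_private_use(independent))    # garbage
```

With the string scheme the tag text survives the filter and pollutes the
data; with independent characters the filter removes the tags cleanly
without knowing anything about bracketing.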

    Mark> As far as terminology goes, these are not combining
    Mark> characters: they are not positioned relative to a preceding
    Mark> base character; they are not positioned at all! They are
    Mark> more akin to the formatting characters such as RLM or ZWJ.

Our initial conclusion as well.
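
The distinction can be checked against character properties. The sketch
below uses Python's standard unicodedata module, which reflects a much
later version of the Unicode character database than the one current in
this thread, so treat it as an illustration of the terminology rather than
a statement about the 1995 data. U+E000 stands in for a hypothetical
language identifier character.

```python
import unicodedata


def describe(ch):
    """Return (general category, canonical combining class) for a character."""
    return unicodedata.category(ch), unicodedata.combining(ch)


# A true combining character: positioned relative to a preceding base.
print(describe("\u0301"))  # COMBINING ACUTE ACCENT -> ('Mn', 230)

# Formatting characters: not positioned at all.
print(describe("\u200F"))  # RIGHT-TO-LEFT MARK     -> ('Cf', 0)
print(describe("\u200D"))  # ZERO WIDTH JOINER      -> ('Cf', 0)

# A private use character (hypothetical language identifier).
print(describe("\uE000"))  # -> ('Co', 0)
```

Only the accent carries a non-zero combining class; RLM, ZWJ, and the
private use character all report class 0, which is consistent with
treating language identifiers like formatting characters.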

Mark Leisher
Computing Research Lab
New Mexico State University
Box 30001, Dept. 3CRL
Las Cruces, NM 88003

"The trick is not gaining the knowledge,
but surviving the lessons."
-- "Svaha," Charles de Lint

------------------ RFC822 Header Follows ------------------
Date: Sat, 9 Sep 95 12:23:10 -0700
From: unicode@Unicode.ORG
Message-Id: <9509091923.AA25009@Unicode.ORG>
Reply-To: (Mark Leisher)
Errors-To: uni-bounce@Unicode.ORG
Subject: Re: Embedded language ID pr
To: unicode@Unicode.ORG
