Re: Embedded language ID proposal

From: Mark Leisher (mleisher@crl.nmsu.edu)
Date: Thu Sep 07 1995 - 13:42:33 EDT

Next message: Joan Aliprand: "Unicode & libraries: report from IFLA Conference"
Previous message: Hannu Aronsson: "Re: Embedded language ID proposal"
In reply to: Hannu Aronsson: "Re: Embedded language ID proposal"
Next in thread: chris@logos.com: "RE: Embedded language ID proposal"

Hannu, you have brought up a number of good points.

    Hannu> I don't think any language numbering scheme will work well,
    Hannu> because it makes it too hard / impossible to add new
    Hannu> languages or variants of languages quickly.

This depends on the numbering scheme. For example, if languages are
numbered in the order they appear on the list, a new language is
simply given the next available number.
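
A minimal sketch of such an append-only numbering, in Python. The class
and method names are hypothetical and not part of any proposal; the
only point is that adding a language never renumbers an existing one.

```python
class LanguageRegistry:
    """Append-only table: a language's number is its position in the list."""

    def __init__(self, names=()):
        self._names = []      # number -> name
        self._numbers = {}    # name -> number
        for name in names:
            self.register(name)

    def register(self, name):
        """Return the existing number for name, or assign the next free one."""
        if name in self._numbers:
            return self._numbers[name]
        number = len(self._names)
        self._names.append(name)
        self._numbers[name] = number
        return number

    def name_of(self, number):
        """Look a language up by its number."""
        return self._names[number]
```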

Approaches based on "frequently occurring" languages would make it
somewhat more difficult to assign numbers without conflicts.

    Hannu> No list of languages will contain everything the users will
    Hannu> need, so the only working solution is string-based,
    Hannu> i.e. something like the locale names (e.g. en_us, en_uk,
    Hannu> ...) used in many places:
    Hannu> - The basic language (first 2 or 3 letters, 3 need to be
    Hannu>   allowed too) are standardized, so you can always get to
    Hannu>   the major language
    Hannu> - They're easy to extend, as the remaining part can be
    Hannu>   anything, including user-defined stuff.
    Hannu> There are many standards about language identification in
    Hannu> this way (e.g. HTML 3) where it has been found a very good
    Hannu> and working way to mark language.

As was mentioned in an earlier message on this list, country names are
essentially politically generated, and basing a standard on them means
changes come too frequently for the standard to be stable.

I happen to agree with that assessment.

    Hannu> The only practical way to add custom languages to a
    Hannu> enumerated system is to reserve a range of codes for custom
    Hannu> languages, but a unknown number does not say anything.

    Hannu> A string approach, for example, en_fi still tells you it's
    Hannu> english with probably a Finnish specialty (e.g. currencies
    Hannu> or numbers might be in Finnish format), but you still know
    Hannu> it's basically english. When you get an unknown number,
    Hannu> there is nothing you can do.

Again, depending on the numbering system chosen, even an unknown
number may contain enough information to determine what language to
fall back on.

For example, if you interpret the 16-bit value in our proposal as
Win32 language ids, then given any language id, you at least know
which language family the language belongs to.
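
A rough sketch of that fallback, in Python. It relies on the Win32
layout in which the low 10 bits of a 16-bit language id hold the
primary language and the high 6 bits hold the sublanguage; the primary
language plays the role of the "language family" here. The two lookup
tables and the resolve() function are illustrative assumptions, not
part of the proposal.

```python
PRIMARY_MASK = 0x03FF   # low 10 bits: primary language
SUBLANG_SHIFT = 10      # high 6 bits: sublanguage

# Illustrative tables only; a real system would carry far more entries.
SUPPORTED = {
    0x0409: "English (United States)",
    0x0809: "English (United Kingdom)",
    0x040B: "Finnish (Finland)",
}
PRIMARY_NAMES = {0x09: "English", 0x0B: "Finnish"}


def primary_lang(langid):
    """Extract the primary language from a 16-bit language id."""
    return langid & PRIMARY_MASK


def sub_lang(langid):
    """Extract the sublanguage from a 16-bit language id."""
    return (langid >> SUBLANG_SHIFT) & 0x3F


def resolve(langid):
    """Exact match if known, else fall back to the primary language."""
    if langid in SUPPORTED:
        return SUPPORTED[langid]
    return PRIMARY_NAMES.get(primary_lang(langid))
```

An id with an unrecognized sublanguage still resolves to its primary
language; only an id whose primary language is also unknown yields
nothing.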

    Hannu> Every application/system will need a mapping table from the
    Hannu> language numbers to whatever they use internally, which is
    Hannu> probably something locale-style. Having one more mapping
    Hannu> table to keep up-to-date is a added burden.

This may be true. Any language id proposal that gets accepted by a
standardization organization will cause someone problems.

But if the approach is designed well, then this kind of maintenance
may be minimal or even unnecessary. The primary cost would be the
initial conversion to the adopted approach.

    Hannu> Also, your proposal would introduce a strange set of
    Hannu> combining characters to Unicode, whose parsing is different
    Hannu> from everything else and thus would complicate the
    Hannu> standard.

We are currently interpreting these codepoints as control code types
with "other neutral" bidirectional behavior. We are still trying to
decide whether this is a good idea or not.

The only change needed in our code was to check for the existence of
these codepoints in a similar fashion to the pseudo code in the
original mailing.

We hadn't considered viewing them as combining characters, but that
seems to make a certain amount of sense as they must effectively be
"combined." More to think about!

    Hannu> Anyway, we have learned from the working and simple
    Hannu> internet RFC standard conventions (and many other places),
    Hannu> that it's much better to use strings to describe things
    Hannu> instead of magic numbers, especially size-limited numbers.
    Hannu> Strings are about easy to handle, anyway, and offer
    Hannu> infinite extensibility and can be human readable too.

I happen to like strings myself and have a great deal of respect for
the reasons strings are chosen in RFCs, but when processing extremely
large corpora (multi-gigabyte, terabyte), looking up language support
from a string would seem to add a rather noticeable amount of overhead
compared to a numeric approach.

    Hannu> Even if you wanted to do a language enumeration system, it
    Hannu> would be better to do it using the extension to about 1
    Hannu> million codepoints and reserve from there a range of
    Hannu> codepoints for language IDs.

This is the ideal situation. The obvious question is: where will we
get those million codepoints? If we extend our approach to 32
codepoints, then we can construct identifiers that will allow 2^32 - 1
possible language ids (somewhat excessive, but over a million
possibilities :-).

    Hannu> I agree that we might need some mechanism to indicate
    Hannu> language in Unicode *plain text* files, as everywhere else
    Hannu> you already have some form of tagging on top of plain
    Hannu> Unicode, e.g. SGML-based, so you can use that for language
    Hannu> identification too.

Higher-level language identification works quite well. But when you
get documents with different higher-level markup from different
systems, it makes processing that text annoyingly complicated.

    Hannu> But better than some binary enumeration, would be e.g. to
    Hannu> define 2 additional characters, LANG_ID_START and
    Hannu> LANG_ID_END, and define that language would be indicated
    Hannu> with
    Hannu>     LANG_ID_START
    Hannu>     <2 or 3 character standard language code, ASCII values only>
    Hannu>     _ or . (optional)
    Hannu>     <1*N character detail code> (optional)
    Hannu>     LANG_ID_END
    Hannu> This is not really good either, but it would be a little
    Hannu> better than the proposed numbering approach, I think. It's
    Hannu> about as easy to parse, and doesn't suffer from the
    Hannu> limitations of numeric enumeration.

This is a reasonable approach. It may even be nearly as efficient
when scanning text, but there is still the cost of determining the
language from some combination of the values between LANG_ID_START and
LANG_ID_END, particularly if those values represent a string.

To reiterate, using a string to determine the language is noticeably
slower than a numeric approach. In addition, the lookup is often
implemented by converting the string to a number first anyway.
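
One way that string-to-number conversion might look, sketched in
Python. TagInterner and its methods are hypothetical names: each
distinct tag is hashed once to get a small integer, and every later
lookup is a plain list index on that integer.

```python
class TagInterner:
    """Map language tag strings to small integers, then index by integer."""

    def __init__(self):
        self._ids = {}        # lowercased tag -> integer id
        self._handlers = []   # integer id -> language support object

    def intern(self, tag, handler):
        """Assign an id to tag on first sight; later calls reuse it."""
        key = tag.lower()
        if key not in self._ids:
            self._ids[key] = len(self._handlers)
            self._handlers.append(handler)
        return self._ids[key]

    def id_of(self, tag):
        """String lookup: hashes the whole tag."""
        return self._ids[tag.lower()]

    def handler(self, lang_id):
        """Numeric lookup: a single list index."""
        return self._handlers[lang_id]
```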

Consider the case of looking for one of these sequences while scanning
backward through the text. You find LANG_ID_END, you go backward till
you find LANG_ID_START, then you have to look at everything between
LANG_ID_START and LANG_ID_END *again* to determine the language. Add
that to a (possibly) string representation of a language id, and you
have even more overhead looking up the language support.

The obvious argument against this is that in general, the number of
times a given program needs to scan backward is small compared to the
number of times it needs to scan forward.
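
The backward scan described above can be sketched in Python as follows.
The two codepoints are placeholders taken from the private use area,
since no values were ever assigned to LANG_ID_START and LANG_ID_END,
and language_before() is a hypothetical helper.

```python
# Placeholder codepoints (private use area); purely an assumption.
LANG_ID_START = "\uE000"
LANG_ID_END = "\uE001"


def language_before(text, pos):
    """Return the tag of the nearest complete language id ending before pos.

    Returns None if no complete START ... END pair precedes pos.
    """
    # First pass: walk backward to the closing delimiter.
    end = text.rfind(LANG_ID_END, 0, pos)
    if end == -1:
        return None
    # Keep walking backward to the matching opening delimiter.
    start = text.rfind(LANG_ID_START, 0, end)
    if start == -1:
        return None
    # Second pass: the span between the delimiters is read again,
    # this time forward, to recover the tag itself.
    return text[start + 1:end]
```

The returned string would still have to be mapped to language support,
which is the extra lookup cost mentioned earlier.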

    Hannu> We should not trade a little of programmer convenience for
    Hannu> major long-term limitations in any standard. The
    Hannu> programming work will (should be) done once in the
    Hannu> framework or OS libraries, anyway, so it won't trouble most
    Hannu> programmers, anyway.

I completely agree. Limitations introduced now will come back to hurt
us later. Ideally, we need an approach that is flexible enough to
meet sophisticated research or scholarly needs and is convenient for
implementers.

    Hannu> Most applications will have their own higher-level tagging
    Hannu> (SGML-style markup will probably dominate), so having a
    Hannu> different tagging mechanism for indicating language would
    Hannu> require handling two different kinds of markup.

    Hannu> The really good and desirable approach, I think, would be
    Hannu> to raise the abstraction level of "plain unicode text
    Hannu> file", to include e.g. SGML/HTML style tagging as a
    Hannu> standardized part of "plain text unicode files". Then we
    Hannu> could really easily indicate language, and whatever else
    Hannu> you might want, in a standard and easily parsable (and
    Hannu> easily ignorable or removable markup) format.

In my opinion, this would be the same as changing the SGML standard to
make the default character set some form of Unicode or 10646. It may
be a good idea, but it would take a *long* time to determine the
impact, and a *longer* time for the changes to propagate.

    Hannu> ASCII-style plain text files deserve to die.

"Plain text" certainly makes my life difficult at times, so I guess I
would have to agree :-)
-----------------------------------------------------------------------------
mleisher@crl.nmsu.edu
Mark Leisher
Computing Research Lab
New Mexico State University
Box 30001, Dept. 3CRL
Las Cruces, NM 88003

"The trick is not gaining the knowledge, but surviving the lessons."
    -- "Svaha," Charles de Lint


This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:30 EDT