Mark> Subject: RE>>Embedded language ID pr Time: 4:13 PM Date:
Mark> I guess a clearer question would be: what do you want
Mark> to use language ids for, and why is it that you don't use
Mark> rich text in that context?
We are currently using our own home-grown "rich text" as a common
representation. We have to. We get texts with a wide variety of
markup. Making every tool we use sensitive to all the different
markup schemes is a prohibitive task.
The question we ended up asking was, "For text processing tasks that
are language sensitive, can we identify a consistent correspondence
between the language identifiers in the different markup schemes?"
Our conclusion: probably not.
So, over time, our baseline representation ended up being plain text
with language identifiers.
Currently, our use of language identifiers is somewhat limited. They
are used to determine codeset and font (no widespread deployment of
Unicode yet), to decide when to switch segmentation algorithms, and as
flags to invoke other, language-specific tools (e.g. spell-checking,
sorting, morphological analysis).
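A minimal sketch of how plain text with language identifiers might
drive the decisions listed above. Everything here is hypothetical and
for illustration only: the names Run, CODESETS, SEGMENTERS and process,
the tag values "en", "ja" and "ru", and the codeset choices are
assumptions, not taken from any actual system. The per-character
segmenter is a naive stand-in for a real segmentation algorithm.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """A stretch of plain text carrying one language identifier."""
    lang: str  # hypothetical tag value, e.g. "en" or "ja"
    text: str


# Hypothetical mapping from language identifier to codeset name.
CODESETS = {"en": "iso-8859-1", "ja": "euc_jp", "ru": "koi8_r"}


def segment_whitespace(text):
    """Default segmentation: split on whitespace."""
    return text.split()


def segment_per_character(text):
    """Naive stand-in for a segmenter for text written without spaces."""
    return [ch for ch in text if not ch.isspace()]


# Languages that need something other than the default segmenter.
SEGMENTERS = {"ja": segment_per_character}


def process(runs):
    """Pick codeset and segmenter per run, based only on its language id."""
    out = []
    for run in runs:
        codeset = CODESETS.get(run.lang, "ascii")
        segmenter = SEGMENTERS.get(run.lang, segment_whitespace)
        out.append((run.lang, codeset, segmenter(run.text)))
    return out


result = process([Run("en", "plain text"), Run("ja", "日本語")])
```

The point of the sketch is that each tool looks only at the language
identifier on a run, never at the markup scheme the text came from.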
It may seem that we are being selfish about our particular needs
(this happens often enough in proposals to standardization committees),
but we feel that interest in a standard language identification scheme
crops up often enough to warrant at least one more discussion (such as
the one we are having now).
Another concern is that adopting a language id approach in a codeset
standard might set a bad precedent. It could open the door to other
features that don't really belong in a codeset standard.
So the question boils down to whether language id should live in
markup, at the codeset level, or at some point in between.
The crux of my argument (however unclearly stated) is that language
identifiers are now a general need and that markup schemes are too
different to allow consistent language identification.
I think that most of us would agree that markup schemes are very
important to the future of on-line resources. So no matter where
standard language ids are adopted, they should allow consistent
incorporation into any given markup scheme, present and future.
Mark Leisher                    "The trick is not gaining the knowledge,
Computing Research Lab           but surviving the lessons."
New Mexico State University        -- "Svaha," Charles de Lint
Box 30001, Dept. 3CRL
Las Cruces, NM 88003