On Thu, 5 Jun 1997, Chris Newman wrote:
> On Thu, 5 Jun 1997, Martin J. Duerst wrote:
> > On Thu, 5 Jun 1997, Chris Newman wrote:
> > > So Multi-Lingual text can be formed by combining a multi-lingual coded
> > > character set with language tags.
> > Good morning! Guten Morgen! Bonjour! Bongiorno! Ohayou gozaimasu!
> I do not equate "human readable text" simply with a visual representation.
> There are numerous processes, such as collation, spell check, text to
> speech, machine translation, etc which require knowledge of the language.
Humans read the visual representation. "human readable text" IS the
visual representation. Spell checking, text-to-speech, and translation
are not reading, and can be done by humans without language tags.
What you are speaking about is a form of "computer processable text",
which is not the same as "human readable text".
And even for computer processing, I am not sure that language tagging
is the one-and-only solution. Let's have a look at spell-checking.
Spell-checking developed in single-language contexts. We have very
good single-language spell checkers now. When people make multilingual
systems, they combine various single-language spell checkers.
Office 97 is a very good example. I recently transferred a presentation
to the laptop of a colleague who had Office 97. When I opened the
presentation, all text was underlined with red waves. I found
out that the default language of my friend was French, but my
presentation was in English. Setting the language to English
for the whole document removed most of the red waves.
It's not difficult to think one step ahead. Having found that many
"wrongly spelled" French words, the checker could easily have
concluded that French was a bad working hypothesis, and could
have tried with other languages, maybe first applying a general
language detection heuristic. It could show a short dialog saying
"Looks like the document is English; ok to use this for spell
checking and other operations?". When going through the document,
on checking, the dialog that lists "correct", "learn",... could
also show "accept as French" if that's a viable alternative.
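
To make the idea concrete, here is a minimal sketch in Python. It is
not how Office 97 or any real checker works; the tiny word lists, the
0.5 threshold, and the function names are all assumptions made up for
illustration. It drops the default language when too many words fail
against it, and it lists other languages that would accept a given word.

```python
# Hypothetical miniature word lists; a real checker would load full dictionaries.
WORDLISTS = {
    "fr": {"le", "la", "est", "un", "une", "chat", "et", "sur", "tapis"},
    "en": {"the", "cat", "is", "on", "a", "mat", "and", "sat"},
}


def misspelled_ratio(words, lang):
    """Fraction of words that are not in the word list for `lang`."""
    if not words:
        return 0.0
    unknown = sum(1 for w in words if w not in WORDLISTS[lang])
    return unknown / len(words)


def suggest_language(text, default_lang, threshold=0.5):
    """Return the language the checker should propose to the user.

    The default is kept unless more than `threshold` of the words fail
    against it; then the language with the fewest unknown words wins.
    """
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    words = [w for w in words if w]
    if misspelled_ratio(words, default_lang) <= threshold:
        return default_lang
    return min(WORDLISTS, key=lambda lang: misspelled_ratio(words, lang))


def alternatives_for(word, current_lang):
    """Other languages whose word list accepts `word` ("accept as French")."""
    w = word.lower()
    return [lang for lang, words in WORDLISTS.items()
            if lang != current_lang and w in words]
```

With these toy lists, suggest_language("The cat sat on the mat.", "fr")
proposes "en", and alternatives_for("chat", "en") offers "fr". The result
is only a suggestion; the user still confirms it in the dialog.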
So I predict that we will move from a simple model completely
separating the languages to a more integrated model (which can
be implemented on top of already existing technology). I would
be surprised if ideas in this direction were not already being
pondered by word processing software makers.
For text-to-speech and for MT, the same applies, with somewhat
different timing (because they are more demanding, and not yet
as developed) but also better prospects (because they work
with more knowledge about the language they are processing,
and therefore make it easier to detect language as a side
The idea that language tagging is necessary for MT is rather
ridiculous. It would take you about 10 minutes, or even less,
to learn to identify say Hungarian text and typical Hungarian
words. It will take you years to learn the language and become
a good translator. Computers aren't humans, but the relations
are pretty much the same.
> Unicode is sufficient (in most communities) to encode a visual
> representation of multi-lingual text.
Can you tell me what you mean by "in most communities"?
I might understand that together with "high typographic quality",
but not together with "*a* visual representation".
> But it can not be used to represent
> the intrinsic nature of multi-lingual text.
Would you then say that once printed out, or when written
by hand, texts that contain multiple languages cease to be
> > The text itself represents the language. If it doesn't,
> > invisible language tags won't help.
> Yes, but a visual representation of the text probably doesn't include
> language information. What language is "I"? Is it a typo or is it
> spelled correctly? Since all you'll receive is a visual encoding of the
> character, you can't know. If this were tagged with "en", then you would
> know it is a correctly spelled word. Some other tags would indicate
What language is "I"? That's a good question. What language are the
"I"s appearing as pixel patterns on your screen, or as ink on your
paper? What language is the lead character used by a typesetter?
What language is the "I" you write on paper? What language is the
"I" when it's represented as the bit pattern 0x49 in a memory cell
of your computer?
The answer is that the "I" is a character, which in and by itself
doesn't have any language, and doesn't need to have any.
Whether it's at its appropriate place ("spelled correctly") is not
a property of the "I" itself, but of the context it appears in.
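
A small Python sketch of this point: the bit pattern 0x49 decodes to one
and the same character whatever the language, and whether it is "spelled
correctly" depends only on the context it is checked against. The word
lists and the helper fits_context are hypothetical, invented for this
illustration; only the unicodedata module is a real library.

```python
import unicodedata

# The byte 0x49 decodes to the same character regardless of any language.
ch = bytes([0x49]).decode("ascii")
name = unicodedata.name(ch)  # the character name carries a script, not a language

# Hypothetical, tiny word lists standing in for real dictionaries.
WORDS = {
    "en": {"i", "am", "here"},          # "I" is the English pronoun
    "it": {"i", "libri", "sono", "qui"},  # "i" is an Italian plural article
    "de": {"ich", "bin", "hier"},       # a lone "I" is not a German word
}


def fits_context(token, lang):
    """True if `token` is an acceptable word in the given language context."""
    return token.lower() in WORDS[lang]
```

Here fits_context(ch, "en") and fits_context(ch, "it") both hold, while
fits_context(ch, "de") does not, although the character is identical in
all three cases.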
And if I receive it, via email, ACAP, or whatever, I will either
be able to understand it in the context it appears in, or I won't.
If it appears with enough context (which can be inside or outside
the mail I receive), language tagging will be irrelevant. If it
doesn't have enough context, I won't see the language tags either,
and they would be one of the most inconvenient ways to provide
> > > So what is plain text? I would define plain text as text where no
> > > character has multiple meanings or interpretations.
> > Would be nice to hear from you what you mean by this.
> If a character can be used to represent part of the text, it is never used
> to encode something else.
I understand what you mean. But that would mean that text appearing in
MIME body parts would not qualify, would it? The boundary marker is not
allowed to appear inside the text itself.
And saying that one edits HTML with a plain text editor would also have
to be forbidden.
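
The MIME case can be sketched as follows, assuming the multipart rule
that a boundary delimiter ("--" followed by the boundary string) must not
start any line of an enclosed part. The function names and the
counter-based boundary naming are assumptions for illustration, not any
mail library's API.

```python
import itertools


def boundary_conflicts(body, boundary):
    """True if some line of `body` starts with the delimiter "--" + boundary."""
    delimiter = "--" + boundary
    return any(line.startswith(delimiter) for line in body.splitlines())


def pick_boundary(body, base="boundary"):
    """Return the first candidate (base0, base1, ...) that is safe for `body`."""
    for n in itertools.count():
        candidate = f"{base}{n}"
        if not boundary_conflicts(body, candidate):
            return candidate
```

So the characters of the boundary do carry a second meaning inside a
multipart body, and the sender has to choose them around the text.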
> > > I disagree. Unicode plain text is an encoding of a visual representation
> > > of multilingual plain text. It is not multilingual plain text since it
> > > does not carry complete language information.
> > It carries all the necessary information. The computers may be
> > too dumb to figure it out, but that doesn't affect the information
> > as such.
> Human text is not simply a visual representation.
No. It's mainly semantics, meaning. But in the context of computers,
that's quickly forgotten.
> A visual representation
> of a short word or phrase is often not sufficient to identify the language
> by a computer or a human.
Yes for short words and in particular for letters. Not really that much
for short phrases. In some cases not necessary, when it's a common word.
And where it's necessary, language tagging might help the computer, but
because it doesn't help the human reader, it's a bad idea to have the
computer rely on it too much. And it's a bad idea to define language
tags as a necessary property of multilingual text.
> Unicode choose not to have separate copies of
> the same character for use with different languages. This makes a lot of
> sense (given dialects and subdialects), but it does remove semantic
> information about the text.
No, it doesn't. The information is all there. It's not encoded in a
way that may be easily accessible to a computer at all times, but
that in no way means it isn't there.
This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:34 EDT