I have some hope that Unicode plain text will prove useful, but I expect
people to mess with it. If we are going to standardize it, let us not add
anything that will cause these complications. This particularly includes,
in my opinion, language tags. Let plain text be plain. If you are marking
up language IDs, then admit you are doing markup, and either use a standard
markup language consistently, or use some rich text format.
I note that Ethnologue lists an amazing number of languages; more than 5 000
living languages if I recall correctly. I further note that what counts
officially as a living language changes: apparently Scots English was
recently registered by the E.C. as a "minority language" distinct from
British English. I note further that a language may change in relevant
ways throughout its history, so as to warrant being considered as a
different language: Anglo-Saxon, Middle English, and Modern English should
not be confused with one another, and Old French is obviously different
from Modern French even to someone with very poor French. Not only does
this enlarge the set of language tags considerably, but different authors
might also wish to draw the lines differently.
Markup information does not have to be in the character stream. You
_can_ store a document as a character stream and a parallel markup tree,
and in fact doing it that way makes it possible to have several
incompatible markup devices for the same base character sequence. There
have been word processors based on this idea.
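As a sketch of what such a parallel arrangement might look like, here is a
minimal stand-off annotation scheme: the plain-text character stream is kept
untouched, and each markup layer refers to it only by character offsets. The
layer names and tuple shape are illustrative assumptions, not any word
processor's actual format.

```python
# Stand-off markup sketch: the base text is pure plain text; markup lives
# in separate, independent layers that point into it by character offsets.
# Layer names and the (start, end, tag) shape are illustrative assumptions.

text = "Whan that Aprille with his shoures soote"

# Two independent, potentially incompatible markup layers over the same
# base character sequence. Neither layer touches the text itself.
language_layer = [(0, len(text), "lang=enm")]  # whole line tagged Middle English
emphasis_layer = [(10, 17, "emphasis")]        # marks only the word "Aprille"

def spans(text, layer):
    """Resolve each annotation in a layer to the substring it covers."""
    return [(text[start:end], tag) for start, end, tag in layer]

print(spans(text, language_layer))  # language markup, resolved
print(spans(text, emphasis_layer))  # emphasis markup, resolved
```

The point the example illustrates: because each layer is self-contained,
layers that would conflict if interleaved into the character stream can
coexist over one unmodified base text.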
This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:32 EDT