Re: Unicode plain text (Was: Line Separator Character)

From: Edward Cherlin (cherlin@cauce.org)
Date: Thu May 22 1997 - 02:56:11 EDT


kenw@sybase.com (Kenneth Whistler), commenting on my previous message, did
an admirable job of summarizing the state of the problem of Unicode plain
text in terms of what the Unicode standard does and does not cover, and the
fact that a standard for use of such files must address many more issues. I
(Ed) agree with his summary entirely. My added comments here address the
issues of function of editors and renderers.

>I (Ken) commented:
>
>>But other than that, there is not much more to be said about a Unicode
>>plain text file. The usefulness of the concept lies in its simplicity.
>
>And Ed Cherlin responded:
>
>>
>> I disagree about the simplicity of the problem.
>
>And now I think I understand where we were miscommunicating. I was
>speaking of a Unicode plain text *file*, which I thought was the
>issue. And for that the issue is simple. A Unicode plain text *file*
>is Unicode plain text in a file (preferably marked with U+FEFF
>and in MSB byte order).
>
>But what Ed is addressing here is the standardization of the meaning
>of Unicode *plain text*--an issue which should be considered outside
>instantiation of that plain text in transmissible computer files.
>On that point I agree that there are a vast number of issues which
>require specification and standardization. And I do believe that the
>Unicode Standard is the correct place to address many of them. I've
>made the point before that one of the big differences between ISO/IEC
>10646 and the Unicode Standard is that 10646 standardizes the encodings
>and names of the characters, but that the Unicode Standard goes way
>beyond that and attempts to provide enough information (some
>normative and some informative) to enable meaningful and transmissible
>implementations of Unicode plain text.
>
>Below is Ed's list of leading issues. I've interspersed my comments
>indicating what I think the current Unicode Standard's take is on
>many of them. (Others may disagree, or may feel that things which
>are not covered should be.)
>
>> Some of the leading issues are:

>> byte order in storage and transmission
>
>Byte order is addressed by the Unicode Standard.

No problem there. We might want to go further and *require* a byte order mark.
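
As a rough illustration of what requiring the mark would mean for a reader
of such files, here is a minimal sketch in Python. The function names are
made up for this example, and the use of Python's UTF-16 codecs to stand in
for 16-bit Unicode is an assumption, not anything the standard prescribes.

```python
import codecs

def detect_byte_order(data: bytes) -> str:
    """Return 'big', 'little', or 'unmarked' from a leading U+FEFF."""
    if data.startswith(codecs.BOM_UTF16_BE):    # bytes FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):    # bytes FF FE
        return "little"
    return "unmarked"

def decode_marked_text(data: bytes) -> str:
    """Decode 16-bit text, refusing input that lacks a byte order mark."""
    order = detect_byte_order(data)
    if order == "unmarked":
        # Hypothetical policy: a required mark means we never guess.
        raise ValueError("no byte order mark; refusing to guess")
    codec = "utf-16-be" if order == "big" else "utf-16-le"
    return data[2:].decode(codec)               # drop the 2-byte mark
```

Rejecting unmarked input is only one possible policy; a more forgiving
reader could fall back to MSB order instead.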

>> line, paragraph, and page breaks
>
>The Unicode Standard specifies LINE SEPARATOR and PARAGRAPH SEPARATOR,
>but considers page break to be out of scope.

That would have to be addressed, because it will be used.
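
To make the three levels concrete, here is a small Python sketch that
splits text into pages, paragraphs, and lines. LINE SEPARATOR (U+2028) and
PARAGRAPH SEPARATOR (U+2029) are the standard's characters; treating FORM
FEED (U+000C) as the page break is purely my assumption for illustration,
and split_pages is a hypothetical name.

```python
LINE_SEPARATOR = "\u2028"
PARAGRAPH_SEPARATOR = "\u2029"
PAGE_BREAK = "\u000c"   # assumption: FORM FEED stands in for a page break

def split_pages(text):
    """Return a list of pages; each page is a list of paragraphs;
    each paragraph is a list of lines."""
    return [
        [para.split(LINE_SEPARATOR)
         for para in page.split(PARAGRAPH_SEPARATOR)]
        for page in text.split(PAGE_BREAK)
    ]
```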

>> BIDI (Hebrew, Arabic, etc.)
>
>The normative bidi algorithm is specified in great detail in
>the Unicode Standard.

So Unicode text editors should be required to implement it correctly, if
they handle BIDI at all.
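
The starting point for any implementation is the directional type of each
character. The sketch below only looks those types up using Python's
unicodedata module; it is not an implementation of the bidi algorithm, and
the property values come from whatever version of the character database
the Python build ships with. The function name is hypothetical.

```python
import unicodedata

def bidi_classes(text):
    """Pair each character with its bidirectional type, e.g. 'L' or 'R'."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]
```

For a Latin letter followed by HEBREW LETTER ALEF (U+05D0) this reports a
left-to-right type and then a right-to-left type; resolving such runs into
display order is the part the standard specifies in detail.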

>> non-linear scripts (Indic, Korean, Mongolian, Ethiopian, etc.)
>
>The Unicode Standard considers specification of script behavior to
>be part of the desired content of the standard. It doesn't do an
>equally detailed accounting of all cases, mostly due to resource
>and information constraints. But Devanagari and Tamil script
>handling are provided in significant detail as a guide to Indian
>script behavior, and there is an extensive discussion of Arabic
>script shaping behavior. There is a specification
>of normative behavior for Hangul combining jamo. If we could get
>equally detailed expert contributions for each complex script,
>I expect the inclination of the UTC and the editors would be to
>include them in the standard, for everybody's benefit.

That would be a very great improvement.

>> multiply accented characters (IPA, math, several human languages)
>
>This is considered an integral part of the Unicode Standard, and
>is detailed with both normative and informative sections.

So should it be required in all editors? I think so.
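
At a minimum an editor has to keep a base character and its marks together
as one unit for cursor movement and deletion. A minimal Python sketch,
assuming that a nonzero canonical combining class is a good enough test for
"combining mark" (it misses marks whose class is zero), with a made-up
function name:

```python
import unicodedata

def combining_sequences(text):
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        if unicodedata.combining(ch) and clusters:
            clusters[-1] += ch      # mark attaches to the preceding base
        else:
            clusters.append(ch)     # new base (or a stray leading mark)
    return clusters
```

For example, "a" followed by U+0308 and U+0301 comes back as a single
three-character unit rather than three separate items.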

>> math
>
>There is a definite gap here, though the topic has been a continuing
>one for the UTC. The consensus seems to be that we would like to
>get a consistent model of plain text math formula construction
>stated, to make such information exchangeable in Unicode plain text.

There has been some good work on this reported at IUC conferences. An
option in an editor, for now anyway.

>> compatibility characters
>
>These are now completely specified in the Unicode Standard names list.

It should be possible to use them, but the user should have to choose to
activate them.
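
One way an editor could implement that opt-in is sketched below in Python.
It assumes that a tagged decomposition in the character database (for
example "<compat> 0066 0069" for the fi ligature, U+FB01) is an acceptable
test for "compatibility character"; both function names are hypothetical.

```python
import unicodedata

def is_compatibility_character(ch):
    """True if the character has a tagged (compatibility) decomposition."""
    return unicodedata.decomposition(ch).startswith("<")

def flag_compatibility(text, allow=False):
    """Return positions of compatibility characters, unless the user
    has chosen to allow them."""
    if allow:
        return []
    return [i for i, ch in enumerate(text)
            if is_compatibility_character(ch)]
```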

>> private use characters
>
>Also specified by the standard, although the interpretation of
>particular usages of private use characters is, by definition, out
>of scope for the standard. But there has been some effort by people
>to make available specifications of their particular private or
>corporate private usage repertoires of private use characters.

I don't know of any particular behavior that could be required of software,
other than the option of marking them all as unrecognized.

>> control codes
>
>If you mean by this, U+0000 .. U+001F, U+0080..U+009F and the
>control chimera U+007F, then the Unicode Standard does provide
>an answer. It doesn't try to reinvent control function standards,
>but it says those characters should be interpreted as if they
>were 16-bit analogues of the 8-bit encodings of the corresponding
>control functions. Maybe unsatisfying, but probably the best we
>can expect, given existing control code usage.

More precision is required, I think, at least for CR, LF, HT, and FF.
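
As one example of the kind of precision I mean, here is a Python sketch of
a single possible rule for line ends: CRLF, lone CR, and lone LF are all
mapped to LINE SEPARATOR. The choice of rule and the function name are my
assumptions, not anything in the standard; HT and FF are left untouched
here because their treatment is exactly what remains to be decided.

```python
import re

LINE_SEPARATOR = "\u2028"

# CRLF must be tried first so it becomes one separator, not two.
_LINE_ENDS = re.compile(r"\r\n|\r|\n")

def normalize_line_ends(text):
    """Replace CRLF, CR, or LF with U+2028 LINE SEPARATOR."""
    return _LINE_ENDS.sub(LINE_SEPARATOR, text)
```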

>> other deprecated characters
>
>There may be room for improvement here, but the Unicode Standard
>has had to tread a little carefully here. There are political
>consequences in crying out too loudly that xyz are *deprecated*
>when xyz may be somebody else's favorite set they lobbied hard
>to get in!

We can't just forbid them, certainly.

>> surrogates, especially unpaired surrogate codes
>
>Surrogate usage (in general, as opposed to particular encodings
>for surrogate pairs, none of which exist yet) is fully specified
>by the Unicode Standard.

OK. Unpaired surrogate codes should be marked in some way in rendering
plain text.
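
A minimal sketch of such marking, in Python, working on a list of 16-bit
code values. Substituting U+FFFD REPLACEMENT CHARACTER is my assumption
about what "marked" could mean; a renderer might equally draw a special
glyph. The function name is hypothetical.

```python
def mark_unpaired_surrogates(units):
    """Replace each unpaired surrogate in a list of 16-bit values
    with 0xFFFD; well-formed high+low pairs pass through unchanged."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        if is_high and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
            out.extend([u, units[i + 1]])   # valid pair: keep both
            i += 2
        elif 0xD800 <= u <= 0xDFFF:
            out.append(0xFFFD)              # lone high or lone low
            i += 1
        else:
            out.append(u)                   # ordinary code value
            i += 1
    return out
```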

>> non-character values
>
>As opposed to unassigned character values, there are only two
>non-character values in Unicode: 0xFFFE and 0xFFFF. The standard
>specifies that 0xFFFE is the illegal byte-swapped version of
>U+FEFF. The use of 0xFFFF is deliberately unspecified and is
>untransmissible by design.

Why do I think someone is going to decide to use it? :(
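
Software can at least notice when that happens. A small Python sketch,
limited to the two values Ken names and using a made-up function name:

```python
NONCHARACTERS = {0xFFFE, 0xFFFF}   # the two values named above

def find_noncharacters(units):
    """Return (position, value) for each non-character value found
    in a list of 16-bit code values."""
    return [(i, u) for i, u in enumerate(units) if u in NONCHARACTERS]
```

A leading 0xFFFE in particular suggests the data was read in the wrong
byte order rather than that someone stored the value on purpose.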

>> text processing algorithms (sorting, upper and lower case, pattern matching)
>
>Default case mapping is provided as an informative part of the
>Unicode Standard. Language-specific casing is effectively also
>a part of the standard, since everybody knows the few instances
>in question: Turkish i, the debatable French accents, German ß, etc.,
>and they are discussed in the standard.
>
>Beyond that, sorting, pattern matching, etc. are out of scope of
>the Unicode Standard (though some implementation guidelines are
>provided), and, in my opinion, appropriately belong to other standards
>under development.

The question is to some degree whether there is or will be a standard
library of string functions, as there has been in C and C++. Of course I
recognize that there were many such libraries, and perhaps that is
unavoidable.
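
The casing exceptions Ken lists show why one default function is not
enough. In the Python sketch below, str.upper() stands in for a default
case mapping, and upper_turkish is a hypothetical helper showing the sort
of language-specific wrapper a string library would need.

```python
def upper_default(text):
    """Default, language-independent uppercasing."""
    return text.upper()

def upper_turkish(text):
    """Uppercasing with the Turkish dotted/dotless i distinction:
    i -> U+0130 (capital I with dot), U+0131 (dotless i) -> I."""
    return text.replace("i", "\u0130").replace("\u0131", "I").upper()
```

With these, the default mapping turns "i" into "I" and German "straße"
into "STRASSE", while the Turkish variant turns "i" into U+0130 instead.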

>> Full portability of data requires some rules. If there is no standard,
>> users of "Unicode text files" will make every possible choice about each of
>> these issues. CRLF will be nothing in comparison. We have begun to see
>> programs that can handle CRLF, CR alone, and LF alone, either line-by-line
>> or in paragraph format, reading and writing in any option. The range of
>> choices for Unicode is far greater, and I don't want to think about how
>> long it would take to achieve unity if we don't do it now.
>
>Yes, but... The goal is interchangeable plain text that is legible
>when interpreted and rendered in accord with the standard. The goal
>is not to force everyone to "spell" multilingual text exactly the
>same way. The drafters of the Unicode Standard tried to place normative
>requirements on plain text where failure to do so would lead to
>complete chaos. Obvious examples are specification that combining
>marks must follow (not precede) their base character, and specification
>of the complete bidi algorithm. Failure to specify either of these
>would clearly have led to uninterpretable gibberish if everyone
>made up their own rules, and that was clearly understood by the
>members of the Unicode Technical Committee.

I think the best way to discuss this is over some sample texts.

I don't know how much time I can put into this, but if I can I will go
through the standard and see if I can pick out anything else that might be
a problem.

>But one draws the line somewhere. No one wants to legislate against
>people, for example, making cross-linguistic puns in text by
>spelling out Russian words with Latin letters, or any other
>"inappropriate" or creative usage of the characters at
>their disposal, once Unicode implementations become more widely
>available. Half the joy of having universal multilingual text
>implemented on computers will be seeing what creative and fantastic
>new inventions millions of users put it to.
>
>--Ken Whistler

Think of the smilies we can make. %-]

--
Edward Cherlin       Help outlaw Spam     Everything should be made
Vice President     http://www.cauce.org      as simple as possible,
NewbieNet, Inc.  1000 members and counting      __but no simpler__.
http://www.newbie.net/    17 May 97   Attributed to Albert Einstein


