I (Ken) commented:
>But other than that, there is not much more to be said about a Unicode
>plain text file. The usefulness of the concept lies in its simplicity.
And Ed Cherlin responded:
> I disagree about the simplicity of the problem.
And now I think I understand where we were miscommunicating. I was
speaking of a Unicode plain text *file*, which I thought was the
issue. And for that the issue is simple. A Unicode plain text *file*
is Unicode plain text in a file (preferably marked with U+FEFF
and in MSB byte order).
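As an aside, that recommendation makes the byte order mechanically
detectable. A minimal sketch in Python (the helper name is mine, for
illustration only):

```python
# Sketch: detect the byte order of a Unicode plain text file from an
# initial U+FEFF byte order mark. Helper name is illustrative.
def detect_byte_order(data: bytes) -> str:
    if data[:2] == b"\xfe\xff":
        return "big-endian"        # U+FEFF stored MSB-first, as recommended
    if data[:2] == b"\xff\xfe":
        return "little-endian"     # 0xFFFE is not a character, so this must
                                   # be a byte-swapped U+FEFF
    return "unmarked"              # no signature; assume the default order

print(detect_byte_order(b"\xfe\xff\x00\x41"))  # big-endian
print(detect_byte_order(b"\xff\xfe\x41\x00"))  # little-endian
```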
But what Ed is addressing here is the standardization of the meaning
of Unicode *plain text*--an issue which should be considered apart
from the instantiation of that plain text in transmissible computer files.
On that point I agree that there are a vast number of issues which
require specification and standardization. And I do believe that the
Unicode Standard is the correct place to address many of them. I've
made the point before that one of the big differences between ISO/IEC
10646 and the Unicode Standard is that 10646 standardizes the encodings
and names of the characters, but that the Unicode Standard goes way
beyond that and attempts to provide enough information (some
normative and some informative) to enable meaningful and transmissible
implementations of Unicode plain text.
Below is Ed's list of leading issues. I've interspersed my comments
indicating what I think the current Unicode Standard's take is on
many of them. (Others may disagree, or may feel that things which
are not covered should be.)
> Some of the leading issues are:
> byte order in storage and transmission
Byte order is addressed by the Unicode Standard.
> line, paragraph, and page breaks
The Unicode Standard specifies LINE SEPARATOR and PARAGRAPH SEPARATOR,
but considers page break to be out of scope.
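For illustration, text carrying the two separators splits unambiguously
(a Python sketch; the standard defines only the characters, not any
particular API):

```python
# Sketch: U+2028 LINE SEPARATOR and U+2029 PARAGRAPH SEPARATOR give
# unambiguous line and paragraph boundaries in Unicode plain text.
LINE_SEP = "\u2028"
PARA_SEP = "\u2029"

text = "first line" + LINE_SEP + "second line" + PARA_SEP + "next paragraph"
paragraphs = text.split(PARA_SEP)
lines = paragraphs[0].split(LINE_SEP)
print(paragraphs)  # ['first line\u2028second line', 'next paragraph']
print(lines)       # ['first line', 'second line']
```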
> BIDI (Hebrew, Arabic, etc.)
The normative bidi algorithm is specified in great detail in
the Unicode Standard.
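The algorithm is driven by each character's normative bidirectional
category in the character database; a quick Python illustration, using
the standard library's copy of that database:

```python
import unicodedata

# Sketch: the bidi algorithm starts from each character's normative
# bidirectional category, queried here from Python's character database.
for ch in ("A", "\u05d0", "1"):  # Latin A, HEBREW LETTER ALEF, a digit
    print(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch))
# U+0041 L   (strong left-to-right)
# U+05D0 R   (strong right-to-left)
# U+0031 EN  (European number)
```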
> non-linear scripts (Indic, Korean, Mongolian, Ethiopian, etc.)
The Unicode Standard considers specification of script behavior to
be part of the desired content of the standard. It doesn't do an
equally detailed accounting of all cases, mostly due to resource
and information constraints. But Devanagari and Tamil script
handling are provided in significant detail as a guide to Indian
script behavior, and there is an extensive discussion of Arabic
script shaping behavior. There is a specification
of normative behavior for Hangul combining jamo. If we could get
equally detailed expert contributions for each complex script,
I expect the inclination of the UTC and the editors would be to
include them in the standard, for everybody's benefit.
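The jamo behavior, at least, is pure arithmetic. A sketch of the
composition formula in Python (constant names follow the standard's
conformance discussion):

```python
# Sketch: the normative arithmetic mapping a leading/vowel/trailing jamo
# triple to its precomposed Hangul syllable.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28

def compose_jamo(l: int, v: int, t: int = T_BASE) -> str:
    """Compose leading consonant l, vowel v, and optional trailing t."""
    l_index, v_index, t_index = l - L_BASE, v - V_BASE, t - T_BASE
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)

# U+1112 HIEUH + U+1161 A + U+11AB NIEUN -> U+D55C, the syllable "han"
print(compose_jamo(0x1112, 0x1161, 0x11AB))
```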
> multiply accented characters (IPA, math, several human languages)
Multiple combining marks are considered an integral part of the Unicode
Standard, and are detailed in both normative and informative sections.
There is a definite gap for math notation, though, and the topic has
been a continuing one for the UTC. The consensus seems to be that we
would like to get a consistent model of plain text math formula
construction stated, to make such information exchangeable in Unicode
plain text.
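The core normative rule for accented letters is already firm: combining
marks follow their base character, and canonically equivalent spellings
denote the same text. A Python sketch, using the normalization machinery
as it exists in current implementations:

```python
import unicodedata

# Sketch: a multiply accented letter spelled as base + combining marks is
# canonically equivalent to its precomposed form.
precomposed = "\u1ead"        # a with circumflex and dot below
decomposed = "a\u0323\u0302"  # a + COMBINING DOT BELOW + COMBINING CIRCUMFLEX
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```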
> compatibility characters
These are now completely specified in the Unicode Standard names list.
> private use characters
Also specified by the standard, although the interpretation of
particular usages of private use characters is, by definition, out
of scope for the standard. But some parties have made an effort to
publish specifications of their own private or corporate repertoires
of private use characters.
> control codes
If you mean by this U+0000..U+001F, U+0080..U+009F, and the
control chimera U+007F, then the Unicode Standard does provide
an answer. It doesn't try to reinvent control function standards,
but it says those characters should be interpreted as if they
were 16-bit analogues of the 8-bit encodings of the corresponding
control functions. Maybe unsatisfying, but probably the best we
can expect, given existing control code usage.
> other deprecated characters
There may be room for improvement here, but the Unicode Standard
has had to tread a little carefully. There are political
consequences in crying out too loudly that xyz are *deprecated*
when xyz may be somebody else's favorite set they lobbied hard
to get in!
> surrogates, especially unpaired surrogate codes
Surrogate usage (in general, as opposed to particular encodings
for surrogate pairs, none of which exist yet) is fully specified
by the Unicode Standard.
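The pair arithmetic itself is simple. A Python sketch of the specified
mapping between a value beyond U+FFFF and its high/low surrogate pair
(helper names are mine):

```python
# Sketch: the specified correspondence between a supplementary value
# (0x10000..0x10FFFF) and its high/low surrogate pair.
def to_surrogates(cp: int) -> tuple:
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

def from_surrogates(high: int, low: int) -> int:
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

print([hex(u) for u in to_surrogates(0x10000)])  # ['0xd800', '0xdc00']
print(hex(from_surrogates(0xDBFF, 0xDFFF)))      # 0x10ffff
```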
> non-character values
As opposed to unassigned character values, there are only two
non-character values in Unicode: 0xFFFE and 0xFFFF. The standard
specifies that 0xFFFE is the illegal byte-swapped version of
U+FEFF. The use of 0xFFFF is deliberately unspecified and is
untransmissible by design.
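That design is what lets a reader resynchronize. A Python sketch,
continuing the byte-order theme (the function name is illustrative):

```python
# Sketch: a reader that sees 0xFFFE as the first 16-bit unit knows the
# stream was written in the opposite byte order and rereads it swapped.
def read_units(data: bytes, order: str) -> list:
    return [int.from_bytes(data[i:i + 2], order)
            for i in range(0, len(data), 2)]

data = b"\xff\xfe\x41\x00"        # little-endian stream with U+FEFF mark
units = read_units(data, "big")   # first try the default (MSB) order
if units[0] == 0xFFFE:            # impossible as a character: swap
    units = read_units(data, "little")
print([hex(u) for u in units])    # ['0xfeff', '0x41']
```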
> text processing algorithms (sorting, upper and lower case, pattern matching)
Default case mapping is provided as an informative part of the
Unicode Standard. Language-specific casing is effectively also
a part of the standard, since everybody knows the few instances
in question: Turkish i, the debatable French accents, German ß, etc.,
and they are discussed in the standard.
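For instance, in Python terms (the Turkish helper is my own
illustration, not anything the standard defines):

```python
# Sketch: default case mapping vs. the well-known language-specific cases.
print("istanbul".upper())     # default mapping: ISTANBUL
print("stra\u00dfe".upper())  # German sharp s uppercases to SS: STRASSE

def upper_turkish(s: str) -> str:
    # Illustrative only: Turkish pairs dotted i with U+0130
    # (LATIN CAPITAL LETTER I WITH DOT ABOVE), not with plain I.
    return s.replace("i", "\u0130").upper()

print(upper_turkish("istanbul"))  # \u0130STANBUL
```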
Beyond that, sorting, pattern matching, etc. are out of scope of
the Unicode Standard (though some implementation guidelines are
provided), and, in my opinion, appropriately belong to other standards.
> Full portability of data requires some rules. If there is no standard,
> users of "Unicode text files" will make every possible choice about each of
> these issues. CRLF will be nothing in comparison. We have begun to see
> programs that can handle CRLF, CR alone, and LF alone, either line-by-line
> or in paragraph format, reading and writing in any option. The range of
> choices for Unicode is far greater, and I don't want to think about how
> long it would take to achieve unity if we don't do it now.
Yes, but... The goal is interchangeable plain text that is legible
when interpreted and rendered in accord with the standard. The goal
is not to force everyone to "spell" multilingual text exactly the
same way. The drafters of the Unicode Standard tried to place normative
requirements on plain text where failure to do so would lead to
complete chaos. Obvious examples are specification that combining
marks must follow (not precede) their base character, and specification
of the complete bidi algorithm. Failure to specify either of these
would clearly have led to uninterpretable gibberish if everyone
made up their own rules, and that was clearly understood by the
members of the Unicode Technical Committee.
But one draws the line somewhere. No one wants to legislate against
people, for example, making cross-linguistic puns in text by
spelling out Russian words with Latin letters, or any other
"inappropriate" or creative usage of the characters at
their disposal, once Unicode implementations become more widely
available. Half the joy of having universal multilingual text
implemented on computers will be seeing what creative and fantastic
new uses millions of users will put it to.
This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:34 EDT