RE: UTF-8 signature in web and email

From: Bill Kurmey (Bill.Kurmey@v-wave.com)
Date: Fri May 25 2001 - 03:13:36 EDT


Are there not 2 versions of UTF-8, the Unicode Standard (maximum of 4
octets) and the ISO/IEC Annex/Amendment to 10646 (maximum of 6 octets)?
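
To make the two limits in the question concrete, here is a minimal sketch
of an encoder that follows the bit layout tabulated in RFC 2279. The
function name utf8_encode and its max_octets parameter are illustrative
assumptions, not taken from either standard: max_octets=6 accepts values
up to 0x7FFFFFFF, while max_octets=4 models the Unicode restriction to
values up to 0x10FFFF. The sketch does not reject surrogate code points.

```python
def utf8_encode(scalar: int, max_octets: int = 6) -> bytes:
    """Encode one scalar value using the RFC 2279 bit layout.

    max_octets is a hypothetical switch for this sketch:
      6 -> ISO 10646 / RFC 2279 range (up to 0x7FFFFFFF)
      4 -> Unicode range (up to 0x10FFFF)
    """
    limits = {4: 0x10FFFF, 6: 0x7FFFFFFF}
    if max_octets not in limits:
        raise ValueError("max_octets must be 4 or 6")
    if not 0 <= scalar <= limits[max_octets]:
        raise ValueError("scalar value out of range for this form")

    # One octet: the value itself (7 bits).
    if scalar < 0x80:
        return bytes([scalar])

    # Exclusive upper bounds of the 2-, 3-, 4-, 5- and 6-octet forms.
    bounds = [0x800, 0x10000, 0x200000, 0x4000000, 0x80000000]
    for n, bound in enumerate(bounds, start=2):
        if scalar < bound:
            break

    # Lead octet: n one-bits, a zero bit, then the top bits of the value.
    lead_mark = (0xFF << (8 - n)) & 0xFF
    octets = [lead_mark | (scalar >> (6 * (n - 1)))]

    # Trailing octets: 10xxxxxx, six bits of the value each.
    for shift in range(6 * (n - 2), -1, -6):
        octets.append(0x80 | ((scalar >> shift) & 0x3F))
    return bytes(octets)
```

For values up to 0x10FFFF both forms produce identical octets; the two
differ only in whether the 5- and 6-octet sequences for larger values are
permitted at all.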

Is Unicode UTF-8 diverging from ISO by the way in which a scalar value is
encoded in UTF-8? Should folks be concerned that the IETF RFC-2279 and
RFC-2781 refer to UTF-8 and UTF-16 as "a transformation format of ISO
10646" with UTF-8 on the Standards Track? Will the discrepancy between the
Unicode and ISO versions be synchronized?

I realize that in practical terms the discrepancy may be only common sense
at this time until substantially more scalar values are assigned, but would
it not become a concern if ISO decides to retain its original method of
assigning 1-6 octets as specified in RFC-2279?

Finally, which method of encoding a scalar value in UTF-8 are Internet
software developers using: the ISO method as specified in the RFCs,
possibly with the Unicode form as an optional variant subset?

On some of the historical comments:

At 07:30 AM 5/23/01 -0400, John Cowan wrote:
The ambiguity of 0x0A as "line feed" versus "new line" was present from the
beginning: at least some Teletypes had a mode to treat 0x0A as "new line".

At 02:51 PM 5/23/01 +0200, Marco Cimarosti wrote:
The fathers of Unix used a teletype with a "non standard" mode turned on,
and assumed that any other device worked the same as *their* teletype with
*their* settings.

They haven't even considered following any standard: simply tried a sequence
on their machine and liked what happened. And they didn't realize that it
worked just because that machine had an hack to send one byte less per line!

Sending one byte less per line, on a minicomputer with a Teletype as its
only input/output device, saves about half an hour of printing time on a
single assembly/compiler listing of 13000 lines of code, the size of the
Unix source code in the PDP-11 version released to academic institutions.
I don't think Ken Thompson had a choice. As I recall, the first Unix system
was developed on an unused DEC PDP-7 minicomputer at Bell Laboratories
by Ken Thompson in 1969. That original version was later ported to the
PDP-11, where it was rewritten in C, which was also developed on the PDP-11.
C was developed to replace the programming language B, which in turn
replaced the programming language BCPL.
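
As a rough check on the printing-time figure above, the saving can be
worked out directly. This is only a sketch: the rate of 10 characters per
second is an assumption (typical of a 110 baud Teletype), and the function
name seconds_saved is made up for the illustration.

```python
LINES = 13000            # listing size quoted in the message
CHARS_PER_SECOND = 10    # assumption: a 10 characters/second Teletype


def seconds_saved(lines, octets_saved_per_line=1, cps=CHARS_PER_SECOND):
    """Printing time saved by sending fewer octets at the end of each line."""
    return lines * octets_saved_per_line / cps


minutes = seconds_saved(LINES) / 60
print(f"{minutes:.1f} minutes saved per listing")  # prints 21.7 minutes saved per listing
```

Under that assumption the saving is a little over 20 minutes per listing,
the same order of magnitude as the half hour mentioned; a slower device
would stretch it further.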

Many of the control codes had a physical implementation in hardware (later
firmware). I think the Friden Flexowriter, for example, used only one code
as a "line break" character punched into paper tape, for which it performed
two functions on output while reading paper tape: a physical carriage
return and a paper advance. This avoided patent infringement on the
AT&T "Teletype" devices.

I don't think the 0x0A was a design choice for Unix; it was simply the way
DEC distinguished its hardware and software from other manufacturers', for
reasons similar to Friden's: avoiding patent infringement and/or trademark
litigation. However, C was ported from the PDP-11 to the IBM System/370,
Honeywell 6000, and Interdata 8/32, while Unix was ported only to the
Interdata 8/32, and later development, System V and Berkeley Unix, occurred
primarily on DEC systems. Then came the proliferation into ULTRIX, HP-UX,
AIX, UTX, and XENIX, all "standardized" variants of Unix from different
manufacturers.

While on an historical note, Remington Rand had a typewriter-like device
called a Unityper which recorded characters directly onto a reel of
metallic magnetic tape which could then be read from a tape drive on the
Univac I. It also used only one code as a "line break" character and was
also hard-wired to produce a physical carriage return and paper advance
when a "line break" character was encountered while reading a tape.

The Friden Flexowriter and the Remington Rand Unityper were two of the very
few devices for preparing text with upper and lower case ASCII characters,
with Shift-In and Shift-Out codes hard-wired to activate the keyboard shift
keys. Both predated the IBM 029 key punch, the IBM 870 Document Writing
System, and System/360. CJK folks might like to check out the IBM
archives on the Sinowriter, a modified IBM typewriter with stroke/radical
keys which could be "overstruck", generating a variable number of
"backspace+character" sequences on paper tape.

At 10:02 AM 5/23/01 +0900, Martin Duerst wrote:
"And it can still be done!"

Correction please, has been done, but using .t8f rather than .t8t. 8-)
Also am using .t8i DOS files for the 6 octet version as referenced in
RFC-2279.

For those folks under the impression that DOS users are browser deprived,
check out www.arachne.cz.

Bill Kurmey, Edmonton, AB, Canada


