Re: UTF-8, C1 controls, and UNIX

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Mar 01 2001 - 14:54:29 EST

Next message: Frank da Cruz: "Re: UTF-8, C1 controls, and UNIX"
Previous message: P. T. Rourke: "Re: UTF-8, C1 controls, and UNIX"
Maybe in reply to: Frank da Cruz: "UTF-8, C1 controls, and UNIX"
Next in thread: Frank da Cruz: "Re: UTF-8, C1 controls, and UNIX"
Reply: Frank da Cruz: "Re: UTF-8, C1 controls, and UNIX"

Frank,

> My point is that UTF-8 is not really up to the task it was designed for,
> i.e. transparent usability with hosts that are ignorant of it. In fact it
> was designed only for UNIX (Plan 9), which is why "/" is sacrosanct, and why
> it contains no NULs (because of C).

I don't understand this part of your rhetoric here. In UTF-8, *ASCII* is
sacrosanct, not just "/".

> The C1 problem was overlooked because
> nobody really considered it. And non-UNIX platforms use lots of characters
> besides "/" in pathname syntax, so even leaving aside the C1 issue, we'd
> need another UTF for VMS, another for VOS, another for DOS and Windows, and
> so on.

Why? "\" and ":" are also "sacrosanct" in UTF-8. No pathname syntax that
I know of is disturbed by UTF-8. (Well, EBCDIC paths, I suppose, but then
ASCII itself would trash EBCDIC paths if you didn't convert.)
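
To make the ASCII-preservation point concrete, here is a minimal sketch in Python. The function names are made up for illustration and are not from any UTF-8 specification. It checks that no byte below 0x80 ever turns up inside a multi-byte sequence, which is why a byte-oriented split on "/", "\" or ":" cannot cut a character in half.

```python
def ascii_bytes_in_multibyte(text: str) -> list[int]:
    """Collect any bytes < 0x80 found inside multi-byte UTF-8 sequences.

    For well-formed UTF-8 this list is always empty: ASCII byte values
    are used only to encode the ASCII characters themselves.
    """
    stray = []
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) > 1:
            stray.extend(b for b in encoded if b < 0x80)
    return stray


def split_path_bytes(path: str, sep: bytes = b"/") -> list[bytes]:
    """Split a path the way a UTF-8-ignorant host would: on raw bytes."""
    return path.encode("utf-8").split(sep)


if __name__ == "__main__":
    sample = "usr/\u65e5\u672c\u8a9e/\u0444\u0430\u0439\u043b"
    print(ascii_bytes_in_multibyte(sample))
    # Each component decodes cleanly, so the byte-level split was safe.
    print([part.decode("utf-8") for part in split_path_bytes(sample)])
```

The same check holds for any separator in the ASCII range, so VMS, DOS, and Windows path syntax are covered by the same property.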

The early description of UTF-8 (FSS-UTF) focussed on "/" because its
predecessor, UTF-1, did not preserve "/". So UTF-8 was a fix for that.

And as for your overall point, I don't know of any claim that UTF-8
was designed for "transparent usability with hosts that are ignorant of
it." The documentation at the time claims the following criteria:

1. Compatibility with historical file systems. (met by ASCII preservation)

2. Compatibility with existing programs. (and by this is meant 8-bit
API usability as strings, as well as ASCII preservation)

3. Easy conversion from/to [16-bit] Unicode.

4. First byte indication of length of trailing byte sequence.

5. Non-extravagance in number of bytes needed for encoding.

6. Local resynching capability.

I think UTF-8 met all those criteria.
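
Criteria 4 and 6 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the lead-byte ranges are those of present-day UTF-8 (at most 4 bytes), whereas the original FSS-UTF definition allowed longer sequences.

```python
def sequence_length(lead: int) -> int:
    """Return the total length of the sequence that a lead byte announces.

    Assumes modern UTF-8 limits (1 to 4 bytes per character).
    """
    if lead < 0x80:
        return 1  # ASCII, stands alone
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    raise ValueError(f"0x{lead:02X} is not a valid UTF-8 lead byte")


def resync(data: bytes, pos: int) -> int:
    """Back up from an arbitrary offset to the start of its character.

    Trail bytes always look like 10xxxxxx, so stepping back over them
    lands on the lead byte without scanning from the start of the data.
    """
    while pos > 0 and (data[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos


if __name__ == "__main__":
    data = "a\u20acb".encode("utf-8")  # 61 E2 82 AC 62
    start = resync(data, 3)            # offset 3 is the last byte of the euro sign
    print(start, sequence_length(data[start]))
```

Because lead bytes and trail bytes occupy disjoint ranges, a reader dropped into the middle of a stream needs to look at no more than a few neighboring bytes to find a character boundary.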

And yes, anybody who participated at the time was perfectly aware
that you couldn't just pump UTF-8 at a terminal or host that was
interpreting C1 control values and expect nothing odd to happen.
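
For anyone who wants to see the C1 overlap directly, here is a small Python sketch (the function name is made up for illustration). UTF-8 trail bytes run from 0x80 to 0xBF, and the lower part of that range, 0x80 to 0x9F, is exactly where an 8-bit terminal expects C1 controls.

```python
def c1_bytes(text: str) -> list[tuple[str, int]]:
    """List (character, byte) pairs whose UTF-8 bytes fall in 0x80..0x9F.

    A terminal or host that interprets 8-bit C1 controls would act on
    these bytes rather than pass them through as data.
    """
    hits = []
    for ch in text:
        for b in ch.encode("utf-8"):
            if 0x80 <= b <= 0x9F:
                hits.append((ch, b))
    return hits


if __name__ == "__main__":
    # U+00C5 encodes as C3 85, and 0x85 is NEL in C1.
    # U+00DB encodes as C3 9B, and 0x9B is CSI in C1.
    for ch, b in c1_bytes("\u00c5\u00db"):
        print(f"U+{ord(ch):04X} contains byte 0x{b:02X}")
```

Plain ASCII text produces no hits, which is consistent with the ASCII-preservation property; the trouble starts only once non-ASCII characters are sent to a device that still honors 8-bit controls.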

--Ken

>
> - Frank
>
>


This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:19 EDT