Re: UTF-8, C1 controls, and UNIX

From: Frank da Cruz (fdc@columbia.edu)
Date: Wed Feb 28 2001 - 17:56:38 EST


> Maybe one should make a transmission safe UTF that left C1 alone?
>
Remember this? --

From: Markus Scherer <markus.scherer@jtcsv.com>
To: "Unicode List" <unicode@unicode.org>
Date: Mon, 10 Apr 2000 15:23:53 -0800 (GMT-0800)
Subject: What if UTF-8 had been defined after UTF-16?

What if UTF-8 had been defined just for the code point range 0..0x10ffff?
What if UTF-8 had been designed to be not just "File-System-Safe" but also
"Terminal-Safe"?

UTF-8 could have had all the nice features that it has now, plus:
- C1 control codes (0x80..0x9f) passed through as single bytes
- no sequences longer than 4 bytes, BMP still covered with 3 bytes
- no checking for code points > 0x10ffff because
  it could have been designed just for that range
- no minimum-length problem -> no security concerns
- all byte values used for some encoding
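
For comparison, here is a rough sketch of how standard UTF-8 behaves on
the points listed above, using Python's built-in codec. It does not
implement the variant described in this message; the helper names
utf8_hex and decodes are made up for illustration.

```python
# Standard UTF-8 via Python's built-in codec. This is NOT the
# hypothetical C1-transparent variant; it only shows the behaviour
# that the variant was meant to change.

def utf8_hex(code_point: int) -> str:
    """Return the standard UTF-8 bytes of a code point as a hex string."""
    return chr(code_point).encode("utf-8").hex()

def decodes(data: bytes) -> bool:
    """Return True if a strict UTF-8 decoder accepts the byte string."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# C1 controls (U+0080..U+009F) take two bytes each, so they do not
# pass through as single bytes. U+0085 is NEL (next line).
c1_next_line = utf8_hex(0x85)

# The BMP fits in three bytes; the highest code point needs four.
bmp_replacement_char = utf8_hex(0xFFFD)
highest_code_point = utf8_hex(0x10FFFF)

# A non-minimum-length ("overlong") two-byte form of "/" must be
# rejected by a conforming decoder; this is the security concern.
overlong_slash_accepted = decodes(b"\xc0\xaf")

# A four-byte sequence that would decode to 0x110000 must be
# rejected, so the decoder has to check the upper limit.
beyond_range_accepted = decodes(b"\xf4\x90\x80\x80")

# Some byte values, such as 0xFF, never occur in valid UTF-8.
byte_ff_accepted = decodes(b"\xff")
```

The three rejected inputs correspond to the minimum-length check, the
range check, and the unused byte values that a decoder for the variant
would supposedly not have to deal with.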

It would have been possible. Interested? See
http://www.mindspring.com/~markus.scherer/utf-8c1.html .

Note: This is _not_ an approved UTF. I am _not_ proposing this as a new
UTF. This is _not_ compatible with any existing UTF or other Unicode
implementation. It is just playing with bits and bytes, a "what if", a
"Gedankenexperiment".

Just to share a thought -

markus


