Re: UTF-8, C1 controls, and UNIX

From: Frank da Cruz
Date: Wed Feb 28 2001 - 17:56:38 EST

> Maybe one should make a transmission-safe UTF that left C1 alone?
Remember this? --

From: Markus Scherer
To: "Unicode List"
Date: Mon, 10 Apr 2000 15:23:53 -0800 (GMT-0800)
Subject: What if UTF-8 had been defined after UTF-16?

What if UTF-8 had been defined just for the code point range 0..0x10ffff?
What if UTF-8 had been designed to be not just "File-System-Safe" but also

UTF-8 could have had all the nice features that it has now, plus:
- C1 control codes (0x80..0x9f) passed through as single bytes
- no sequences longer than 4 bytes, BMP still covered with 3 bytes
- no checking for code points > 0x10ffff because
  it could have been designed just for that range
- no minimum-length problem -> no security concerns
- all byte values used for some encoding
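
For comparison, the points above can be checked against what standard
UTF-8 does today. The Python sketch below is not part of the original
message and does not show how the hypothetical encoding is laid out; it
only probes the existing UTF-8 codec. The function name utf8_facts is
made up for illustration.

```python
# Illustration only: behaviour of standard UTF-8, not the hypothetical variant.

def utf8_facts():
    facts = {}

    # A C1 control such as U+0085 (NEL) is encoded as two bytes, and the
    # trail byte (0x85) itself falls in the C1 range 0x80..0x9f.
    facts["c1_encoded"] = "\u0085".encode("utf-8")

    # The BMP fits in 3 bytes; the last code point, U+10FFFF, needs 4.
    facts["bmp_max_len"] = len("\uffff".encode("utf-8"))
    facts["max_len"] = len("\U0010ffff".encode("utf-8"))

    # The "minimum-length problem": 0xC0 0x80 is an overlong form of U+0000
    # and a conforming decoder has to reject it.
    try:
        b"\xc0\x80".decode("utf-8")
        facts["overlong_rejected"] = False
    except UnicodeDecodeError:
        facts["overlong_rejected"] = True

    # Collect every byte value that occurs in the encoding of some scalar
    # value; whatever is left over is never used by standard UTF-8.
    used = set()
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not encodable in UTF-8
        used.update(chr(cp).encode("utf-8"))
    facts["unused_bytes"] = sorted(set(range(256)) - used)

    return facts


if __name__ == "__main__":
    for name, value in utf8_facts().items():
        print(name, value)
```

The unused byte values it reports are the ones that the last item in
the list says a redesigned encoding could have put to work.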

It would have been possible. Interested? See .

Note: This is _not_ an approved UTF. I am _not_ proposing this as a new
UTF. This is _not_ compatible with any existing UTF or other Unicode
implementation. It is just a play with bits and bytes, a "what if", a

Just to share a thought -

