From: Lars Kristan (lars.kristan@hermes.si)
Date: Fri Dec 17 2004 - 10:07:29 CST
Philippe Verdy wrote:
> What Lars wants has a name: it's a "transfer-encoding-syntax", to allow
> transporting any code unit sequences into a more restricted environment.
> This is not a new thing, but this is not specified by Unicode.
Good. It is a known thing, which also means we can draw on previous
experience with transfer-encoding-syntaxes: for example, what the security
implications are and how they can be dealt with.
> But note that any occurrences of U+EE80 to U+EEFF in the original
> NON-UTF-8 "text" are escaped, even though they are valid Unicode. However,
> choosing U+EE80 to U+EEFF is not a problem because these PUAs are very
> unlikely to be present in valid source texts, in the absence of a prior
> PUA-agreement.
And it would be no problem at all if new codepoints were assigned for this
purpose.
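
To make the scheme under discussion concrete, here is a minimal sketch in
Python. It assumes (my choice for illustration, not something fixed in this
thread) that a byte 0xNN in an invalid sequence is escaped as the code point
U+EE00 + 0xNN, which lands in U+EE80..U+EEFF, and that a genuine U+EE80..U+EEFF
in the input is escaped byte by byte as well, as Philippe describes. The
function names are hypothetical.

```python
# Assumption: byte 0xNN (0x80..0xFF) is escaped as code point U+EE00 + 0xNN.
# The offset and the PUA range are illustrative, not standardized.
ESC_BASE = 0xEE00
ESC_FIRST, ESC_LAST = 0xEE80, 0xEEFF


def escape_bytes(chunk: bytes) -> str:
    """Map each byte of chunk to its escape code point."""
    return "".join(chr(ESC_BASE + b) for b in chunk)


def bytes_to_text(data: bytes) -> str:
    """Decode UTF-8, escaping invalid bytes and genuine escape code points."""
    out = []
    i = 0
    while i < len(data):
        # Try the shortest valid UTF-8 sequence starting at position i.
        for n in (1, 2, 3, 4):
            chunk = data[i:i + n]
            try:
                ch = chunk.decode("utf-8")
            except UnicodeDecodeError:
                continue
            if ESC_FIRST <= ord(ch) <= ESC_LAST:
                # Valid Unicode, but it collides with the escape range,
                # so its bytes are escaped one by one.
                out.append(escape_bytes(chunk))
            else:
                out.append(ch)
            i += len(chunk)
            break
        else:
            # No valid sequence starts here: escape this single byte.
            out.append(escape_bytes(data[i:i + 1]))
            i += 1
    return "".join(out)


def text_to_bytes(text: str) -> bytes:
    """Inverse of bytes_to_text: unescape escape code points to raw bytes."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if ESC_FIRST <= cp <= ESC_LAST:
            out.append(cp - ESC_BASE)
        else:
            out += ch.encode("utf-8")
    return bytes(out)
```

With these definitions, text_to_bytes(bytes_to_text(data)) returns the
original bytes for any input. The opposite direction is not lossless in
general: a text that already contains escape code points whose bytes happen to
form valid UTF-8 will come back as ordinary characters, which is one place
where the security implications mentioned above show up.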
> Remember that this is only a Transform-Encoding-Syntax, not a new encoding.
> It does not require ANY new codepoint allocation by Unicode!
But that does not mean there are no benefits in doing so. Escape characters are
always a pain, like your example of "&quot;". OK, the next step is to assign
a new codepoint for this purpose. SBCSs had little room, the need was not
recognised early enough, and even if it had been, people would have used the
escape character simply because they liked the way it displayed. With
(fewer than) 255 glyphs to choose from, people were bound to use them all.
But Unicode has A LOT of codepoints, so it makes sense to do something like
that.
At some point, someone thought of mapping bytes in invalid sequences to
codepoints. They didn't know what to call them (or perhaps called them
replacement characters), but UTC thought such codepoints shouldn't be
assigned. But if we call it a "Transform-Encoding-Syntax" instead of a
"conversion", then they should be called "escape characters" instead of
"replacement characters".
And for the first time in history, you have an escaping method with more
than one escape character. Very efficient. Very compact. Very
straightforward. And Unicode is the one encoding that has both enough
codepoints to afford it and at the same time more need for it than any other
encoding.
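
As a rough illustration of the compactness argument, the sketch below compares
two error handlers that ship with Python 3. Neither is the scheme proposed
here: "backslashreplace" stands in for a classic single-escape-character
method, and "surrogateescape" stands in for the one-escape-code-point-per-byte
idea, although it uses lone surrogates U+DC80..U+DCFF rather than PUA or newly
assigned codepoints. The sample bytes are made up.

```python
# Two bytes in this sample (0xE9 and 0xFF) are not valid UTF-8.
data = b"caf\xe9 \xff"

# Single escape character: each bad byte becomes "\xNN", four code points.
single = data.decode("utf-8", errors="backslashreplace")

# One escape code point per bad byte (Python maps 0xNN to U+DCNN).
multi = data.decode("utf-8", errors="surrogateescape")

print(len(single), len(multi))

# The per-byte escape round-trips to the original bytes.
assert multi.encode("utf-8", errors="surrogateescape") == data

# The single-escape form is ambiguous unless the escape character itself is
# escaped too: input that literally contains "\xe9" decodes to the same text.
literal = b"caf\\xe9 \\xff"
assert literal.decode("utf-8", errors="backslashreplace") == single
```

The last assertion is the usual pain with a lone escape character: either the
escape character itself gets escaped, or the result cannot be reversed.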
One can compare it with MBCSs and say the same thing could have been done
there but wasn't. But actually there was less need for it. Many SBCSs have no
unassigned codepoints, and MBCSs were too busy with their own problems to
worry about cross-compatibility at this level. But Unicode has learned a lot
from mistakes made there, and can be better in every aspect. Shouldn't it
be?
Anyway, if a very good Transform-Encoding-Syntax is devised, UTC could
recognise the fact that everyone would benefit from it. If it means
assigning 128 codepoints, then that is the price. And one can hardly say it
has nothing to do with Unicode. It uses Unicode for transport. And Unicode
can benefit from it itself.
Lars