Re: 32'nd bit & UTF-8

From: Arcane Jill (arcanejill@ramonsky.com)
Date: Thu Jan 20 2005 - 02:40:29 CST


    -----Original Message(s)-----
    From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]
    On Behalf Of Hans Aberg
    Subject: Re: 32'nd bit & UTF-8

    Wow - so much to which to reply. Where to start? I know, let's do this one
    first:

    > The Unicode standard is like Big Brother in George Orwell's "1984",
    > making it possible to only speak about what is right, but not what is
    > wrong.

    Clearly not. The truth is that the Unicode standard doesn't give a damn what
    you do - it only cares about what you /call/ it. Thus, you can use any form of
    encoding you like - so long as you don't call it "UTF-8". Similarly, you are
    absolutely free to ignore BOMs - so long as you don't claim to be a Unicode
    Conformant Process.

    So, for example, there is absolutely nothing wrong with the unix "cat" utility,
    which concatenates binary files. No-one is going to argue. But what you mustn't
    do is claim that "cat" is Unicode Conformant. And that's totally fair enough.
    In England, one is not allowed to call a drink "orange juice" unless it's
    actually got oranges in it! Without the oranges, you have to come up with some
    other consumer-fooling phrase like "orange flavored drink" or some such. And
    quite rightly, too - otherwise you'd be just plain misleading the public.
    Bottom line, if you want to use the more impressive name, you'd better put some
    oranges in, or stop complaining.

    Of course, it would be the work of ten minutes to write a Unicode Conformant
    version of "cat". Let's call it "ucat". <sarcasm>or "uct", in case "ucat" is
    not sufficiently terse for the Unix world</sarcasm>. "ucat" would concatenate
    text files, not binary files. It would correctly interpret BOMs. It would be
    able, for example, to concatenate three files, one UTF-8, one UTF-16LE,
    one UTF-32BE, and generate output in whatever UTF was desired. But good old
    fashioned "cat" would still exist, and would still do the job it's always done,
    which is to concatenate binary files. So no-one is dictating anything except
    nomenclature - and personally I find that a very, very good thing.
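
    For the curious, here is a minimal sketch in Python of what such a "ucat" might
    look like. The name, the BOM table, the sniffing order and the UTF-8 fallback
    below are my own illustrative assumptions, not an existing tool or a definitive
    implementation.

        #!/usr/bin/env python3
        # Hypothetical "ucat" sketch: concatenate *text* files that may each carry a
        # different Unicode encoding form, and emit the result in one chosen UTF.
        import sys

        # BOM signatures, longest first so a UTF-32LE file is not mistaken for UTF-16LE.
        BOMS = [
            (b"\x00\x00\xfe\xff", "utf-32-be"),
            (b"\xff\xfe\x00\x00", "utf-32-le"),
            (b"\xef\xbb\xbf",     "utf-8"),
            (b"\xfe\xff",         "utf-16-be"),
            (b"\xff\xfe",         "utf-16-le"),
        ]

        def sniff(data):
            """Return (encoding, payload without its BOM); default to UTF-8 if no BOM."""
            for bom, enc in BOMS:
                if data.startswith(bom):
                    return enc, data[len(bom):]
            return "utf-8", data

        def ucat(paths, out_encoding="utf-8"):
            pieces = []
            for path in paths:
                with open(path, "rb") as f:
                    enc, payload = sniff(f.read())
                pieces.append(payload.decode(enc))
            return "".join(pieces).encode(out_encoding)

        if __name__ == "__main__":
            sys.stdout.buffer.write(ucat(sys.argv[1:]))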

    > BOM's screw up the UNIX platforms,
    > so it is not going to be honored there anyway.

    > The problem is that UNIX software looks at the first bytes to determine if
    > it is a shell script.

    As noted above, so long as such software does not claim to be Unicode
    Conformant, who cares? Ah - but wait. What if there are users out there
    demanding Unicode Conformant software? Hmmm...

    > And lexers that are made for ASCII
    > data will most likely treat a BOM as an error.

    Quite rightly so. A BOM /is/ an error in ASCII, as is /any/ character beyond
    U+007F. Lexers, or indeed /any/ software made purely for the seven-bit-wide
    standard that is ASCII, can't be expected to work correctly if bytes 0x80 to
    0xFF are present in the stream.

    Or did you mean "lexers that are made for 8-bit character sets which are
    supersets of ASCII and trivially encoded"?
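
    Either way, here is a small Python illustration of the two points above: why a
    UTF-8 BOM in front of "#!" defeats shebang detection, and why a pure-ASCII
    consumer has to treat bytes above 0x7F as errors. The helper names are invented
    for the example.

        UTF8_BOM = b"\xef\xbb\xbf"

        def has_shebang(first_bytes):
            # The kernel looks at the literal first two bytes; a BOM in front of
            # "#!" means the file is no longer recognised as a script.
            return first_bytes.startswith(b"#!")

        def is_plain_ascii(data):
            # ASCII is seven bits wide; anything in 0x80..0xFF is simply an error there.
            return all(b < 0x80 for b in data)

        script = UTF8_BOM + b"#!/bin/sh\necho hello\n"
        print(has_shebang(script))       # False: the BOM hides the "#!"
        print(is_plain_ascii(UTF8_BOM))  # False: 0xEF, 0xBB, 0xBF are not ASCII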

    > The main point is that BOM will not be specially treated in the UNIX world,
    > regardless what Unicode says. So I guess MS does not want its text files to
    > be read in the UNIX world. Unicode has made the mistake of favoring a
    > special platform over all the others.

    It would be more accurate to say that Unicode Conformant Processes often do not
    care if non-Unicode-Conformant Processes can't read them. Unicode has therefore
    "made the mistake" of favoring processes that conform the Unicode Standard over
    those that don't. And this is a problem because...?

    This one got me:

    > So it is clear that MS somehow has tricked Unicode to adopt an in-house file
    > format as part of the UTF-8 encoding, at the expense of other platforms.
    > Unicode might lose prestige here, favoring one platform over all others.

    Er ... what?
    Sorry, I don't understand. I'm certain you're wrong though. I find Windows
    support of Unicode to be laughable.

    > The problem is that platforms such as UNIX use different methods to
    > determine file encodings than file contents, and there are other problems
    > with it, see <http://www.cl.cam.ac.uk/~mgk25/unicode.html>

    I am not clear why you keep citing this web page. It is not definitive. This
    one is: http://www.unicode.org.

    The "locale" notion, as Lars made plain to us last year, imposes a limitation
    that one cannot correctly interpret two different documents having different
    encodings in the same "locale". This, to me, sucks. Again, as Lars pointed out,
    it makes changing one's "locale" to UTF-8 a real pain in the butt. I hope other
    solutions will eventually be found.
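
    One pragmatic workaround, sketched below in Python with invented file names, is
    to carry the encoding with each document and decode explicitly, instead of
    letting a single process-wide locale decide for every file at once.

        def read_with_encoding(path, encoding):
            # Open each file with its own declared encoding, independent of locale.
            with open(path, "r", encoding=encoding) as f:
                return f.read()

        # latin1_doc = read_with_encoding("notes-latin1.txt", "iso-8859-1")
        # utf8_doc   = read_with_encoding("notes-utf8.txt", "utf-8")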

    > One might give a purely mathematical definition of a Unicode character,
    > freed from any computer representation, as a pair (k, s), where k is an
    > integer, and s is string, or finite list, of elements from the set S := {A,
    > ..., Z, ' '} (i.e., an element in the free monoid on the set S). Then, UTF-8
    > defines a function f: (k, s) |-> (b, s), where b is a finite sequence of
    > bytes (math definition omitted), where k in [0, 0x10FFFF]. The
    > transformation I spoke about is a function g: k |-> b, where k in [0,
    > 2^32-1] such that f(k, s(k)) = (g(k), s(k)) when k in [0, 0x10FFFF].

    One might indeed, but, astonishingly, this has already been done, which is why
    people are arguing with you on this one. I've been lurking on this newsgroup
    for a while now, posting occasionally, and making a fool of myself more often
    than not. And one thing I've learned is that YOU HAVE TO USE THE JARGON.
    There's no way round it. You have to use the terms defined in the web page
    http://www.unicode.org/glossary/, and the document
    http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf, or else you WILL be
    misunderstood. UTF-8 is called a "Unicode Encoding Form". It is a mapping that
    assigns to each "Unicode Scalar Value" a unique "Code Unit Sequence" of bytes.
    If you get the jargon right, you'll get a lot less
    argument. Like I said at the start - what you call things is very important to
    Unicode. Like any specialist jargon, the /intention/ is to enable people to be
    clear, precise and unambiguous. No technical vocabulary ever /intends/ to
    divide the world into those-who-know-it (one of us) and those-who-don't
    (obviously an outsider), but it happens, as it does with medicine, physics,
    biology, whatever. It's not on purpose, but that's life. On this list, however,
    I recommend that you take the trouble to read the definitions I just cited. It
    will make your words clearer to others (as well as others' words clearer to
    you).
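
    To make that last point concrete, here is a minimal sketch, in Python, of the
    mapping the UTF-8 encoding form defines: each Unicode Scalar Value in
    [0, 0x10FFFF], surrogates excluded, goes to a unique code unit sequence of one
    to four bytes. This is an illustration only - the normative definition is in
    chapter 3 of the standard.

        def utf8_encode_scalar(k):
            # Map one Unicode scalar value to its UTF-8 byte sequence.
            if not (0 <= k <= 0x10FFFF) or 0xD800 <= k <= 0xDFFF:
                raise ValueError("not a Unicode scalar value")
            if k <= 0x7F:
                return bytes([k])
            if k <= 0x7FF:
                return bytes([0xC0 | (k >> 6), 0x80 | (k & 0x3F)])
            if k <= 0xFFFF:
                return bytes([0xE0 | (k >> 12), 0x80 | ((k >> 6) & 0x3F), 0x80 | (k & 0x3F)])
            return bytes([0xF0 | (k >> 18), 0x80 | ((k >> 12) & 0x3F),
                          0x80 | ((k >> 6) & 0x3F), 0x80 | (k & 0x3F)])

        assert utf8_encode_scalar(0x20AC) == "\u20ac".encode("utf-8")      # U+20AC EURO SIGN
        assert utf8_encode_scalar(0x10FFFF) == "\U0010FFFF".encode("utf-8")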

    Jill


