Re: 32'nd bit & UTF-8

From: Arcane Jill (arcanejill@ramonsky.com)
Date: Thu Jan 20 2005 - 02:40:29 CST


    -----Original Message(s)-----
    From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]
    On Behalf Of Hans Aberg
    Subject: Re: 32'nd bit & UTF-8

    Wow - so much to which to reply. Where to start? I know, let's do this one
    first:

    > The Unicode standard is like Big Brother in George Orwell's "1984",
    > making it possible to only speak about what is right, but not what is
    > wrong.

    Clearly not. The truth is that the Unicode standard doesn't give a damn what
    you do - it only cares about what you /call/ it. Thus, you can use any form of
    encoding you like - so long as you don't call it "UTF-8". Similarly, you are
    absolutely free to ignore BOMs - so long as you don't claim to be a Unicode
    Conformant Process.

    So, for example, there is absolutely nothing wrong with the unix "cat" utility,
    which concatenates binary files. No-one is going to argue. But what you mustn't
    do is claim that "cat" is Unicode Conformant. And that's totally fair enough.
    In England, one is not allowed to call a drink "orange juice" unless it's
    actually got oranges in it! Without the oranges, you have to come up with some
    other consumer-fooling phrase like "orange flavored drink" or some such. And
    quite rightly, too - otherwise you'd be just plain misleading the public.
    Bottom line, if you want to use the more impressive name, you'd better put some
    oranges in, or stop complaining.

    Of course, it would be the work of ten minutes to write a Unicode Conformant
    version of "cat". Let's call it "ucat". <sarcasm>or "uct", in case "ucat" is
    not sufficiently terse for the Unix world</sarcasm>. "ucat" would concatenate
    text files, not binary files. It would correctly interpret BOMs. It would be
    able, for example, to concatenate three files, one UTF-8, one UTF-16LE,
    one UTF-32BE, and generate output in whatever UTF was desired. But good old
    fashioned "cat" would still exist, and would still do the job it's always done,
    which is to concatenate binary files. So no-one is dictating anything except
    nomenclature - and personally I find that a very, very good thing.
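
    For the curious, here is a minimal sketch in Python of what such a "ucat" might
    look like. The name, the BOM table, the sniffing order and the UTF-8 fallback
    below are my own illustrative assumptions, not an existing tool or a definitive
    implementation.

        #!/usr/bin/env python3
        # Hypothetical "ucat" sketch: concatenate *text* files that may each carry a
        # different Unicode encoding form, and emit the result in one chosen UTF.
        import sys

        # BOM signatures, longest first so a UTF-32LE file is not mistaken for UTF-16LE.
        BOMS = [
            (b"\x00\x00\xfe\xff", "utf-32-be"),
            (b"\xff\xfe\x00\x00", "utf-32-le"),
            (b"\xef\xbb\xbf",     "utf-8"),
            (b"\xfe\xff",         "utf-16-be"),
            (b"\xff\xfe",         "utf-16-le"),
        ]

        def sniff(data):
            """Return (encoding, payload without its BOM); default to UTF-8 if no BOM."""
            for bom, enc in BOMS:
                if data.startswith(bom):
                    return enc, data[len(bom):]
            return "utf-8", data

        def ucat(paths, out_encoding="utf-8"):
            pieces = []
            for path in paths:
                with open(path, "rb") as f:
                    enc, payload = sniff(f.read())
                pieces.append(payload.decode(enc))
            return "".join(pieces).encode(out_encoding)

        if __name__ == "__main__":
            sys.stdout.buffer.write(ucat(sys.argv[1:]))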

    > BOM's screw up the UNIX platforms,
    > so it is not going to be honored there anyway.

    > The problem is that UNIX software looks at the first bytes to determine if
    > it is a shell script.

    As noted above, so long as such software does not claim to be Unicode
    Conformant, who cares? Ah - but wait. What if there are users out there
    demanding Unicode Conformant software? Hmmm...

    > And lexers that are made for ASCII
    > data will most likely treat a BOM as an error.

    Quite rightly so. A BOM /is/ an error in ASCII, as is /any/ character beyond
    U+007F. Lexers, or indeed /any/ software made purely for the seven-bit-wide
    standard that is ASCII, can't be expected to work correctly if bytes 0x80 to
    0xFF are present in the stream.

    Or did you mean "lexers that are made for 8-bit character sets which are
    supersets of ASCII and trivially encoded"?
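
    Either way, here is a small Python illustration of the two points above: why a
    UTF-8 BOM in front of "#!" defeats shebang detection, and why a pure-ASCII
    consumer has to treat bytes above 0x7F as errors. The helper names are invented
    for the example.

        UTF8_BOM = b"\xef\xbb\xbf"

        def has_shebang(first_bytes):
            # The kernel looks at the literal first two bytes; a BOM in front of
            # "#!" means the file is no longer recognised as a script.
            return first_bytes.startswith(b"#!")

        def is_plain_ascii(data):
            # ASCII is seven bits wide; anything in 0x80..0xFF is simply an error there.
            return all(b < 0x80 for b in data)

        script = UTF8_BOM + b"#!/bin/sh\necho hello\n"
        print(has_shebang(script))       # False: the BOM hides the "#!"
        print(is_plain_ascii(UTF8_BOM))  # False: 0xEF, 0xBB, 0xBF are not ASCII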

    > The main point is that BOM will not be specially treated in the UNIX world,
    > regardless what Unicode says. So I guess MS does not want its text files to
    > be read in the UNIX world. Unicode has made the mistake of favoring a
    > special platform over all the others.

    It would be more accurate to say that Unicode Conformant Processes often do not
    care if non-Unicode-Conformant Processes can't read them. Unicode has therefore
    "made the mistake" of favoring processes that conform the Unicode Standard over
    those that don't. And this is a problem because...?

    This one got me:

    > So it is clear that MS somehow has tricked Unicode to adopt an in-house file
    > format as part of the UTF-8 encoding, at the expense of other platforms.
    > Unicode might lose prestige here, favoring one platform over all others.

    Er ... what?
    Sorry, I don't understand. I'm certain you're wrong though. I find Windows
    support of Unicode to be laughable.

    > The problem is that platforms such as UNIX use different methods to
    > determine file encodings than file contents, and there are other problems
    > with it, see <http://www.cl.cam.ac.uk/~mgk25/unicode.html>

    I am not clear why you keep citing this web page. It is not definitive. This
    one is: http://www.unicode.org.

    The "locale" notion, as Lars made plain to us last year, imposes a limitation
    that one cannot correctly interpret two different documents having different
    encodings in the same "locale". This, to me, sucks. Again, as Lars pointed out,
    it makes changing one's "locale" to UTF-8 a real pain in the butt. I hope other
    solutions will eventually be found.
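
    One pragmatic workaround, sketched below in Python with invented file names, is
    to carry the encoding with each document and decode explicitly, instead of
    letting a single process-wide locale decide for every file at once.

        def read_with_encoding(path, encoding):
            # Open each file with its own declared encoding, independent of locale.
            with open(path, "r", encoding=encoding) as f:
                return f.read()

        # latin1_doc = read_with_encoding("notes-latin1.txt", "iso-8859-1")
        # utf8_doc   = read_with_encoding("notes-utf8.txt", "utf-8")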

    > One might give a purely mathematical definition of a Unicode character,
    > freed from any computer representation, as a pair (k, s), where k is an
    > integer, and s is string, or finite list, of elements from the set S := {A,
    > ..., Z, ' '} (i.e., an element in the free monoid on the set S). Then, UTF-8
    > defines a function f: (k, s) |-> (b, s), where b is a finite sequence of
    > bytes (math definition omitted), where k in [0, 0x10FFFF]. The
    > transformation I spoke about is a function g: k |-> b, where k in [0,
    > 2^32-1] such that f(k, s(k)) = (g(k), s(k)) when k in [0, 0x10FFFF].

    One might indeed, but, astonishingly, this has already been done, which is why
    people are arguing with you on this one. I've been lurking on this newsgroup
    for a while now, posting occasionally, and making a fool of myself more often
    than not. And one thing I've learned is that YOU HAVE TO USE THE JARGON.
    There's no way round it. You have to use the terms defined in the web page
    http://www.unicode.org/glossary/, and the document
    http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf, or else you WILL be
    misunderstood. UTF-8 is called a "Unicode Encoding Form". It is a mapping that
    assigns to each "Unicode Scalar Value" a unique "Code Unit Sequence" of bytes.
    If you get the jargon right, you'll get a lot less
    argument. Like I said at the start - what you call things is very important to
    Unicode. Like any specialist jargon, the /intention/ is to enable people to be
    clear, precise and unambiguous. No technical vocabulary ever /intends/ to
    divide the world into those-who-know-it (one of us) and those-who-don't
    (obviously an outsider), but it happens, as it does with medicine, physics,
    biology, whatever. It's not on purpose, but that's life. On this list, however,
    I recommend that you take the trouble to read the definitions I just cited. It
    will make your words clearer to others (as well as others' words clearer to
    you).
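
    To make that last point concrete, here is a minimal sketch, in Python, of the
    mapping the UTF-8 encoding form defines: each Unicode Scalar Value in
    [0, 0x10FFFF], surrogates excluded, goes to a unique code unit sequence of one
    to four bytes. This is an illustration only - the normative definition is in
    chapter 3 of the standard.

        def utf8_encode_scalar(k):
            # Map one Unicode scalar value to its UTF-8 byte sequence.
            if not (0 <= k <= 0x10FFFF) or 0xD800 <= k <= 0xDFFF:
                raise ValueError("not a Unicode scalar value")
            if k <= 0x7F:
                return bytes([k])
            if k <= 0x7FF:
                return bytes([0xC0 | (k >> 6), 0x80 | (k & 0x3F)])
            if k <= 0xFFFF:
                return bytes([0xE0 | (k >> 12), 0x80 | ((k >> 6) & 0x3F), 0x80 | (k & 0x3F)])
            return bytes([0xF0 | (k >> 18), 0x80 | ((k >> 12) & 0x3F),
                          0x80 | ((k >> 6) & 0x3F), 0x80 | (k & 0x3F)])

        assert utf8_encode_scalar(0x20AC) == "\u20ac".encode("utf-8")      # U+20AC EURO SIGN
        assert utf8_encode_scalar(0x10FFFF) == "\U0010FFFF".encode("utf-8")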

    Jill


