Re: 32'nd bit & UTF-8

From: Hans Aberg (haberg@math.su.se)
Date: Wed Jan 19 2005 - 13:56:33 CST


    On 2005/01/19 11:47, Jon Hanna at jon@hackcraft.net wrote:

    >>>> According to <http://www.cl.cam.ac.uk/~mgk25/unicode.html>,
    >>>> UTF is short for
    >>>> UCS Transformation Format, where UCS stands for Universal
    >>>> Character Set.
    >>>
    >>> Minor incidences of a website being out of date aren't
    >> really relevant here
    >>> unless the website is unicode.org.
    >>
    >> This is quite hard to interpret.
    >
    > The document you give a URL to is wrong. UTF used to stand for both "UCS
    > Transformation Format" and "Unicode Transformation Format" in different
    > contexts (it was defined separately by ISO and Unicode, but those two
    > definitions match). UTF is now just a name, UCS Transformation Format is an
    > etymology of that name, but no longer an expansion of an initialism. It is a
    > minor point however.

    Other posters suggest that the truth is more complicated than that, passing
    via a sequence of RFCs. It appears that Unicode has simply clamped onto the
    name and taken it over, without bothering to resolve the history properly.

    >> A lexer generator like Flex does not process Unicode directly; it
    >> generates a lexer that processes bytes. And the question is how to get it
    >> to emulate a Unicode lexer, if the fellow who writes the input grammar so
    >> wishes. Then it is up to the lexer writer, not the developer of the Flex
    >> program, to decide what happens in the case of illegal Unicode numbers.
    >
    > This is why your suggestion that UTF-8 be redefined, or a sister format
    > defined,

    It would in fact be a parent format, of which what Unicode now calls UTF-8
    is a specialization.
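
    (For concreteness, a minimal sketch of such a parent format, following the
    octet patterns of the older RFC 2279/ISO 10646 definition, which covered
    values up to 2^31-1 with sequences of up to six octets. The function name
    utf8_encode_31 is only an illustration of mine, in C:)

        #include <stdint.h>
        #include <stddef.h>

        /* Encode a value 0..0x7FFFFFFF with the octet patterns of the older
           RFC 2279 / ISO 10646 UTF-8 definition (1 to 6 octets).  Current
           Unicode UTF-8 is the restriction of this scheme to 0..0x10FFFF.
           Returns the number of octets written, or 0 if out of range. */
        static size_t utf8_encode_31(uint32_t c, unsigned char out[6])
        {
            if (c < 0x80) {                 /* 0xxxxxxx              (7 bits) */
                out[0] = (unsigned char)c;
                return 1;
            } else if (c < 0x800) {         /* 110xxxxx + 1 trailer (11 bits) */
                out[0] = 0xC0 | (c >> 6);
                out[1] = 0x80 | (c & 0x3F);
                return 2;
            } else if (c < 0x10000) {       /* 1110xxxx + 2 trailers (16 bits) */
                out[0] = 0xE0 | (c >> 12);
                out[1] = 0x80 | ((c >> 6) & 0x3F);
                out[2] = 0x80 | (c & 0x3F);
                return 3;
            } else if (c < 0x200000) {      /* 11110xxx + 3 trailers (21 bits) */
                out[0] = 0xF0 | (c >> 18);
                out[1] = 0x80 | ((c >> 12) & 0x3F);
                out[2] = 0x80 | ((c >> 6) & 0x3F);
                out[3] = 0x80 | (c & 0x3F);
                return 4;
            } else if (c < 0x4000000) {     /* 111110xx + 4 trailers (26 bits) */
                out[0] = 0xF8 | (c >> 24);
                out[1] = 0x80 | ((c >> 18) & 0x3F);
                out[2] = 0x80 | ((c >> 12) & 0x3F);
                out[3] = 0x80 | ((c >> 6) & 0x3F);
                out[4] = 0x80 | (c & 0x3F);
                return 5;
            } else if (c < 0x80000000u) {   /* 1111110x + 5 trailers (31 bits) */
                out[0] = 0xFC | (c >> 30);
                out[1] = 0x80 | ((c >> 24) & 0x3F);
                out[2] = 0x80 | ((c >> 18) & 0x3F);
                out[3] = 0x80 | ((c >> 12) & 0x3F);
                out[4] = 0x80 | ((c >> 6) & 0x3F);
                out[5] = 0x80 | (c & 0x3F);
                return 6;
            }
            return 0;                       /* needs more than 31 bits */
        }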

    > is pointless. UTF-8 encodes Unicode characters into octets. Unless
    > you are going from octets to characters or characters to octets you are not
    > using it.
    >
    > Something working at the octet level on UTF-8 data can do so in a variety of
    > ways, including interpreting it as part of a super-set that treats illegal
    > values in the manner you describe, as long as it maintains whatever promises
    > are documented (e.g. if you document that it will not let illegal values
    > through then it must not do so, though whether that is done by identifying
    > illegal sequences, or by interpreting all sequences and then identifying those
    > that are outside of the range 0-10FFFF (hex) or are too low for the number of
    > octets it took to produce them - i.e. "overlong" sequences - doesn't matter).

    On the contrary, here you indicate the point of my extension: clearly, the
    way I suggest Flex be extended will not deal with Unicode directly, but
    merely facilitate the implementation of a lexer that may or may not
    recognize one or more of the Unicode encodings. It will have to deal with
    those illegal Unicode sequences anyhow, in order to facilitate proper error
    handling, if the fellow who writes a particular lexer so wants.
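
    (As a rough illustration, not part of any actual Flex feature: once a
    byte-level lexer has assembled an n-octet sequence into a value, the kind
    of legality check being talked about could look like this in C, with the
    function name and exact policy of course left to the lexer writer:)

        #include <stdbool.h>
        #include <stdint.h>
        #include <stddef.h>

        /* Given a value decoded from an nbytes-octet sequence, report whether
           the sequence was legal Unicode UTF-8.  Two classes of error from the
           discussion above: values outside 0..0x10FFFF, and "overlong"
           encodings that used more octets than the value requires.  (A real
           lexer would also have to reject the surrogate range 0xD800..0xDFFF
           and malformed trailing octets.) */
        static bool utf8_value_is_legal(uint32_t c, size_t nbytes)
        {
            /* smallest value that genuinely needs 1, 2, 3 or 4 octets */
            static const uint32_t min_for_len[5] = { 0, 0, 0x80, 0x800, 0x10000 };

            if (nbytes < 1 || nbytes > 4)
                return false;       /* Unicode UTF-8 never exceeds 4 octets */
            if (c > 0x10FFFF)
                return false;       /* outside the Unicode code space */
            if (c < min_for_len[nbytes])
                return false;       /* overlong for this sequence length */
            return true;
        }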

    Another reason is that, even if 2^21 is perhaps quite enough for
    human-produced characters, such bounds could easily be broken by
    computer-generated extensions. For example, in Bison, people want to be
    able to generate very large grammars; there the limits on tokens, states,
    etc. are now at 2^31-1.

    >> 32 bits is the smallest natural computer alignment for 21 bits, and will
    >> therefore have to be dealt with anyhow when writing a lexer.
    >
    > Computers that would handle word sizes that are not multiples of 8 are by no
    > means unheard of - 36-bit words in particular.

    I have had long discussions in the C/C++ newsgroups. It turns out that most
    of those architectures are archaic. C99 now supports 32-bit integral types,
    but I do not think there is a 36-bit integral type.
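
    (For reference, the C99 <stdint.h> spelling of such a type, as a small
    sketch; the typedef name is just an example:)

        #include <stdint.h>

        /* C99: int_least32_t/uint_least32_t always exist and hold at least
           32 bits; the exact-width int32_t/uint32_t exist wherever the
           implementation has a suitable 32-bit type.  The standard names no
           36-bit type. */
        typedef uint_least32_t code_unit;   /* wide enough for 21 or 31 bits */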

    > My point isn't (just) to be pedantic about word sizes, but that I think you
    > are focused on your particular needs whereas Unicode has a responsibility to
    > produce technologies that can cater for your needs but also those of others.

    For your argument to pass, you would need to present a computer with a 21-
    to 31-bit word size, on which the 32-bit format would become considerably
    slower. And it would not be a UTF-8 or Unicode encoding, so one does not
    need to use it if one is not processing characters.

      Hans Aberg


