RE: 32'nd bit & UTF-8

From: Jon Hanna (jon@hackcraft.net)
Date: Wed Jan 19 2005 - 04:47:00 CST


    > >> According to <http://www.cl.cam.ac.uk/~mgk25/unicode.html>,
    > >> UTF is short for
    > >> UCS Transformation Format, where UCS stands for Universal
    > >> Character Set.
    > >
    > > Minor incidences of a website being out of date aren't really relevant here
    > > unless the website is unicode.org.
    >
    > This is quite hard to interpret.

    The document you give a URL to is wrong. UTF used to stand for both "UCS Transformation Format" and "Unicode Transformation Format" in different contexts (it was defined separately by ISO and Unicode, but those two definitions match). UTF is now just a name; "UCS Transformation Format" is an etymology of that name, but no longer an expansion of an initialism. It is a minor point, however.

    > A lexer generator like Flex does not process Unicode directly, it generates
    > a lexer that processes bytes. And the question is how to get it to emulate
    > a Unicode lexer, if the fellow who writes the input grammar so wishes. Then
    > it is up to the lexer writer, not the developer of the Flex program, to decide
    > what happens in the case of illegal Unicode numbers.

    This is why your suggestion that UTF-8 be redefined, or a sister format defined, is pointless. UTF-8 encodes Unicode characters into octets. Unless you are going from octets to characters or from characters to octets, you are not using it.
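
    A minimal Python sketch of that distinction. The sample string and variable names are illustrative assumptions, not anything from this thread. The first two steps cross the character/octet boundary and so use UTF-8; the last step works purely on octets, much as a byte-oriented lexer would, and never decodes anything.

```python
# Illustrative sample text; any string containing non-ASCII characters would do.
text = "naïve/€"

# Characters -> octets: this step uses UTF-8.
octets = text.encode("utf-8")

# Octets -> characters: this step uses UTF-8 as well.
round_trip = octets.decode("utf-8")

# Octet-level processing: split on the ASCII octet for "/" without decoding.
# This is safe because octets below 0x80 never occur inside a multi-octet
# UTF-8 sequence.
left, right = octets.split(b"/")
```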

    Something working at the octet level on UTF-8 data can do so in a variety of ways, including interpreting it as part of a superset that treats illegal values in the manner you describe, as long as it keeps whatever promises are documented. For example, if you document that it will not let illegal values through, then it must not do so. Whether it does that by identifying illegal sequences directly, or by interpreting all sequences and then rejecting values that are outside the range 0-10FFFF₁₆ or are too low for the number of octets it took to produce them (i.e. "overlong" sequences), doesn't matter.
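
    The second approach (interpret everything first, reject afterwards) might be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `decode_permissive`, `is_legal` and `decode_strict` are hypothetical, the permissive pass handles only the 1- to 4-octet forms, and the surrogate check goes beyond the two conditions named above (surrogate code points are also ill-formed in UTF-8).

```python
# Smallest value that legitimately needs each sequence length.
MIN_VALUE = {1: 0x0, 2: 0x80, 3: 0x800, 4: 0x10000}


def decode_permissive(data: bytes) -> list[tuple[int, int]]:
    """Interpret every structurally valid sequence as (value, length).

    Overlong forms and values above 10FFFF are let through at this stage.
    """
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            length, value = 1, lead
        elif 0xC0 <= lead <= 0xDF:
            length, value = 2, lead & 0x1F
        elif 0xE0 <= lead <= 0xEF:
            length, value = 3, lead & 0x0F
        elif 0xF0 <= lead <= 0xF7:
            length, value = 4, lead & 0x07
        else:
            raise ValueError(f"invalid lead octet at offset {i}")
        if i + length > len(data):
            raise ValueError(f"truncated sequence at offset {i}")
        for octet in data[i + 1:i + length]:
            if octet & 0xC0 != 0x80:
                raise ValueError(f"bad continuation octet after offset {i}")
            value = (value << 6) | (octet & 0x3F)
        out.append((value, length))
        i += length
    return out


def is_legal(value: int, length: int) -> bool:
    """Apply the documented promise after the fact."""
    if value > 0x10FFFF:             # outside the range 0-10FFFF
        return False
    if value < MIN_VALUE[length]:    # too low for its length: "overlong"
        return False
    if 0xD800 <= value <= 0xDFFF:    # surrogates (added assumption, see above)
        return False
    return True


def decode_strict(data: bytes) -> list[int]:
    """Decode permissively, then reject anything illegal."""
    values = []
    for value, length in decode_permissive(data):
        if not is_legal(value, length):
            raise ValueError(f"illegal value {value:#x} in {length} octet(s)")
        values.append(value)
    return values
```

    A caller sees only the behaviour of `decode_strict`; whether the rejection happens during or after interpretation is an internal detail, which is the point being made above.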

    > 32 bits is the smallest natural computer alignment for 21 bits, and will
    > therefore have to be dealt with anyhow when writing a lexer.

    Computers whose word sizes are not multiples of 8 bits are by no means unheard of, 36-bit words in particular.

    My point isn't (just) to be pedantic about word sizes, but that I think you are focused on your particular needs, whereas Unicode has a responsibility to produce technologies that can cater for your needs but also those of others.

    Regards,
    Jon Hanna
    Work: <http://www.selkieweb.com/>
    Play: <http://www.hackcraft.net/>
    Chat: <irc://irc.freenode.net/selkie>


