Re: Does Unicode 4.1 change NFC?

From: Marcin 'Qrczak' Kowalczyk (qrczak@knm.org.pl)
Date: Tue Apr 05 2005 - 03:09:09 CST

    "Arcane Jill" <arcanejill@ramonsky.com> writes:

    > In particular, I have played around with writing code-generators, of
    > the ilk which Ken mentioned in another post on this thread, and I
    > /never/ assumed that all (or indeed, any) generated codepoints would
    > be 16-bits wide. That would be a really dumb thing to do. Why is
    > anyone even mentioning this as a possibility?

    Since the code produced by my generator is embedded in every program
    compiled by my compiler, the primary goal is small data and code size.
    I can live with updating the code when a UCD update breaks some of
    its assumptions.

    I only mean the tables which hold raw decomposition data. Strings are
    represented in ISO-8859-1 and UTF-32; there is no BMP bias in the
    interfaces, only in some internally used tables.

    The representation I used before for canonical decomposition:
    - An array of 256 pointers to arrays of 256 pairs of 16-bit words
      gives the decompositions of BMP characters. A pair is 0,0 for no
      decomposition, X,0 for a single-char decomposition and X,Y for
      a two-char decomposition. All-zero pages are shared.
    - An array of 32-bit words gives the single-character decompositions
      of 542 characters starting from U+2F800.
    - The remaining 13 characters with decompositions are handled by
      a switch statement in the code.
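
    The first item above can be sketched roughly as follows. This is an
    illustration in Python, not the generated code itself; the names
    build_pages, decompose_bmp and EMPTY_PAGE are hypothetical, and the
    sample data covers only three characters.

```python
from array import array

PAGE_SIZE = 256

# One shared page for every 256-character block without decompositions.
# Each character occupies two 16-bit slots: (first, second).
EMPTY_PAGE = array("H", [0] * (2 * PAGE_SIZE))


def build_pages(decomps):
    """Build the 256-entry page table from a dict that maps a BMP code
    point to a tuple of one or two BMP code points."""
    pages = [EMPTY_PAGE] * 256
    for cp, parts in decomps.items():
        hi, lo = cp >> 8, cp & 0xFF
        if pages[hi] is EMPTY_PAGE:
            # First entry in this block: give it a private page.
            pages[hi] = array("H", [0] * (2 * PAGE_SIZE))
        first, second = (parts + (0,))[:2]  # pad X to X,0
        pages[hi][2 * lo] = first
        pages[hi][2 * lo + 1] = second
    return pages


def decompose_bmp(pages, cp):
    """Return the decomposition of a BMP code point, or None."""
    page = pages[cp >> 8]
    first = page[2 * (cp & 0xFF)]
    second = page[2 * (cp & 0xFF) + 1]
    if first == 0:
        return None          # 0,0: no decomposition
    if second == 0:
        return (first,)      # X,0: single-char decomposition
    return (first, second)   # X,Y: two-char decomposition


# Sample data only: U+00C5, U+00E9 and U+212B.
PAGES = build_pages({
    0x00C5: (0x0041, 0x030A),
    0x00E9: (0x0065, 0x0301),
    0x212B: (0x00C5,),
})
```

    Every block that has no entries points at the same EMPTY_PAGE object,
    which is the page sharing mentioned in the list.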

    A change needed for Unicode 4.1:
    - When 0xFFFF is stored in the place for a single-character
      decomposition, an additional switch statement finds the real
      decomposition. This affects 6 characters.
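
    A minimal sketch of that escape mechanism, again in Python and under
    stated assumptions: TABLE is a flat dict standing in for the paged
    16-bit table, WIDE_DECOMPOSITIONS stands in for the additional switch
    statement, and the escaped entry uses private-use code points as
    placeholder data rather than real UCD values. All names are
    hypothetical.

```python
# U+FFFF is a noncharacter, so it can never be a real decomposition
# target and is free to act as an escape value in a 16-bit slot.
SENTINEL = 0xFFFF

# Stand-in for the paged table: code point -> (first, second).
TABLE = {
    0x00C5: (0x0041, 0x030A),  # two-char decomposition
    0x212B: (0x00C5, 0),       # single-char decomposition
    0xE000: (SENTINEL, 0),     # placeholder: real target needs > 16 bits
}

# Stand-in for the extra switch statement. The entry is placeholder
# data built from private-use code points, not taken from the UCD.
WIDE_DECOMPOSITIONS = {
    0xE000: 0xF0000,
}


def lookup(cp):
    """Return the decomposition of cp as a tuple, or None."""
    first, second = TABLE.get(cp, (0, 0))
    if first == 0:
        return None
    if first == SENTINEL:
        # The 16-bit slot could not hold the target; look it up in
        # the small side table instead.
        return (WIDE_DECOMPOSITIONS[cp],)
    if second == 0:
        return (first,)
    return (first, second)
```

    The main table keeps its 16-bit entries, and only the few escaped
    characters pay for the second lookup.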

    I claim that it was not a bad idea to use 16-bit entries in the
    tables.

    Compatibility decomposition is another story. A decomposition may be
    longer (up to 18 characters), but currently only BMP characters are
    produced. That holds even for the range of 1024 characters (with some
    holes) starting from U+1D400, which are the only non-BMP characters
    having compatibility decompositions. So my code doesn't currently
    include a mechanism for producing non-BMP characters here.

    -- 
       __("<         Marcin Kowalczyk
       \__/       qrczak@knm.org.pl
        ^^     http://qrnik.knm.org.pl/~qrczak/
    

