Re: NFC/NFKC Normalization Edge Case

From: Jeff Senn (senn@maya.com)
Date: Tue Sep 29 2009 - 11:45:19 CDT

  • Next message: announcements@unicode.org: "[Unicode Announcement] Unicode Haiku Contest"

    Unless I'm mistaken your analysis misses the case where there can
    be repeated combiners (or even multiple combiners of the same
    class)...e.g.

    NFC(0042 0300 0300 0300 0300 0323) --> 1E04 0300 0300 0300 0300

    I guess there is no theoretical limit, hence the "need" for the
    "Stream-Safe Text Format".
    (see UAX 15 ch20)

    On Sep 29, 2009, at 12:02 PM, verdy_p wrote:

    > As long as you have not reached the end of stream or reached another
    > character that has a lower ccc and that CAN be
    > combined or be reordered with the existing characters in the buffer
    > (up to and including the last starter), you
    > can't presume that your sequence is terminated and you can't flush it.
    >
    > Fortunately, due to the limited number of possible ccc values that
    > exist in the database (and given that your look-
    > ahead buffer will never contain several characters with identical
    > ccc values, so that your buffer will only contain
    > characters sorted in ascending order of ccc values), this
    > effectively limits the needed size for your buffer. In
    > addition, this limit is much lower, given that you can easily
    > predict what is the maximum length of existing
    > sequences after which a further combining character could be
    > combined with the first character of your buffer, or
    > inserted between it and the other combining characters in your
    > buffer :
    >
    > The longest sequences to keep in the buffer would be those starting
    > with a starter S (with ccc=0), with one
    > combining character in each of the existing combining classes (in
    > ascending ccc), pending for a character with the
    > highest ccc value possible that would be the last one.
    >
    > This length is necessarily limited, given that the maximum ccc value
    > possible is 255 (if such combining character is
    > defined, and gets encoded in a canonically combining pair, and if
    > all the 256 values of ccc were allocated, you
    > would need a buffer for up to 255 characters maximum (the first
    > location being used by the starter, all the other
    > 254 locations being used by combining characters of each class, and
    > finally, when reading the last character with
    > ccc=255 that could combine with the first one.
    >
    > Note however that in almost all cases, the used length in the 255-
    > codepoints buffer will be very small, so adding a
    > new character in the buffer should really use a simple insertion
    > sort, or could even use a bubble sort :
    >
    > (1) first compare the ccc value of the new character with the
    > existing ccc value of characters in the buffer in slot
    > [0..N-1] (keep a variable for it) : if it is equal, then the new
    > character is not blocked, and you can output the
    > content of your buffer (you would have to sort it first by ascending
    > ccc value, if you don't use the insertion
    > exposed below for step 3), keeping just the new character in slot[0].
    >
    > (2) first compare the new character with the one in used slot[0] to
    > see if it matches a canonical pair: if it
    > matches, replace the character in slot[0] by the precombined
    > character.
    >
    > (3) otherwise just place the new character in slot[N] (where N is
    > the used length) and increment N.
    >
    > In this last step (3), you may (and should) also want to use a
    > sliding insertion in backward direction (instead of
    > just placing the new character at end of the buffer in slot[N]),
    > creating the equivalent of an insertion sort, to
    > avoid keeping the maximum ccc value within the N characters stored
    > in the buffer : it will even be more efficient
    > than sorting the buffer before outputing it (in step 1), because the
    > buffer length will (almost always) be extremely
    > small with length 0 or 1, in more than 99% of cases (the additional
    > cost of "efficient" sorts like Quicksort is not
    > worth the value on very small vectors that will constitute the huge
    > majority of cases, even with the longest
    > theoretical ones that are MUCH smaller in length than 255, with the
    > existing UCD): insertion sorts are very
    > efficient on very small vectors.
    >
    > If you do this, then the character in slot[0] (if there's one, i.e.
    > when N>0) will be kept in the buffer only :
    > * if it can effectively combine (when it has ccc=0) within an
    > existing canonical pair of the UCD,
    > * or if there are other characters in the UCD with a ccc value
    > distinct from the ccc values of non-starter
    > characters already in your buffer in slot[1 .. N-1].
    >
    > You can also perform the backward sliding for insertion in the same
    > loop used to scan the content of your buffer for
    > comparing ccc values: when you find the location of the insertion,
    > you can immediately place the character at that
    > location, given that the rest of the buffer is already slid:
    >
    > When this is done, the slot[i] of insertion is the one where you
    > just store the codepoint, so every other prior
    > stots in [0..i-1] can be output. Theoretically, you would want to
    > slide back the rest of the buffer so that slot[i]
    > will go back to slot[0] after you have output every character in
    > slot[0 .. i-1], but the simplest is to not slide
    > this content at all, but keep a persistant index of the location
    > with the first used character.
    >
    >> Message du 23/09/09 16:04
    >> De : "Jeff Senn"
    >> A : unicode@unicode.org
    >> Copie à :
    >> Objet : Re: NFC/NFKC Normalization Edge Case
    >>
    >>
    >>
    >> On Sep 22, 2009, at 5:44 PM, Kenneth Whistler wrote:
    >>>>
    >>>> All of these characters have combining class 0. Can they be
    >>>> canonically
    >>>> combined? Even though the 2nd characters are NOT "combining"?
    >>>
    >>> There's the first mistake. Both of the 2nd characters in
    >>> these sequences *ARE* combining:
    >>
    >> Thanks guys. Clear now.
    >>
    >> I was led astray by:
    >>
    >> "D1. A character S is a starter if it has a combining class of zero
    >> in
    >> the Unicode Character Database. Any other character is a non-
    >> starter."
    >>
    >> and
    >>
    >> "In some implementations, people may be working with streaming
    >> interfaces that read and write small amounts at a time. In those
    >> implementations, the text back to the last starter needs to be
    >> buffered. Whenever a second starter would be added to that buffer,
    >> the
    >> buffer can be flushed."
    >>
    >> Which are technically correct (once you know the correct answer),
    >> but led me to suspect I could do slightly more aggressive flushing of
    >> a streamed buffer...
    >>
    >>
    >>
    >>
    >>
    >>
    >>
    >>
    >
    >
    >



    This archive was generated by hypermail 2.1.5 : Tue Sep 29 2009 - 11:50:41 CDT