Re: NFC/NFKC Normalization Edge Case

From: Jeff Senn (
Date: Tue Sep 29 2009 - 11:45:19 CDT

  • Next message: "[Unicode Announcement] Unicode Haiku Contest"

    Unless I'm mistaken your analysis misses the case where there can
    be repeated combiners (or even multiple combiners of the same

    NFC(0042 0300 0300 0300 0300 0323) --> 1E04 0300 0300 0300 0300

    I guess there is no theoretical limit, hence the "need" for the
    "Stream-Safe Text Format".
    (see UAX 15 ch20)

    On Sep 29, 2009, at 12:02 PM, verdy_p wrote:

    > As long as you have not reached the end of stream or reached another
    > character that has a lower ccc and that CAN be
    > combined or be reordered with the existing characters in the buffer
    > (up to and including the last starter), you
    > can't presume that your sequence is terminated and you can't flush it.
    > Fortunately, due to the limited number of possible ccc values that
    > exist in the database (and given that your look-
    > ahead buffer will never contain several characters with identical
    > ccc values, so that your buffer will only contain
    > characters sorted in ascending order of ccc values), this
    > effectively limits the needed size for your buffer. In
    > addition, this limit is much lower, given that you can easily
    > predict what is the maximum length of existing
    > sequences after which a further combining character could be
    > combined with the first character of your buffer, or
    > inserted between it and the other combining characters in your
    > buffer :
    > The longest sequences to keep in the buffer would be those starting
    > with a starter S (with ccc=0), with one
    > combining character in each of the existing combining classes (in
    > ascending ccc), pending for a character with the
    > highest ccc value possible that would be the last one.
    > This length is necessarily limited, given that the maximum ccc value
    > possible is 255 (if such combining character is
    > defined, and gets encoded in a canonically combining pair, and if
    > all the 256 values of ccc were allocated, you
    > would need a buffer for up to 255 characters maximum (the first
    > location being used by the starter, all the other
    > 254 locations being used by combining characters of each class, and
    > finally, when reading the last character with
    > ccc=255 that could combine with the first one.
    > Note however that in almost all cases, the used length in the 255-
    > codepoints buffer will be very small, so adding a
    > new character in the buffer should really use a simple insertion
    > sort, or could even use a bubble sort :
    > (1) first compare the ccc value of the new character with the
    > existing ccc value of characters in the buffer in slot
    > [0..N-1] (keep a variable for it) : if it is equal, then the new
    > character is not blocked, and you can output the
    > content of your buffer (you would have to sort it first by ascending
    > ccc value, if you don't use the insertion
    > exposed below for step 3), keeping just the new character in slot[0].
    > (2) first compare the new character with the one in used slot[0] to
    > see if it matches a canonical pair: if it
    > matches, replace the character in slot[0] by the precombined
    > character.
    > (3) otherwise just place the new character in slot[N] (where N is
    > the used length) and increment N.
    > In this last step (3), you may (and should) also want to use a
    > sliding insertion in backward direction (instead of
    > just placing the new character at end of the buffer in slot[N]),
    > creating the equivalent of an insertion sort, to
    > avoid keeping the maximum ccc value within the N characters stored
    > in the buffer : it will even be more efficient
    > than sorting the buffer before outputing it (in step 1), because the
    > buffer length will (almost always) be extremely
    > small with length 0 or 1, in more than 99% of cases (the additional
    > cost of "efficient" sorts like Quicksort is not
    > worth the value on very small vectors that will constitute the huge
    > majority of cases, even with the longest
    > theoretical ones that are MUCH smaller in length than 255, with the
    > existing UCD): insertion sorts are very
    > efficient on very small vectors.
    > If you do this, then the character in slot[0] (if there's one, i.e.
    > when N>0) will be kept in the buffer only :
    > * if it can effectively combine (when it has ccc=0) within an
    > existing canonical pair of the UCD,
    > * or if there are other characters in the UCD with a ccc value
    > distinct from the ccc values of non-starter
    > characters already in your buffer in slot[1 .. N-1].
    > You can also perform the backward sliding for insertion in the same
    > loop used to scan the content of your buffer for
    > comparing ccc values: when you find the location of the insertion,
    > you can immediately place the character at that
    > location, given that the rest of the buffer is already slid:
    > When this is done, the slot[i] of insertion is the one where you
    > just store the codepoint, so every other prior
    > stots in [0..i-1] can be output. Theoretically, you would want to
    > slide back the rest of the buffer so that slot[i]
    > will go back to slot[0] after you have output every character in
    > slot[0 .. i-1], but the simplest is to not slide
    > this content at all, but keep a persistant index of the location
    > with the first used character.
    >> Message du 23/09/09 16:04
    >> De : "Jeff Senn"
    >> A :
    >> Copie :
    >> Objet : Re: NFC/NFKC Normalization Edge Case
    >> On Sep 22, 2009, at 5:44 PM, Kenneth Whistler wrote:
    >>>> All of these characters have combining class 0. Can they be
    >>>> canonically
    >>>> combined? Even though the 2nd characters are NOT "combining"?
    >>> There's the first mistake. Both of the 2nd characters in
    >>> these sequences *ARE* combining:
    >> Thanks guys. Clear now.
    >> I was led astray by:
    >> "D1. A character S is a starter if it has a combining class of zero
    >> in
    >> the Unicode Character Database. Any other character is a non-
    >> starter."
    >> and
    >> "In some implementations, people may be working with streaming
    >> interfaces that read and write small amounts at a time. In those
    >> implementations, the text back to the last starter needs to be
    >> buffered. Whenever a second starter would be added to that buffer,
    >> the
    >> buffer can be flushed."
    >> Which are technically correct (once you know the correct answer),
    >> but led me to suspect I could do slightly more aggressive flushing of
    >> a streamed buffer...

    This archive was generated by hypermail 2.1.5 : Tue Sep 29 2009 - 11:50:41 CDT