Understanding normalisation

From: Theodore H. Smith (delete@elfdata.com)
Date: Sun May 28 2006 - 09:18:18 CDT

    I've got some code that can do a multiple, parallel replacement upon
    Unicode strings. I can use this successfully to decompose and compose
    some Unicode characters.

    But that's all that it does, multiple parallel string replacement. No
    reordering or anything else.

    I'm wondering: what limitations would it have as a way of doing
    decomposition? And as a way of doing composition?

    Is it true that you can't successfully decompose a string without
    doing a proper NFD operation on it? And just the same for composing,
    is it true that you can't compose a string without doing a proper NFC?
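
    A minimal sketch of the decomposition side of that question. It
    assumes Python's standard unicodedata module as the reference
    normaliser; the three-entry table and the name blind_decompose are
    hypothetical stand-ins for the replacement code described above,
    not the real tool.

```python
import unicodedata

# Hypothetical decomposition table: a three-entry subset, for illustration only.
# Each value is the full canonical decomposition of its key.
DECOMPOSE = str.maketrans({
    "\u00E2": "a\u0302",        # a with circumflex
    "\u1EA1": "a\u0323",        # a with dot below
    "\u1EAD": "a\u0323\u0302",  # a with circumflex and dot below
})

def blind_decompose(text):
    # Pure parallel replacement: listed characters are expanded, nothing is reordered.
    return text.translate(DECOMPOSE)

def show(text):
    # Render a string as code points so the combining marks are visible.
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

fully_composed = "\u1EAD"          # a single precomposed character
partly_composed = "\u00E2\u0323"   # a-circumflex followed by a loose dot below

for sample in (fully_composed, partly_composed):
    blind = blind_decompose(sample)
    nfd = unicodedata.normalize("NFD", sample)
    print(show(sample), "->", show(blind), "| NFD:", show(nfd), "| match:", blind == nfd)
```

    On these two samples the replacement matches NFD for the single
    precomposed character, but not for the partly composed one: the
    expanded marks come out in the order they were written, while NFD
    puts the dot below before the circumflex.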

    For example: as I understand it, one problem that could occur when
    doing a blind "composition" upon a Unicode string is that a
    character may have its combiners in a different order than my
    composer recognises, and thus this character won't get composed.
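
    The ordering problem described above can be sketched as follows.
    This assumes Python's standard unicodedata module as the reference
    normaliser; the COMPOSE table and parallel_replace are hypothetical
    stand-ins for a table-driven composer, with longest match first.

```python
import re
import unicodedata

# Hypothetical composition table: a tiny subset, for illustration only.
COMPOSE = {
    "a\u0302": "\u00E2",        # a + circumflex
    "a\u0323": "\u1EA1",        # a + dot below
    "a\u0323\u0302": "\u1EAD",  # a + dot below + circumflex (canonical order)
}

def parallel_replace(text, table):
    # Try longer keys first so the longest match wins at each position.
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: table[m.group(0)], text)

canonical = "a\u0323\u0302"  # marks in canonical order (dot below, then circumflex)
swapped = "a\u0302\u0323"    # same marks, typed in the other order

blind_ok = parallel_replace(canonical, COMPOSE)
blind_bad = parallel_replace(swapped, COMPOSE)
nfc = unicodedata.normalize("NFC", swapped)

print(blind_ok == nfc)   # the canonical-order input composes fully
print(blind_bad == nfc)  # the swapped input is only partly composed
```

    With the swapped input, the blind composer finds "a + circumflex",
    emits U+00E2, and leaves the dot below dangling, whereas NFC yields
    the single character U+1EAD for both inputs.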

    Let's say I were to make a shell tool or something like that, that
    performed my "multiple parallel string replacement" upon text files,
    to do composition or decomposition. What limitations should I write
    into the documentation for the tool, to say that given certain kinds
    of text, it won't produce correct output? Put another way, what
    restrictions on the input would guarantee that this tool still
    produces correct output? Or would it be better to add a simple
    additional processing step so that it produces proper NFC or NFD
    output?

    Is it true that if I performed a proper combining character
    reordering (as described by UTR15) upon some Unicode text, and then
    ran my "parallel string replacement based composer" upon the text,
    I'd generate correct NFC?
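
    A rough sketch of "reorder, then compose", to make the question
    concrete. It assumes Python's standard unicodedata module for
    combining classes and as the reference normaliser; COMPOSE,
    canonical_reorder, blind_compose and reorder_then_compose are
    hypothetical names, and the table is a three-entry subset.

```python
import re
import unicodedata

# Hypothetical composition table: a tiny subset, for illustration only.
COMPOSE = {
    "a\u0302": "\u00E2",        # a + circumflex
    "a\u0323": "\u1EA1",        # a + dot below
    "a\u0323\u0302": "\u1EAD",  # a + dot below + circumflex
}

def canonical_reorder(text):
    # Stable-sort each run of combining marks by canonical combining class.
    out, run = [], []
    for ch in text:
        if unicodedata.combining(ch):
            run.append(ch)
        else:
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)

def blind_compose(text):
    # Table-driven parallel replacement, longest key first.
    keys = sorted(COMPOSE, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: COMPOSE[m.group(0)], text)

def reorder_then_compose(text):
    return blind_compose(canonical_reorder(text))

swapped = "a\u0302\u0323"                # marks out of canonical order
precomposed_plus_mark = "\u00E2\u0323"   # already partly composed

fixed = reorder_then_compose(swapped)
still_wrong = reorder_then_compose(precomposed_plus_mark)

print(fixed == unicodedata.normalize("NFC", swapped))
print(still_wrong == unicodedata.normalize("NFC", precomposed_plus_mark))
```

    On these samples, reordering fixes the out-of-order case, but the
    partly composed input still differs from NFC because it would have
    to be fully decomposed before reordering. The sketch also leaves
    out the other parts of the NFC algorithm, such as blocked marks,
    composition exclusions and Hangul syllables.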

    That question might make it look like I don't understand
    normalisation. The problem I'm having is that Unicode.org's
    technical information is a bit hard to understand at times. I'm sure
    it can be explained in a simpler manner?

    Thanks for any answers!

    PS: I've some users who have been using this composer/decomposer I've
    made, for converting file names between Windows and OSX, and it
    actually works perfectly for them.

    But then I read the Unicode TR15 again, and realised that maybe it
    was only a matter of time before a situation came up where this
    decomposer/composer failed because it does nothing about reordering
    combiners. I'm not even sure that it will fail, though, because I'm
    having a hard time understanding this report.
