RE: Small Java implementation of NFC

From: Richard T. Gillam (rgillam@las-inc.com)
Date: Thu Mar 03 2005 - 09:47:29 CST

    >Thus I'm looking at ways to implement NFC that don't require me to drag
    >around the Unicode data files or substantial chunks thereof. I notice
    >that java.lang.Character has a getType method that returns the Unicode
    >character class for each character. This is built-in to Java since 1.1.
    >It lets me tell if a character is one of the following types:
    >
    >...
    >
    >Is this sufficient information to perform NFC normalization? Or is that
    >a pipe dream and I'm just going to need to drag along the Unicode data
    >file or part thereof?

    If I'm reading your message right, this is definitely not enough to do
    NFC. It isn't enough to know the types of various characters; you have
    to have their actual decomposition mappings, plus their canonical
    combining classes. Knowing you have a base
    character next to a combining character only lets you know this is a
    spot where you might be able to combine characters; it doesn't tell you
    which character you replace these two with, or even if such a character
    exists.
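
    A minimal sketch of the point above. It assumes a JDK that ships
    java.text.Normalizer (Java 6 and later, which postdates this message);
    the class name TypeVersusMapping and its helper methods are made up for
    illustration. Both "e" and "q" followed by U+0301 look identical through
    getType, yet only one of them has a precomposed form:

```java
import java.text.Normalizer;

// Hypothetical demo class; not part of any library.
class TypeVersusMapping {
    /** Normalizes to NFC using the JDK's own tables (assumes Java 6+). */
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    /** True if a and b have the same general category at every position. */
    static boolean sameTypes(String a, String b) {
        if (a.length() != b.length()) {
            return false;
        }
        for (int i = 0; i < a.length(); i++) {
            if (Character.getType(a.charAt(i)) != Character.getType(b.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String eAcute = "e\u0301"; // e + U+0301 COMBINING ACUTE ACCENT
        String qAcute = "q\u0301"; // q + U+0301 COMBINING ACUTE ACCENT

        // getType cannot tell the two sequences apart:
        // both are LOWERCASE_LETTER followed by NON_SPACING_MARK.
        System.out.println(sameTypes(eAcute, qAcute));    // true

        // The mapping data can: e + U+0301 composes to U+00E9,
        // while q + U+0301 has no precomposed form and is left alone.
        System.out.println(nfc(eAcute).equals("\u00E9")); // true
        System.out.println(nfc(qAcute).equals(qAcute));   // true
    }
}
```

    The type information is the same in both cases, so any decision based
    only on getType would have to treat them the same way.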

    You could pare down the NFC algorithm by restricting its domain to just
    the characters you expect to see (this might, for example, get you out
    of dealing with non-BMP combining characters [yes, they do exist]). You
    could also get rid of some of the more obscure code. For example, you
    might have <B C1 C2>, where B is a base character and C1 and C2 are
    combining characters (with different, nonzero combining classes).
    There's no single Unicode character for B+C1 or for B+C1+C2, but there
    is one for B+C2, so NFC would result in <B+C2 C1>, combining B with C2
    even though C1 is in the way. You could eliminate this.
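
    The <B C1 C2> case can be seen with a concrete sequence. This is only a
    sketch, again assuming java.text.Normalizer (Java 6+) and Java 8 for
    String.codePoints; the class name SkippedMarkDemo is hypothetical. It
    uses B = "a", C1 = U+0316 (combining grave accent below, class 220) and
    C2 = U+0301 (combining acute accent, class 230). For contrast it also
    shows U+0305 (combining overline), which has the same class as U+0301
    and therefore blocks the composition:

```java
import java.text.Normalizer;

// Hypothetical demo class; not part of any library.
class SkippedMarkDemo {
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    /** Formats a string as space-separated U+XXXX code points. */
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // a + U+0316 + U+0301: there is no precomposed a-with-grave-below,
        // but a + U+0301 composes to U+00E1 across the intervening mark.
        System.out.println(codePoints(nfc("a\u0316\u0301"))); // U+00E1 U+0316

        // a + U+0305 + U+0301: both marks have combining class 230,
        // so the overline blocks the acute and nothing composes.
        System.out.println(codePoints(nfc("a\u0305\u0301"))); // U+0061 U+0305 U+0301
    }
}
```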

    Of course, if you do any of these things, you're not doing NFC anymore,
    at least not if you're presented with data that you didn't expect to
    see. You're doing your own normalization that kind of looks like NFC.
    In a closed system, where the stuff your normalization doesn't handle is
    stuff the rest of your application can't handle either, this might be a
    reasonable way to go. What you couldn't do is take the output of such a
    modified converter and interchange it with the outside world claiming
    it's NFC.
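
    As a rough illustration of what such a pared-down normalizer might look
    like, and where it parts ways with real NFC, here is a sketch. Everything
    in it is an assumption made for the example: the class name
    RestrictedComposer, the three-entry PAIRS table, and the decision to
    compose only adjacent pairs with no decomposition or reordering step.
    java.text.Normalizer (Java 6+) is used only as the reference to compare
    against:

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

// Hypothetical restricted composer; NOT a conforming NFC implementation.
class RestrictedComposer {
    // Hand-picked (base + mark) -> precomposed pairs the application expects.
    private static final Map<String, Character> PAIRS = new HashMap<>();
    static {
        PAIRS.put("a\u0301", '\u00E1'); // a + acute
        PAIRS.put("e\u0301", '\u00E9'); // e + acute
        PAIRS.put("e\u0300", '\u00E8'); // e + grave
    }

    /** Composes only adjacent pairs listed in PAIRS; everything else passes through. */
    static String compose(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            Character composed =
                (i + 1 < s.length()) ? PAIRS.get(s.substring(i, i + 2)) : null;
            if (composed != null) {
                out.append(composed.charValue());
                i += 2;
            } else {
                out.append(s.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    /** True if the restricted composer happens to match real NFC for s. */
    static boolean agreesWithNfc(String s) {
        return compose(s).equals(Normalizer.normalize(s, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        System.out.println(agreesWithNfc("e\u0301"));       // true: expected input
        System.out.println(agreesWithNfc("o\u0301"));       // false: pair not in the table
        System.out.println(agreesWithNfc("a\u0316\u0301")); // false: mark in the way
    }
}
```

    Inside the expected domain the two agree; outside it the output is
    simply different from NFC, which is why it can't be labeled as such.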

    Hope this helps...

    --Rich Gillam
      Language Analysis Systems, Inc.


