Re: Problem with accented characters

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Aug 24 2004 - 17:52:29 CDT

    From: "John H. Jenkins" <jenkins@apple.com>
    > On Aug 23, 2004, at 3:34 PM, Doug Ewell wrote:
    >
    > > Deborah Goldsmith <goldsmit at apple dot com> wrote:
    > >
    > >> FYI, by far the largest source of text in NFD (decomposed) form in
    > >> Mac OS X is the file system. File names are stored this way (for
    > >> historical reasons), so anything copied from a file name is in (a
    > >> slightly altered form of) NFD.
    > >
    > > "Slightly altered"?
    > >
    >
    > Yes, the specification for the Mac file system was frozen before NFD
    > had been developed by the UTC, so it isn't exactly the same. But it's
    > close.

    Yes, it is very close to NFD. The actual decompositions performed are fully
    listed in the documentation of the MacOS filesystems. Note that there are
    differences between the various Mac filesystems, which were also localized
    in their drivers (in a way quite similar to the legacy MS-DOS filesystems
    with their unpredictable codepage, notably when reading removable media
    where the codepage of the system that created the media is not stored on
    the medium itself...)

    Actually, it was based on the decompositions in Unicode 2.01. But the list
    of decompositions is now frozen at a specific Unicode version in the
    filesystem driver, for compatibility reasons. This is needed because some
    media may be created later with characters from a later version of Unicode
    that is not yet supported by the driver of a legacy system on which the
    media would be used. It is even more important for networked filesystems,
    for security reasons.
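
    As a rough illustration of decomposing a name against a frozen table, here
    is a minimal Python sketch. It uses standard NFD from Python's unicodedata
    module, NOT the Mac filesystem's own (slightly different) decomposition
    list, and the file name is made up. The ucd_3_2_0 object is a Unicode 3.2
    snapshot that Python happens to ship; it only stands in for the idea of a
    table pinned to one version.

```python
import unicodedata

# Hypothetical file name with two precomposed characters (U+00E9).
name = "r\u00e9sum\u00e9.txt"

# Standard NFD using the Unicode data bundled with this Python build.
nfd = unicodedata.normalize("NFD", name)

# The same operation against a table frozen at Unicode 3.2. This is only
# an analogy for a driver that pins its decomposition list to one version.
nfd_frozen = unicodedata.ucd_3_2_0.normalize("NFD", name)

# Each U+00E9 becomes U+0065 followed by the combining acute U+0301.
code_points = [f"U+{ord(c):04X}" for c in nfd]

print("current table version:", unicodedata.unidata_version)
print("frozen table version: ", unicodedata.ucd_3_2_0.unidata_version)
print(code_points)
```

    For characters that already existed when the table was frozen, both
    calls give the same result; they can only differ for characters added
    in later versions.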

    For the same security reasons, Windows filesystems will NOT normalize
    Unicode filenames, which are stored as a binary vector of UTF-16 code units
    (with some of them restricted for special usage or forbidden, notably
    code units/code points in the ASCII range that have predefined
    functions, or that are excluded, such as most controls), and optionally
    mapped to a secondary "short" 8.3 name using a local OS codepage.
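
    To make the "binary vector of UTF-16 code units" point concrete, here is
    a small Python sketch. It models only the code-unit comparison described
    above; it does not touch a real filesystem and ignores everything else a
    real lookup may do (case handling, for instance). The helper utf16_units
    and the sample name are invented for the example.

```python
import unicodedata

def utf16_units(s):
    """Return the UTF-16 code units of s as a list of integers."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]

# The same visible name in precomposed (NFC) and decomposed (NFD) form.
nfc = unicodedata.normalize("NFC", "caf\u00e9")
nfd = unicodedata.normalize("NFD", "caf\u00e9")

units_nfc = utf16_units(nfc)  # ends with 0x00E9
units_nfd = utf16_units(nfd)  # ends with 0x0065, 0x0301

# Without normalization, the two names are simply different vectors.
same_name = units_nfc == units_nfd

print([hex(u) for u in units_nfc])
print([hex(u) for u in units_nfd])
print("same name:", same_name)
```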

    However, it is highly recommended to use the NFC form when creating Unicode
    filenames on Windows (notably because it offers round-trip compatibility
    with filenames created in a Windows codepage, where characters are
    precomposed). If you create a filename with decomposed characters in NFD
    form, you may not be able to open that file using the filename encoded in
    the Windows or OEM codepage (the filesystem will not find it, as it uses a
    simple one-to-one mapping from the codepage codes to Unicode code points in
    NFC form).
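
    The round-trip problem can be sketched in Python as follows. This is only
    a model of the one-to-one codepage mapping, using Python's cp1252 codec as
    an assumed Windows codepage; the file names and the encodable helper are
    hypothetical and no real filesystem call is involved.

```python
import unicodedata

CODEPAGE = "cp1252"  # assumption: a Western European Windows codepage

# A name as a legacy application would pass it: one byte per character,
# with the accented letter precomposed (0xE9 in cp1252).
legacy_bytes = b"caf\xe9.txt"
from_codepage = legacy_bytes.decode(CODEPAGE)

# The same visible name as it might have been stored, in each form.
stored_nfc = unicodedata.normalize("NFC", "cafe\u0301.txt")
stored_nfd = unicodedata.normalize("NFD", "cafe\u0301.txt")

# A plain code-point comparison only matches the NFC spelling.
found_nfc = from_codepage == stored_nfc
found_nfd = from_codepage == stored_nfd

def encodable(name, codepage=CODEPAGE):
    """Return True if name can be written in the given codepage."""
    try:
        name.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False

print("NFC name found:", found_nfc, "encodable:", encodable(stored_nfc))
print("NFD name found:", found_nfd, "encodable:", encodable(stored_nfd))
```

    The NFD name fails in both directions here: the decoded codepage name does
    not compare equal to it, and its combining accent (U+0301) has no code in
    cp1252, so it cannot be written in that codepage at all.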


