Re: Using combining diacritical marks and zero-width joiners in a name

From: Jukka K. Korpela
Date: Tue Apr 01 2008 - 15:26:46 CST

    Thomas Hühn wrote:

    > Would a string like "Tho‍mas Hüh‍n" (ThoU+200Dmas HuU+0308hU+200Dn) be
    > (a) a valid Unicode string with some semantics?

    I don't see why not, but beyond the technical aspects the semantics is
    to be assigned at a higher protocol level. On the technical side, "T" is
    definitely an uppercase Latin letter, for example, and U+200D has
    certain defined properties and meaning: it suggests ligature or cursive
    rendering.
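
    The "defined properties" can be inspected directly. What follows is a
    minimal sketch in Python, assuming the standard unicodedata module; the
    helper name describe() is made up for illustration. It lists the code
    point, general category and character name for each character of the
    sample string.

```python
import unicodedata

# The sample name with the joiners and the combining diaeresis spelled out.
name = "Tho\u200dmas Hu\u0308h\u200dn"

def describe(text):
    """Return (code point, general category, character name) per character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch),
         unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]

for row in describe(name):
    print(*row)
```

    In this listing "T" is reported with category Lu (uppercase letter),
    U+200D with Cf (format character) and U+0308 with Mn (nonspacing mark).
    None of this says anything about what the string means as a name.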

    > (b) a valid Unicode string that may be used to transmit the
    > information that someone is called "Thomas Hühn"?

    Yes. The U+200D character is basically typographic in nature. It is
    generally pointless to use it that way (ligatures for "om" and "hn" are
    not actually used, and it is difficult to see how they _could_ be
    used), but hey, it's a suggestion and can be ignored.

    Representing ü as u followed by U+0308 is not common, but it is surely
    possible: it's just the canonically decomposed form of "ü". Many
    programs will choke on it, but that's a different story. Beware that
    although the rendering of the two representations of ü _should_
    generally be the same, it often isn't. And you should not expect
    programs to treat them as different, but neither should you rely on
    their _not_ being treated as different.
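
    To illustrate the point about the two representations, here is a small
    sketch in Python using unicodedata.normalize from the standard library.
    The function name canonically_equal() is hypothetical, and comparing
    after NFC normalization is just one reasonable way for a program to
    avoid treating the two forms as different.

```python
import unicodedata

precomposed = "H\u00fchn"    # ü as the single code point U+00FC
decomposed = "Hu\u0308hn"    # u followed by U+0308 COMBINING DIAERESIS

# A naive code point comparison sees two different strings.
print(precomposed == decomposed)

def canonically_equal(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# After normalization the two spellings compare equal.
print(canonically_equal(precomposed, decomposed))

# NFD turns the precomposed form into the decomposed one.
print(unicodedata.normalize("NFD", precomposed) == decomposed)
```

    A program that compares raw code points will treat the forms as
    different; one that normalizes first will not. Which kind you are
    dealing with is exactly what you cannot rely on.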

    > Question (b) aims at whether this string might be a valid From: in
    > some Internet mail message (properly MIME-encoded, of course) or just
    > a bunch of characters that just don't fit together semantically.

    This really depends on the Internet message header specifications, i.e.
    on higher level protocols. It is up to them to define which characters
    are allowed in such contexts.

    Many people still refrain from using any non-ASCII characters in
    Internet message headers (including even Subject headers, resulting in
    distortion of texts), and I can't really blame them, since I know that
    they still cause trouble. (I have even seen an E-mail message bounce
    back just because a recipient was specified in a Cc header so that his
    name contained a non-ASCII letter, "ä", properly inside quotation marks
    and with MIME encoding, and the bounce came from the primary recipient's
    E-mail system...) And surely U+0308 and U+200D can be expected to be
    more risky in message headers than the precomposed ü, U+00FC.
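
    For someone who wants to lower that risk, one conservative option is to
    drop the joiners and use the precomposed form before MIME-encoding the
    name. The sketch below uses Python's standard email.header module. The
    function names are made up for illustration, and removing U+200D is an
    assumed policy that does change the string; it is defensible only
    because the joiner is a rendering suggestion that can be ignored.

```python
import unicodedata
from email.header import Header, decode_header, make_header

ZWJ = "\u200d"  # ZERO WIDTH JOINER

def conservative_display_name(name):
    """Drop ZERO WIDTH JOINER and compose to NFC (an assumed, lossy policy)."""
    return unicodedata.normalize("NFC", name.replace(ZWJ, ""))

def encode_display_name(name):
    """Encode a display name as RFC 2047 encoded-word(s) in UTF-8."""
    return Header(name, charset="utf-8").encode()

raw = "Tho\u200dmas Hu\u0308h\u200dn"
safe = conservative_display_name(raw)
encoded = encode_display_name(safe)

# The encoded form is pure ASCII and suitable for a message header.
print(encoded)

# Decoding the encoded form gives back the cleaned-up name.
print(str(make_header(decode_header(encoded))) == safe)
```

    Whether a given mail system copes even with this form is, as noted
    above, a matter of the implementations involved and not of Unicode.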

    Jukka K. Korpela ("Yucca")

    This archive was generated by hypermail 2.1.5 : Tue Apr 01 2008 - 15:35:09 CST