Re: Exemplifying apostrophes

From: Jukka K. Korpela (jkorpela@cs.tut.fi)
Date: Fri May 16 2008 - 03:28:12 CDT

    Jonathan Pool wrote:

    > I’m working on normalizing entries in over 1,000 languages in a
    > lexical database. One of the tasks I’m finding most difficult is
    > normalizing apostrophe-like characters.

    I can imagine that. It depends on what you mean by normalization and
    what you mean by apostrophe-like, but it’s difficult anyway. However, if
    you mean normalization in the sense of transforming to a Normalization
    Form as defined in the Unicode Standard, then it’s algorithmically
    solvable, and the difficulty lies in finding suitable software for it.
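
    As a minimal sketch of normalization in that first sense, assuming
    Python's standard unicodedata module is acceptable software (the helper
    name to_nfc is made up for illustration), conversion to Normalization
    Form C is a single call. Note that it leaves U+0027, U+2019 and U+02BC
    as they are, so it does not settle the apostrophe question:

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return text in Unicode Normalization Form C (canonical composition)."""
    return unicodedata.normalize("NFC", text)


# "e" followed by U+0301 COMBINING ACUTE ACCENT composes to U+00E9.
decomposed = "e\u0301"
composed = to_nfc(decomposed)
print(len(decomposed), len(composed))

# Apostrophe-like characters have no canonical decomposition here,
# so NFC returns each of them unchanged.
for ch in ("\u0027", "\u2019", "\u02BC"):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), to_nfc(ch) == ch)
```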

    If you mean operations like replacing characters that look like an
    apostrophe by the characters that are more recommendable according to
    some recommendation, then it’s very, very difficult. Regarding in
    particular U+0027 APOSTROPHE in existing data, I strongly suggest that
    if you do not know absolutely and provably what it really “stands for,”
    don’t touch it. When reading text in a natural language that you know
    well, you can usually know what U+0027 should be changed to, but if it’s
    anything that might be a foreign name or code-like notation, it’s easy
    to go wrong.
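
    In that spirit, a safer first step than replacing anything is to
    report where apostrophe-like characters occur and leave the decision
    to someone who knows the language. The following is only a sketch: the
    set APOSTROPHE_LIKE and the function find_apostrophe_like are my own
    illustrative choices, and the set is certainly not complete.

```python
import unicodedata

# Illustrative, incomplete selection of characters that can look like
# an apostrophe. This list is an assumption, not a recommendation.
APOSTROPHE_LIKE = {
    "\u0027",  # APOSTROPHE
    "\u0060",  # GRAVE ACCENT
    "\u00B4",  # ACUTE ACCENT
    "\u02BB",  # MODIFIER LETTER TURNED COMMA
    "\u02BC",  # MODIFIER LETTER APOSTROPHE
    "\u2018",  # LEFT SINGLE QUOTATION MARK
    "\u2019",  # RIGHT SINGLE QUOTATION MARK
    "\u2032",  # PRIME
}


def find_apostrophe_like(text: str) -> list[tuple[int, str, str]]:
    """List (index, code point, character name) for each candidate.

    The text itself is never modified; this only reports.
    """
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch))
        for i, ch in enumerate(text)
        if ch in APOSTROPHE_LIKE
    ]


for entry in ("don't", "Hawai\u02BBi", "l\u2019heure"):
    print(entry, find_apostrophe_like(entry))
```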

    > I was hoping to use documents
    > at the Unicode Web site, including translations of “What is Unicode”
    > and of UDHR, as guides for some languages, but many of the documents
    > seem to contain U+0027 APOSTROPHE where my reading of the standard
    > says other characters are preferred. I’m curious about the reason.

    This was discussed some time ago on this list when I raised the issue.
    Check the list archives if you are interested in people’s views as
    they expressed them, but my impression was that this was not regarded as
    important enough to be done right. I can understand this, though I do
    not accept it. It’s difficult to exercise control over voluntary work,
    since if you require too much, volunteers just stop volunteering, and in
    this issue, you would usually need volunteers to supervise other
    volunteers’ texts.

    Erkki mentioned the ease of typing, which is mostly true (though when
    using Microsoft Office or some similar software, it’s really U+0027 that
    is somewhat difficult to type, since the program automagically converts
    it to right or left single quotation mark). But any difficulties in
    typing characters should be just challenges, not obstacles, to a person
    who writes about Unicode. In the old days of the Web, U+0027 was almost
    universally used for any kind of an apostrophe-like character, since it
    worked more reliably than, say, the right single quotation mark or the
    prime, but regarding single quotes, this changed many years ago. On the
    authoring side, the single quotation marks can be entered (at least) as
    character references like &#8217; or as half-mnemonic entity references
    like &rsquo;, but not all people know that (and not all people know how
    to switch to “HTML mode” in authoring when needed).
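
    For those who want to check the two notations for the right single
    quotation mark (U+2019), here is a small sketch using Python's standard
    html module. The choice of Python is mine; any HTML-aware tool would
    do. It shows that the numeric character reference and the named entity
    reference resolve to the same character, and how to produce either
    from the character itself.

```python
import html
from html.entities import codepoint2name

RSQUO = "\u2019"  # RIGHT SINGLE QUOTATION MARK

# Decimal reference, hexadecimal reference and named entity all
# decode to the same character.
for ref in ("&#8217;", "&#x2019;", "&rsquo;"):
    print(ref, html.unescape(ref) == RSQUO)

# Going the other way: a numeric character reference via the
# xmlcharrefreplace error handler, and the entity name via the
# standard HTML entity table.
numeric_ref = RSQUO.encode("ascii", "xmlcharrefreplace").decode("ascii")
named_ref = "&" + codepoint2name[ord(RSQUO)] + ";"
print(numeric_ref, named_ref)
```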

    Jukka K. Korpela ("Yucca")
    http://www.cs.tut.fi/~jkorpela/


