From: Jukka K. Korpela (firstname.lastname@example.org)
Date: Fri May 16 2008 - 03:28:12 CDT
Jonathan Pool wrote:
> I’m working on normalizing entries in over 1,000 languages in a
> lexical database. One of the tasks I’m finding most difficult is
> normalizing apostrophe-like characters.
I can imagine that. It depends on what you mean by normalization and
what you mean by apostrophe-like, but it’s difficult anyway. However, if
you mean normalization in the sense of transforming to a Normalization
Form as defined in the Unicode Standard, then it’s algorithmically
solvable, and the difficulty lies in finding suitable software for it.
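As an illustration of that first sense, here is a minimal sketch using Python's standard `unicodedata` module. The helper name `to_nfc` and the sample strings are my own assumptions for the example, not anything from the original question. It also suggests why this kind of normalization does not settle the apostrophe question: the lookalike characters are distinct and come through NFC unchanged.

```python
import unicodedata

def to_nfc(text: str) -> str:
    """Return text in Normalization Form C (hypothetical helper name)."""
    return unicodedata.normalize("NFC", text)

# "e" followed by COMBINING ACUTE ACCENT composes to a single code point.
decomposed = "e\u0301"
composed = to_nfc(decomposed)

# Apostrophe-like characters chosen for illustration: APOSTROPHE,
# RIGHT SINGLE QUOTATION MARK, MODIFIER LETTER APOSTROPHE, PRIME.
samples = ["\u0027", "\u2019", "\u02BC", "\u2032"]

# NFC leaves each of them as it is; it does not merge lookalikes.
unchanged = all(to_nfc(ch) == ch for ch in samples)
```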
If you mean operations like replacing characters that look like an
apostrophe with the characters that some recommendation prefers,
then it’s very, very difficult. Regarding in
particular U+0027 APOSTROPHE in existing data, I strongly suggest that
if you do not know absolutely and provably what it really “stands for,”
don’t touch it. When reading text in a natural language that you know
well, you can usually know what U+0027 should be changed to, but if it’s
anything that might be a foreign name or code-like notation, it’s easy
to go wrong.
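One cautious way to act on this is to audit the data rather than rewrite it. The following is only a sketch: the candidate set and the function name `audit_apostrophes` are assumptions made for illustration, and the set is certainly not complete. It reports where apostrophe-like characters occur and changes nothing, so that a person who knows the language can decide each case.

```python
import unicodedata

# Assumed, non-exhaustive set of apostrophe-like characters.
CANDIDATES = {
    "\u0027",  # APOSTROPHE
    "\u0060",  # GRAVE ACCENT
    "\u00B4",  # ACUTE ACCENT
    "\u02BB",  # MODIFIER LETTER TURNED COMMA
    "\u02BC",  # MODIFIER LETTER APOSTROPHE
    "\u2018",  # LEFT SINGLE QUOTATION MARK
    "\u2019",  # RIGHT SINGLE QUOTATION MARK
    "\u2032",  # PRIME
}

def audit_apostrophes(text: str) -> list[tuple[int, str, str]]:
    """List (index, code point, Unicode name) for each candidate found.

    The text itself is never modified.
    """
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if ch in CANDIDATES
    ]
```

Calling `audit_apostrophes("O'Brien")` reports a single U+0027 at index 1 and leaves the decision about what it “stands for” to the reader.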
> I was hoping to use documents
> at the Unicode Web site, including translations of “What is Unicode”
> and of UDHR, as guides for some languages, but many of the documents
> seem to contain U+0027 APOSTROPHE where my reading of the standard
> says other characters are preferred. I’m curious about the reason.
This was discussed some time ago on this list when I raised the issue.
Check the list archives if you are interested in people’s views as
they expressed them, but my impression was that this was not regarded as
important enough to be done right. I can understand this, though I do
not accept it. It’s difficult to exercise control over voluntary work,
since if you require too much, volunteers just stop volunteering, and in
this issue, you would usually need volunteers to supervise other
volunteers.
Erkki mentioned the ease of typing, which is mostly true (though when
using Microsoft Office or some similar software, it’s really U+0027 that
is somewhat difficult to type, since the program automagically converts
it to right or left single quotation mark). But any difficulties in
typing characters should be just challenges, not obstacles, to a person
who writes about Unicode. In the old days of the Web, U+0027 was almost
universally used for any kind of apostrophe-like character, since it
worked more reliably than, say, the right single quotation mark or the
prime, but regarding single quotes, this changed many years ago. On the
authoring side, the single quotation marks can be entered (at least) as
character references like &#8217; or as half-mnemonic entity references
like &rsquo;, but not all people know that (and not all people know how
to switch to “HTML mode” in authoring when needed).
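For authors who generate HTML by program, the two notations can be produced mechanically. This is a small sketch using Python's standard `html` and `html.entities` modules; the function names are hypothetical, and the fallback to a numeric reference when no named entity exists is my own design choice.

```python
import html
from html.entities import codepoint2name

def to_numeric_reference(ch: str) -> str:
    """Decimal character reference for one character."""
    return f"&#{ord(ch)};"

def to_entity_reference(ch: str) -> str:
    """Named entity reference if HTML defines one, else a numeric one."""
    name = codepoint2name.get(ord(ch))
    return f"&{name};" if name else to_numeric_reference(ch)

right_quote = "\u2019"  # RIGHT SINGLE QUOTATION MARK
numeric = to_numeric_reference(right_quote)
named = to_entity_reference(right_quote)

# Both notations decode back to the same character.
round_trip = html.unescape(numeric) == html.unescape(named) == right_quote
```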
Jukka K. Korpela ("Yucca")