Re: Diacritical marks: Single character or combined character?

From: Jukka K. Korpela <jkorpela_at_cs.tut.fi>
Date: Fri, 06 Dec 2013 16:30:54 +0200

2013-12-06 0:10, Naz Gassiep wrote:

> Hi, does anyone have any answers to this question?
>
> From: mrnaz_at_hotmail.com
> To: unicode_at_unicode.org
> Subject: Diacritical marks: Single character or combined character?
> Date: Fri, 8 Nov 2013 18:37:29 +1100

As far as I can see, mainly from checking
http://www.unicode.org/mail-arch/unicode-ml/y2013-m11/
the original question never got distributed to the list.

> I would like to know if there is a best practice or recommendation as to
> which method to use when representing letters with diacritical marks.
> For example, take the following two characters:
> ā
> ā
> They may look the same, however the first is a single character U+0101,
> while the second is a combination of two, the first being regular a
> (U+0061) and the second being the combining macron (U+0304).

There is a lot about this in the Unicode Standard, but no general
recommendation. There is e.g. the W3C Character Model for the WWW,
nominally still a Working Draft http://www.w3.org/TR/charmod-norm/,
which promotes the use of Normalization Form C. This basically means
that you should use precomposed characters when possible. It has been
taken so seriously that the W3C Markup Validator, when used in HTML5
mode, issues a warning message (and earlier an error message!) about
violations of this policy, even though HTML5 drafts do not specify this
policy. For notes on this, see
http://stackoverflow.com/questions/5465170/text-run-is-not-in-unicode-normalization-form-c
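
As a minimal sketch of what Normalization Form C does to the two
spellings in the question, here is an illustration in Python using the
standard unicodedata module. The choice of Python and the names
(precomposed, decomposed, to_nfc, to_nfd) are assumptions made for this
example, not anything taken from the W3C document.

```python
import unicodedata

# The two spellings from the question (variable names are illustrative).
precomposed = "\u0101"   # LATIN SMALL LETTER A WITH MACRON
decomposed = "a\u0304"   # LATIN SMALL LETTER A + COMBINING MACRON


def to_nfc(text: str) -> str:
    """Compose base letter + combining mark where a precomposed character exists."""
    return unicodedata.normalize("NFC", text)


def to_nfd(text: str) -> str:
    """Split precomposed characters into base letter + combining marks."""
    return unicodedata.normalize("NFD", text)


if __name__ == "__main__":
    print(precomposed == decomposed)           # compares the raw code points
    print(to_nfc(decomposed) == precomposed)   # both spellings after NFC
    print(to_nfd(precomposed) == decomposed)   # both spellings after NFD
```

Text that is already in NFC is left unchanged by to_nfc, so applying it
to a whole document should be harmless for content that follows the W3C
policy.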

On the other hand, UTR #25, “Unicode Support for Mathematics”, recommends:
“for accented alphabetic characters used as variables, ideally only
decomposed sequences are used, because mathematics uses a multitude of
combining marks that greatly exceeds the predefined composed characters
in Unicode. Accordingly, it is better to have the math display facility
handle all of these cases uniformly to give a consistent look between
characters that happen to have a fully composed Unicode character and
those that do not.”
http://www.unicode.org/reports/tr25/

> In producing content, which is the better to use? When writing in
> languages such as Turkish, there are a limited finite set of diacritical
> marks, all of which are represented in the Unicode character set.

Yes. And this is by far the most common way of writing e.g. Turkish. If
you deviate from that, the visual result may differ from normal.

A precomposed character and the corresponding decomposed sequence can be
“expected” to have identical rendering, but in practice, they quite
often differ, for various reasons. For example, when reading your
message now in Thunderbird, I see “ā” and “ā” as completely identical;
but when I first read it on an Android device using its built-in e-mail
program, I saw them as very different (the latter looked almost like
“a¯”, i.e. “a” followed by a macron).

> However, when writing statistical formulae, every symbol used, including
> both Latin and Greek characters, can have a circumflex or overline added
> to it to denote a particular meaning. In that case, I found myself using
> the relevant character combined with U+0302 or U+0305 as needed.

That would correspond to the principle suggested in UTR #25. It might be
argued that some discrepancy may result if you apply that principle in
mathematical notations and the other principle in natural language
texts, which might contain e.g. “ā” as a letter. But this is not very
serious. The style of mathematical notations often differs from normal
text. Besides, letters in mathematical notations, when written properly,
mostly appear in italic, so a mathematical “ā” (in italic) usually
differs from copy text “ā” (upright) anyway.

> Now that I am switching between the two activities (writing stats stuff
> and publishing transliterated content), I find myself unsure as to what
> the best method is, if one is better than the other.

It seems that in general, you should use precomposed characters for
natural language texts, decomposed sequences for mathematical notations.
An exception is e.g. Latin words in grammar texts where you may need to
have both a macron and an acute accent on vowels, and since not all
combinations exist as precomposed, you may wish to make all of them
decomposed for uniformity.
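
To illustrate the point about uniformity, here is a small Python sketch
that puts both a macron and an acute accent on the five basic Latin
vowels and counts the resulting code points in each normalization form.
The function names long_stressed and lengths are hypothetical, chosen
only for this example.

```python
import unicodedata

MACRON = "\u0304"   # COMBINING MACRON
ACUTE = "\u0301"    # COMBINING ACUTE ACCENT


def long_stressed(vowel: str, form: str = "NFD") -> str:
    """Return the vowel with macron and acute in the given normalization form."""
    return unicodedata.normalize(form, vowel + MACRON + ACUTE)


def lengths(form: str) -> dict:
    """Map each basic vowel to the number of code points its marked form needs."""
    return {v: len(long_stressed(v, form)) for v in "aeiou"}


if __name__ == "__main__":
    print("NFC:", lengths("NFC"))
    print("NFD:", lengths("NFD"))
```

Under NFC, e and o collapse to single precomposed characters (U+1E17 and
U+1E53), while a, i and u remain a precomposed letter with macron
followed by a combining acute. Under NFD all five are a base letter plus
two combining marks.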

> I favour using a single method for all things, and so I am attracted to
> the idea of using combining characters for everything. However, language
> parsing tools for languages where those combined characters are used may
> be fooled when presented with U+0061 combined with U+0304 instead of the
> usual U+0101.

Caution is indeed needed. In general, we should expect that programs
cannot handle combining diacritic marks well. Many programs can, at
least under certain circumstances, but I would prefer nice positive
surprises to nasty failures.
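
One defensive measure, when you control the software yourself, is to
normalize text before comparing it or looking it up. The following
Python sketch assumes a hypothetical word list; the class name
NormalizedLexicon and the sample word are invented for illustration and
do not refer to any existing tool.

```python
import unicodedata


def normalize_key(text: str) -> str:
    """Bring a string to NFC so that equivalent spellings compare equal."""
    return unicodedata.normalize("NFC", text)


class NormalizedLexicon:
    """A word set that ignores precomposed vs. decomposed differences."""

    def __init__(self, words):
        self._words = {normalize_key(w) for w in words}

    def __contains__(self, word: str) -> bool:
        return normalize_key(word) in self._words


if __name__ == "__main__":
    sample = "k\u0101na"        # arbitrary sample word spelled with U+0101
    typed = "ka\u0304na"        # same word spelled with U+0061 U+0304

    naive = {sample}
    robust = NormalizedLexicon([sample])

    print(typed in naive)       # plain set lookup on raw code points
    print(typed in robust)      # lookup after normalizing both sides
```

This helps only with your own code, of course. It does nothing for
third-party programs that compare code points directly.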

With mathematical notations, the main question is usually which software
and format you will use for them. Plain text is rather unsatisfactory
for anything beyond elementary school math. If you decide to use e.g.
TeX-based systems (LaTeX, AMSTeX) or MS Word formula features or MathML
or MathJax, for example, then this more or less makes the decision about
diacritics for you.

Yucca
Received on Fri Dec 06 2013 - 08:34:13 CST
