Re: phonetic superscripts, etc.

From: John Cowan (cowan@locke.ccil.org)
Date: Tue Jul 06 1999 - 12:53:45 EDT


Edward Cherlin wrote:

> If IPA
> can be extended to a closed set that will never ever need further
> extensions, then my *opinion* is that it should all be in Unicode. If it
> requires the possibility of arbitrarily large future extensions, then it is
> equally my *opinion* that Unicode should have enough of its base
> characters, and Unicode fonts should have enough glyphs, to render them
> all, but that the entities should not in that case be made into Unicode
> characters, and that an IPA text datatype should be defined by the
> linguistics community as a separate standard.

It is a given that IPA requires what ISO 10646 calls Level 3
implementation, i.e. full support for combining characters.
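
As a rough illustration of why combining-character support matters
here, the sketch below uses Python's standard unicodedata module.
The choice of a nasalized open e as the sample is mine, not from
the thread; it is one of many IPA sequences that have no precomposed
form, so normalization cannot collapse it to a single code point.

```python
import unicodedata

# Sample chosen for illustration: IPA open e (U+025B) followed by
# COMBINING TILDE (U+0303), i.e. a nasalized vowel.
nasal_open_e = "\u025b\u0303"

# NFC composes wherever a precomposed character exists. None exists
# for this pair, so the text stays a base letter plus a combining mark.
composed = unicodedata.normalize("NFC", nasal_open_e)

for ch in composed:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}  "
          f"combining class {unicodedata.combining(ch)}")

# For contrast: plain "a" plus the same tilde does have a precomposed
# form (U+00E3), so NFC yields a single code point there.
a_tilde = unicodedata.normalize("NFC", "a\u0303")
print(len(composed), len(a_tilde))
```

An implementation that only handles precomposed characters can
display the second case but not the first, which is the point of
requiring Level 3.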

Beyond that, I think it unlikely that the set of IPA base
characters will grow very much in future. There are, however,
phonetic standards other than the IPA that still have currency,
for reasons not very different from those behind the existing
proliferation of natural language scripts (why aren't Latin
and Cyrillic just extended varieties of Greek, after all?).
These deserve support because they are in active use by
non-IPA phoneticians.

Some of these characters, notably the Americanist ones,
are in Unicode 2.0, and Unicode 3.0 adds support for the
"disordered speech" set promulgated by IPA-the-organization
but not in IPA-the-standard. In any event, I doubt that anything
as large as 1000 characters is contemplated by anybody.

> If IPA and Greek are to be mixed, but remain distinguishable, you will have
> to use markup, just as if you had mixed Greek and Coptic.

There is, indeed, a proposal to deunify Greek and Coptic, on the
grounds that unifying them was a mistake. See Michael Everson's
Coptic vs. Greek vs. Cyrillic page at
http://www.indigo.ie/egt/standards/cy/coptic.html and the
actual Coptic proposal at
http://anubis.dkuug.dk/JTC1/SC2/WG2/docs/n1658.htm .

In short, Coptic is actually rather more legible in a typical
lowercase Cyrillic font than a typical lowercase Greek one.

> Unicode cannot carry the burden of all possible semantics for a particular
> character. We cannot do a correct linguistic sort on Unicode plain text
> with no language markers,

It's not clear that language markers are either necessary or
sufficient. Typically, the correct sort order is the one chosen
by the end user of the data, not the order "native" to the data
itself. Anglophones are best off if O WITH DIAERESIS is sorted in with
O even for data natively in Swedish (e.g. proper names); Swedish users
are better off if U WITH DIAERESIS is sorted in with Y even for
data natively in German. It's all in what the end user expects.
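
A minimal sketch of that idea in Python, assuming two hand-written
sort keys rather than a real collation library: english_key,
swedish_key and the sample names are all hypothetical, and the
Swedish key is deliberately simplified (it only knows that a-ring,
a-umlaut and o-umlaut follow z and that u-umlaut files with y).

```python
import unicodedata

def english_key(s):
    # Anglophone expectation: ignore diacritics, so O WITH DIAERESIS
    # sorts in with O. Decompose, then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", s.casefold())
    return "".join(c for c in decomposed if not unicodedata.combining(c))

# Simplified Swedish alphabet: a-ring, a-umlaut, o-umlaut come after z.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"
SWEDISH_RANK = {c: i for i, c in enumerate(SWEDISH_ALPHABET)}

def swedish_key(s):
    # Swedish expectation: U WITH DIAERESIS is filed with Y, while
    # O WITH DIAERESIS is a separate letter at the end of the alphabet.
    ranks = []
    for c in unicodedata.normalize("NFC", s.casefold()):
        if c == "\u00fc":
            c = "y"
        # Anything outside the alphabet sorts after all letters.
        ranks.append(SWEDISH_RANK.get(c, len(SWEDISH_ALPHABET)))
    return ranks

# Mixed German and Swedish proper names; the data is the same for
# both users, only the key reflecting their expectations differs.
names = ["M\u00fcller", "\u00d6berg", "Olsen", "Myrdal",
         "Zorn", "Mann", "Mumm"]

english_order = sorted(names, key=english_key)
swedish_order = sorted(names, key=swedish_key)
print(english_order)
print(swedish_order)
```

With these names the Anglophone key files the u-umlaut name before
Mumm and the o-umlaut name before Olsen, while the Swedish key puts
the former after Mumm and the latter last. Neither key consults any
language tag on the data.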

> The most fun cases are languages written by their users of different
> cultures or at different times in two or more scripts, including, but not
> limited to,

There is an ISO working group trying to standardize much of this,
but it's not a character set issue (it's just as important for
handwritten materials).

-- 
John Cowan	http://www.ccil.org/~cowan		cowan@ccil.org
   Schlingt dreifach einen Kreis um dies! / Schliesst euer Aug vor heiliger Schau,
   Denn er genoss vom Honig-Tau / Und trank die Milch vom Paradies.
			-- Coleridge / Politzer


