Codicology Meets Unicode: The Need for Southeast Asian Extensions

Christian Bauer - Humboldt University

Intended Audience: Software Engineers, Font Designers, Content Developers, Graphic Designers, Technical Writers, Librarians, Archivists
Session Level: Beginner, Intermediate, Advanced

The transcription of Southeast Asian manuscripts and epigraphs into their modern scripts presents the 'digital archivist' with a number of interesting coding problems.

These reveal contradictory coding and implementation practices in Unicode, plain oversights by national standardizing bodies, and language identification issues.

My aim is to make a case for Southeast Asian 'archival' extensions, which may be integrated into existing blocks, such as U+0E00-U+0E7F (Thai) and U+1780-U+17FF (Khmer).
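As a rough illustration of how much room the two blocks still offer, the following Python sketch lists the code points that the interpreter's bundled Unicode data leaves unassigned. The block ranges are the ones cited above; the helper names `block_of` and `unassigned_in` are hypothetical, and the counts depend on the Unicode version shipped with the Python build in use.

```python
import unicodedata

# Block ranges as cited in the abstract.
BLOCKS = {
    "Thai": (0x0E00, 0x0E7F),
    "Khmer": (0x1780, 0x17FF),
}

def block_of(cp):
    """Return the name of the block containing code point cp, or None."""
    for name, (lo, hi) in BLOCKS.items():
        if lo <= cp <= hi:
            return name
    return None

def unassigned_in(block):
    """List code points in the block that have general category Cn (unassigned)."""
    lo, hi = BLOCKS[block]
    return [cp for cp in range(lo, hi + 1)
            if unicodedata.category(chr(cp)) == "Cn"]

if __name__ == "__main__":
    for name in BLOCKS:
        free = unassigned_in(name)
        print(name, len(free), "unassigned:",
              " ".join(f"U+{cp:04X}" for cp in free))
```

Whether any of these free positions would actually be suitable for 'archival' characters is a matter for a formal proposal; the sketch only shows where gaps exist.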

My presentation will focus on Thai and Khmer, not only because of their vast textual resources but also because Khmer followed Thai scribal practices in the post-Angkorian period, while Thai was itself written in Khmer script, in certain regions down to the late 19th century.

One may argue that special characters encountered in manuscripts should be relegated to the 'Private Use Area' or, alternatively, that such codicological requirements could be met in XML by simply tagging the characters, using the modular DTD of the Text Encoding Initiative (TEI). The latter would require a partial rewriting of the DTD, which TEI does not allow; the former ignores the fact that the Thai and Khmer blocks already provide code points for some such 'archaic' characters.

Thus U+0E03 and U+0E05 are provided in Unicode but are so obsolete in modern Thai that they do not even appear in recent dictionaries.

By the same token, texts in which these two characters occur also feature diacritics that may be rendered as U+0E48 and U+0E4B but fulfill an entirely different function: the latter may serve as a glyph variant of U+0E49, while the former may not be a tone mark at all but may instead indicate syllabicity (an anaptyctic vowel or epenthesis).
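The Thai code points mentioned above can be inspected with Python's standard `unicodedata` module. This is only a sketch: the `describe` helper is hypothetical, and the properties it prints are those of the modern standard, which is precisely why they cannot capture a manuscript-specific function of the same mark.

```python
import unicodedata

# The two obsolete letters and the three tone marks cited in the text.
LETTERS = ["\u0E03", "\u0E05"]
MARKS = ["\u0E48", "\u0E49", "\u0E4B"]

def describe(ch):
    """Return (code point, Unicode name, general category, combining class)."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.combining(ch),
    )

if __name__ == "__main__":
    for ch in LETTERS + MARKS:
        print(*describe(ch))
```

Plain text stores only the code point, so a U+0E48 that marks syllabicity in a manuscript is indistinguishable from one that marks tone unless additional markup or a dedicated code point records the difference.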

There are similar cases in Khmer, such as U+17C9, which is semantically ambiguous and may also be subscripted, or the combination of U+17B6 and U+17D0 following characters other than U+17A4.
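A transcriber wishing to locate the second Khmer case in an existing corpus might scan for it along the following lines. The function name `find_aa_samyok` is hypothetical, the sketch assumes the two signs are stored in the order U+17B6, U+17D0, and it reports the character immediately preceding them, which will not be the base consonant if subscript sequences intervene.

```python
KHMER_QAA = "\u17A4"      # KHMER INDEPENDENT VOWEL QAA
VOWEL_AA = "\u17B6"       # KHMER VOWEL SIGN AA
SAMYOK_SANNYA = "\u17D0"  # KHMER SIGN SAMYOK SANNYA

def find_aa_samyok(text):
    """Return (index, preceding character) for each U+17B6 U+17D0 pair
    whose preceding character is not U+17A4."""
    hits = []
    for i in range(len(text) - 2):
        if (text[i + 1] == VOWEL_AA
                and text[i + 2] == SAMYOK_SANNYA
                and text[i] != KHMER_QAA):
            hits.append((i, text[i]))
    return hits

if __name__ == "__main__":
    sample = "\u1780\u17B6\u17D0 \u17A4\u17B6\u17D0"
    for index, base in find_aa_samyok(sample):
        print(index, f"U+{ord(base):04X}")
```

Such a scan only finds candidates; deciding what the combination means in a given manuscript remains an editorial judgment.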

New extension sets for the coding of manuscripts and epigraphs will be proposed.

