From: Philippe Verdy (email@example.com)
Date: Wed Jun 25 2003 - 14:34:18 EDT
On Wednesday, June 25, 2003 8:14 PM, Peter Lofting <firstname.lastname@example.org> wrote:
> At 7:41 PM +0200 6/25/03, Philippe Verdy wrote:
> > If there are real distinct semantics that were "abusively" unified
> > by the canonicalization, the only safe way would be to create a
> > second character that would have another combining class than the
> > existing one, to be used when lexical distinction from the most
> > common use is necessary.
> > So the added character for the modified vowel signs would have the
> > same representative glyph, but would have the additional semantic
> > "contraction" (clearly indicated in their name). This does not break
> > the existing encoding of most texts, but allows a specific usage for
> > contractions where the existing canonical equivalences would be
> > inappropriate.
> How do you envisage this getting into the data?
> Often in Tibetan data capture, operators are keying in the appearance
> of a text and do not know what a stack represents.
> So the data then requires expert review after input to verify and
> assign the semantic representation.
This is not a major problem; in fact it happens every day in all scripts: there are proofreaders, and dictionary-based corrections can help fix "incorrectly" or ambiguously encoded strings...
This is true even for Latin-based languages, where accents may be wrong or missing; only native readers, using their own knowledge of the language, can spot the incorrect interpretation of a grapheme cluster when such an "error" (introduced by some intermediate technical constraint, such as a formerly missing standard) appears.
I still think that the contraction "problem" has a limited impact: it does not affect the normal written form of the Tibetan language, which clearly uses a single interpretation. If both interpretations of a grapheme cluster are needed, then we should keep the encoding of the existing characters for the most common interpretation (without the contraction semantics), and assign a variant specifically to allow encoding the other interpretation or reading of the grapheme cluster.
Legacy encoded text may still contain ambiguous encodings that will look erroneous under the updated standard, but this offers a way to correct the encoded text later: scan for occurrences of such ambiguous sequences and let actual native readers correct their interpretation, if the correction is absolutely required for some text processing.
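A minimal sketch of such a scan, assuming the ambiguous stack is the Tibetan vowel sign II case (the sequence <U+0F71, U+0F72> is canonically equivalent to the precomposed U+0F73, so after normalization the "contraction" reading can no longer be distinguished from the plain vowel reading); the flagged occurrences would then be queued for expert review:

```python
import unicodedata

# Example ambiguous stack: U+0F73 TIBETAN VOWEL SIGN II canonically
# decomposes to <U+0F71, U+0F72>, so either spelling ends up identical
# after normalization. Other stacks could be added to this set.
AMBIGUOUS = {unicodedata.normalize("NFD", "\u0F73")}  # == "\u0F71\u0F72"

def flag_for_review(text):
    """Yield (index, sequence) for each occurrence of an ambiguous stack,
    matching in NFD so either canonically equivalent spelling is caught."""
    nfd = unicodedata.normalize("NFD", text)
    for seq in AMBIGUOUS:
        start = nfd.find(seq)
        while start != -1:
            yield (start, seq)
            start = nfd.find(seq, start + 1)
```

Both the precomposed and the decomposed spelling of KA + vowel sign II are flagged at the same position, since the comparison is done on the normalized form.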
I do think that most already-encoded text will not need such correction, as long as the encoding is just a way to transport a text intended only to be rendered or printed, not subjected to automated lexical analysis. And even in that case, if the encoding ambiguity is well documented in a revision of the standard, tools such as automated full-text search engines could be enhanced to search for both encodings of the character, based on their actually identical glyphic representation.
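Such a search-engine enhancement could be a simple query expansion: wherever the query contains a character that has a glyph-identical variant, match either encoding. The variant codepoint below is invented purely for illustration (no such character has been assigned); only the expansion mechanism is the point:

```python
import re

# Hypothetical mapping: the existing vowel sign and a notional new
# "contraction" variant sharing its glyph. The variant codepoint here
# (private-use U+F073) is invented for this sketch.
GLYPH_EQUIVALENTS = {
    "\u0F73": "\uF073",
}

def expand_query(query):
    """Compile a regex that matches the query with either member of each
    glyph-equivalent pair in place of the original character."""
    parts = []
    for ch in query:
        alt = GLYPH_EQUIVALENTS.get(ch)
        parts.append("[" + ch + alt + "]" if alt else re.escape(ch))
    return re.compile("".join(parts))
```

A query for the ordinary spelling would then also find texts re-encoded with the contraction variant, without the searcher having to know which encoding a given document uses.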
This archive was generated by hypermail 2.1.5 : Wed Jun 25 2003 - 15:17:39 EDT