Re: Arabic - Alef Maqsurah

From: Peter_Constable@sil.org
Date: Thu Jul 15 1999 - 15:17:39 EDT


>Input methods, like presentation methods, should in my opinion be treated
>quite distinctly from encoding design.

This is absolutely true. It is equally true, though, that encoding should be
treated quite distinctly from things like morphological analysis.

Within my workgroup, we adopt the following model for text processing:

      -------           -----------
     | INPUT |         | RENDERING |
      -------           -----------
            \            /
              ----------
             | ENCODING |
              ----------
            /            \
  ------------         ----------
 | CONVERSION |       | ANALYSIS |
  ------------         ----------

The meanings of 'input' and 'rendering' are clear, I think; 'conversion' refers
to conversion between alternate encodings; 'analysis' is a grab bag of many
operations that one may want to do on text: searching, sorting, morphological
analysis, syllabification, hyphenation, etc. Of course, input and rendering can
be seen as just two other operations to be done on text (the former might be, e.g.,
an invertible mapping between sequences of character codes and sequences of
keystrokes; the latter might be, e.g., an invertible mapping between sequences
of character codes and sequences of positioned glyphs). We identify these
separately, though, because they are of particular importance in how text is
actually used; no matter what other kind of operation someone wants to do with
text, they almost always want to be able to key it and display it.
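
As a rough sketch of the 'input' case (the key names here are my own
inventions; a real keyboard layout is of course far more involved):

    # A toy illustration of 'input' as an invertible mapping between
    # keystroke sequences and character codes. The key names are made up.

    KEY_TO_CHAR = {
        "A": "\u0627",   # ARABIC LETTER ALEF
        "B": "\u0628",   # ARABIC LETTER BEH
        "Y": "\u064A",   # ARABIC LETTER YEH
    }
    CHAR_TO_KEY = {char: key for key, char in KEY_TO_CHAR.items()}

    def keystrokes_to_text(keys):
        # Forward mapping: keystrokes -> encoded characters.
        return "".join(KEY_TO_CHAR[k] for k in keys)

    def text_to_keystrokes(text):
        # Inverse mapping: encoded characters -> keystrokes.
        return [CHAR_TO_KEY[c] for c in text]

    assert text_to_keystrokes(keystrokes_to_text(["B", "A"])) == ["B", "A"]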

The key in all of this is that encodings must be designed to meet, as best as
possible, the needs of *all* the operations that the user expects to be able to
perform on the text. It is also essential to understand, though, that it may not
be possible to design a text encoding for a given script or language that
completely meets the needs of all operations users might want to perform.

For example, a user might want to encode English text so that one of the
operations they can perform on it is deriving pronunciation (assuming some
dialect). It would be possible to design an encoding of English that contained
information about pronunciation, but without dictionary lookup that encoding
might not also be able to support display of conventional English spelling, and
keyboard input would be far from what an English speaker would expect. As a
result, one must decide what the priorities for the encoding should be.
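
To make the trade-off concrete, here is a contrived sketch; the 'phoneme'
strings are ad hoc inventions of mine, not any real transcription system:

    # Encode English text by its pronunciation and conventional spelling
    # can no longer be recovered without a dictionary -- and even then
    # not always unambiguously.

    SPELLING_TO_SOUND = {
        "read (present tense)": "riyd",
        "read (past tense)":    "red",
        "red (the colour)":     "red",
    }

    # Given only the pronunciation-encoded form "red", a display process
    # cannot choose the spelling without context:
    candidates = [w for w, s in SPELLING_TO_SOUND.items() if s == "red"]
    print(candidates)   # both 'read (past tense)' and 'red (the colour)'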

For a lot of users, rendering is far and away the biggest priority. When you're
limited to using software that provides no support for complex relationships
between encoding and rendering (i.e. "dumb" renderers, which support only 1:1
relationships between character codes and glyphs) and rendering is your top
priority, then you design an encoding
around the needs of presentation. This was the whole reasoning behind the recent
discussion on dotless j (at least, initially that's how I understood it). But an
encoding that is optimised for rendering may make other operations more complex.
One has to consider the pros and cons and decide what approach to encoding will
best meet all of the needs for which the encoding will be used. Inevitably there
will be some compromises, though.
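
Arabic itself illustrates this: Unicode's Arabic Presentation Forms-B block
encodes positional variants directly, which suits a dumb renderer but
complicates a simple operation like searching. A rough sketch:

    # With a 'dumb' 1:1 renderer the positional forms of BEH are encoded
    # directly (Arabic Presentation Forms-B), so one letter ends up with
    # several codes and even a simple search must know them all.

    NOMINAL_BEH = "\u0628"   # ARABIC LETTER BEH, the plain-text character
    PRESENTATION_BEH = {
        "isolated": "\uFE8F",
        "final":    "\uFE90",
        "initial":  "\uFE91",
        "medial":   "\uFE92",
    }

    def contains_beh(text):
        # Under a presentation-form encoding, searching for 'the letter
        # BEH' means matching every positional variant, not one code.
        return any(c == NOMINAL_BEH or c in PRESENTATION_BEH.values()
                   for c in text)

    print(contains_beh("\uFE91\u0627"))   # True: initial BEH, then ALEF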

In the case of Unicode, a design decision was made at the outset that the
encoding must support round-trip conversion with existing encoding standards.
That has had an impact on its usefulness for other operations, though hopefully
that impact will, over time, prove not to be overly negative. Apart from that, I
think Unicode needs to be a general purpose encoding, able to support at least
the most common text operations, and beyond that as many operations as possible.
The supported operations must include conversion, input and rendering. Certain
"analysis" operations are also high priorities; searching and sorting are two
particular examples.
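
As a quick illustration of the round-trip requirement, using the Windows
Arabic code page (cp1256) as an example legacy encoding:

    # Round-trip check against one legacy standard: legacy bytes ->
    # Unicode -> legacy bytes must be lossless for the compatibility
    # guarantee to hold.

    legacy_bytes = bytes(range(0xC1, 0xD0))        # a run of Arabic letters in cp1256
    unicode_text = legacy_bytes.decode("cp1256")   # convert into Unicode
    assert unicode_text.encode("cp1256") == legacy_bytes   # and back, losslessly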

I think the question that should be asked with regard to the issues involving
Arabic that Gregg is raising is whether the operations he wants to be supported
are overly specific for a general purpose encoding, or whether they are a
reasonable expectation of this encoding.

While a design of the encoding of Arabic script based upon supporting things
like morphological analysis might not be appropriate for a general purpose
encoding, individuals certainly might want to allocate some PUA characters for
supporting such an operation in specific contexts (i.e. not all text should be
encoded using these characters, but in particular applications or with
particular corpora these characters would be used). If they think those characters
might be of wide enough interest to others, I would think it should be possible
for them to propose that those characters be added to the standard, even though
their use would be optional. (In some current work we're doing on Ethiopic
script, this is exactly the kind of thing I'm thinking of doing - creating
additional characters to be used along with Ethiopic characters in order to
support certain operations of importance to us - and I'm considering whether
they might be of enough interest to others to be included within the standard.)
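
A sketch of what I have in mind with the PUA (both the codepoint assignment
and the boundary-marking convention below are purely hypothetical):

    # A codepoint from the Private Use Area (U+E000..U+F8FF) serves as a
    # morpheme-boundary marker by private convention; it is stripped
    # before the text is interchanged with other applications.

    MORPHEME_BOUNDARY = "\uE000"   # hypothetical private-use assignment

    def strip_analysis_marks(text):
        # Recover plain interchange text by removing the private marks.
        return text.replace(MORPHEME_BOUNDARY, "")

    marked = "\u0643\u062A\u0628" + MORPHEME_BOUNDARY + "\u062A"
    print(strip_analysis_marks(marked))   # plain text, safe to interchange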

In my mind, this is how Unicode should be presented - as a general purpose text
encoding intended to support a variety of operations, and specific questions
about design of the standard or proposals for particular scripts must be
evaluated in terms of the ability to support various operations, of which
conversion, input and rendering are the most important, but not the only
essential operations.

I hope this is useful to this discussion.

Peter


