Re: Inherent "a"

From: Avarangal (avarangal@hotmail.com)
Date: Thu Apr 11 2002 - 17:23:10 EDT


Dear Doug Ewell, William Overington, James E. Agenbroad, and Maurice
Bauhahn,

Thank you all for the reply.

May I take U+0B85 as the official code point?

Some explanation of why a visible "a" is needed.
In Tamil:

a/
The dependent vowels "ai" and "au" have ligatures. In fact "au" and "ou" at
present use the same ligature.
(Additionally, forms of "ai" and "au" that are not ligatures are expected to
be introduced.)

b/
Diphthongs(?) such as "ae" and "ao" need to be represented linearly (without
ligatures).

d/
The use of a visible "a" with consonants is a necessity for educational
purposes.

e/
A design plan needs to be implemented, anticipating the possible use of a
visible "a" instead of the inherent "a" in the distant future.

Regards
Sinnathurai Srivas

>>
Avarangal

I'm not sure why you want a character for an "inherent a" which, in Indic
scripts, "exists" in any consonant unmarked by a vowel sign or virama -
perhaps you could describe your application. You could use 17B4 in the
Khmer block. Since Khmer is also an Indic script, this character essentially
has the properties you're looking for - though it's in a different block.
Another idea might be to use 0B85 (TAMIL LETTER A) + 093C (NUKTA)
or maybe 0B85 + 0F9D (VIRAMA). I don't know Tamil but I think these
combinations would not normally occur.
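
As a quick check on the code points suggested above, the names recorded in
the Unicode Character Database can be looked up. This is only a sketch using
Python's standard unicodedata module; the helper name describe() is made up
for this illustration, and the names returned depend on the Unicode version
bundled with the interpreter.

```python
import unicodedata

def describe(code_point: int) -> str:
    """Return 'U+XXXX NAME' for an assigned, named code point."""
    ch = chr(code_point)
    return f"U+{code_point:04X} {unicodedata.name(ch)}"

# Code points mentioned in the message above.
for cp in (0x0B85, 0x17B4, 0x093C):
    print(describe(cp))
# U+0B85 TAMIL LETTER A
# U+17B4 KHMER VOWEL INHERENT AQ
# U+093C DEVANAGARI SIGN NUKTA
```

Note that the nukta at U+093C belongs to the Devanagari block, so pairing it
with a Tamil letter mixes scripts within one cluster.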

There was once a proposal to encode an "inherent a" or "root marker" at 0F70
in the Tibetan block as some people thought this was necessary as Tibetan
syllables often contain silent prefixes and suffixes - but the primary
collation is on the main letter in a syllable (which may be the second or
third character in the string). In Tibetan the first consonant (or consonant
stack) marked by a vowel sign is the root of a syllable - but where there is
no vowel sign (i.e. an "inherent a") there is no "flag" to indicate the root
consonant so some thought it would simplify processing to have one.
A problem with this is that there would be no visible glyph for such a
character and if the consonant marked by such an invisible character
was deleted the inherent a character might get left behind and
consequently flag an adjoining character where it might not be wanted.

Also, since such a character is not necessary to display Tibetan properly,
chances are you'd wind up with some people/applications making use of this
character and others not using it - so you'd get two different strings for
the same word.
In the case of Tibetan, the root consonant in a syllable or word can be
determined by rules or a lookup - and in the end it was thought better to
leave it to applications to determine unmarked root consonants when they
needed to rather than having an inherent a character to mark them (which in
any case would require a rule-based system or lookup to insert reliably -
unless you leave it to users to type in). IMO in general use such a character
would probably cause more problems than it solved - though it might sometimes
be useful in private data.

- Chris

>>>
>
> While we're waiting for someone with better knowledge of Indic scripts
> to reply...
>
> 1. An *inherent* A wouldn't have its own code point, would it? I don't
> think of it as having an existence outside of the consonant it goes
> with. Tamil KA is U+0B95, which represents K plus the inherent A. If
> you wanted to represent only the K, you would use U+0B95 plus the Tamil
> virama, U+0BCD, to kill the A. But how could you represent an inherent
> vowel by itself?
>
> 2. Assuming you have an answer to #1 above, the only way "you" could
> allocate a Unicode code point for this character would be to use the
> Private Use Area. You could choose any code point from U+E000 to U+F8FF
> for this purpose. (There are unofficial assignments for some of these,
> but you are perfectly free to ignore them.) Do *not* assign a code
> point in the Tamil block, or anywhere else except the Private Use Area,
> even if it's only for temporary and internal use. To do so would be
> very non-conformant.
>
> -Doug Ewell
> Fullerton, California
>
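
The two points in the quoted message can be sketched in a few lines of
Python. This is an illustration only: the helper is_private_use() is a
made-up name, and the range it tests is the U+E000 to U+F8FF Private Use
Area given above.

```python
import unicodedata

KA = "\u0b95"      # TAMIL LETTER KA: K plus the inherent A
VIRAMA = "\u0bcd"  # TAMIL SIGN VIRAMA: kills the inherent A

def is_private_use(ch: str) -> bool:
    """True if ch lies in the Private Use Area U+E000..U+F8FF."""
    return 0xE000 <= ord(ch) <= 0xF8FF

# K without its vowel is still two code points: KA followed by the virama.
bare_k = KA + VIRAMA
print([unicodedata.name(c) for c in bare_k])

# A privately assigned character belongs in the PUA, not in the Tamil block.
print(is_private_use("\ue000"), is_private_use("\u0b85"))
```

The sequence shows why the question in point 1 arises: the encoding has a
code for removing the inherent vowel, but none for the inherent vowel alone.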
                                            Monday, April 1, 2002
There is always 0B85 for this vowel when it is not "inhering" to a
consonant.

     Regards,
          Jim Agenbroad ( jage@LOC.gov )
     "It is not true that people stop pursuing their dreams because they
grow old, they grow old because they stop pursuing their dreams." Adapted
from a letter by Gabriel Garcia Marquez.
     The above are purely personal opinions, not necessarily the official
views of any government or any agency of any.
     Addresses: Office: Phone: 202 707-9612; Fax: 202 707-0955; US
mail: I.T.S. Sys.Dev.Gp.4, Library of Congress, 101 Independence Ave. SE,
Washington, D.C. 20540-9334 U.S.A.
Home: Phone: 301 946-7326; US mail: Box 291, Garrett Park, MD 20896.

>>>>

I have no knowledge of the Tamil language and I am neither a member of the
Unicode Consortium nor a representative of the Unicode Consortium.

However, as a specific suggestion is sought for a temporary code point to
use for "inherent a", then as an individual end user of the Unicode system,
I suggest the following code point within the Private Use Area.

U+E7C0

This will not provide any exclusivity of definition for this code point,
however, it is in the lower part of the Private Use Area and is therefore
less likely to clash with code points in the Private Use Area used
internally within commercially available software, which uses tend to be in
the upper part of the Private Use Area in accordance with guidance notes in
the Unicode specification. The fact that a specific code point is being
suggested within this forum may possibly also mean that various people will
make a note of its use in their own lists of characters, so that, although
its use will not be an official Unicode Consortium allocation, people
interested in the use of the Private Use Area may well make a note of the
usage.

My reason for suggesting this particular code point is that I am producing
some code points for research, and hopefully application, and have a block
of special characters from U+E700 through to U+E7FF, including U+E707 for a
ct ligature. I am looking at including a set of long s ligatures such as
LONG S B and LONG S L and so on. I have not yet finalized the codes, yet
they will be above U+E707 as I am not using U+E700 through to U+E706 at all,
so that this section of ct and long s characters dovetails with the
Alphabetic Presentation Forms, U+FB00 through to U+FB06, in the hope that
the ct and long s characters that I suggest might one day be promoted to the
Alphabetic Presentation Forms section.

The upper part of the 256 code point block from U+E700 through to U+E7FF is
presently unused in my use of the Private Use Area and so a section from
U+E7C0 through to U+E7FF would seem a good place to have a section for
various code points used for research.

I too am interested in how the inherent "a" character would be used. Does
it have its own glyph or is it a code that modifies something else, or what?

William Overington

1 April 2002

www.users.globalnet.co.uk/~ngo

-----Original Message-----
From: Avarangal <avarangal@hotmail.com>
To: unicode@unicode.org <unicode@unicode.org>
Date: Friday, March 29, 2002 9:33 PM
Subject: Inherant "a"

>I need to allocate a U+codepoint for inherent "a", to be used for Tamil
>research. Can anyone suggest a temporary location or is it possible to find
>such code point within the existing code point for Tamil.

>>>>>>
Why do you need to have a code for 'inherent a' in Tamil?

There is some imprecision concerning what constitutes an 'inherent' vowel.
In this note I am referring to normally unwritten vowels that are
nevertheless pronounced.

I know nothing about Tamil, but in Khmer Unicode there are two such inherent
"a" characters. A long inherent (native Khmer language) at U+17B5 and a
short inherent (Sanskrit/Pali) at U+17B4. Their encoding has raised some
outcry (in fact some parties are trying to deprecate them), but the more I
analyse grammars, dictionaries, and round-trip transliteration the more
importance they assume.

(1) If you look at a dependent vowel series in an Indic script...they often
start with an unwritten 'inherent a' character, recognising their unique
existence.

(2) If you transliterate between an Indic script and a Latin [or other
phonetic] transliteration, the inherent vowel must become explicit in the
transliteration (hence it would be extremely useful for round-trip
conversion reasons to have a code in the Indic encoding to match that).
Dependable round trip conversion of text is becoming increasingly important
when a single minority language spans national borders where government
authorities on opposite sides of the border insist the 'national' script of
their respective country be used to render that language.

(3) Not every consonant cluster that lacks an explicit dependent vowel also
contains an 'inherent a' (in particular in Khmer it is unpredictable from
the context [i.e., without a lookup] whether a final consonant cluster
without a dependent vowel has a pronounced inherent or not).

(4) Non-final clusters lacking an explicit dependent vowel 'always' (a
dangerous word to use!) have an 'inherent a', possibly short or long.

(5) Depending on the foreignness of the word an 'inherent a' in Khmer may be
short (foreign) or long (Khmer language)

(6) Dictionaries have to make the short 'inherent a' vowel explicit in their
pronouncing sections (usually borrowing U+17C8 to display it; however you
would not want to raise ambiguity by using that code both when it is
normally displayed and when it is there for making pronunciation clear)

(7) For phonetic rendering of an Indic script, therefore, it would be very
useful to selectively encode it. In the future data input and output will
increasingly move to verbal/aural, rather than keyboard means. This would be
quite an exciting development for Khmer...because Khmer is difficult to
keyboard and presumably relatively easy for a computer to recognise (what
with about fifty vowel/vowel-sign combinations that are easier for computers
to recognise than consonants). Hence, I would assume that codes to capture
verbal data converted to Unicode text will similarly become increasingly
important.

(8) 'Inherent a' is often used in combination with vowel-like signs such as
U+17C6 NIKAHIT, U+17C7 REAHMUK, U+17C8 YUUKALEAPINTU to generate vowels with
consonantal final sounds. Failure to recognise the 'inherent a' results in
wrongly interpreting those consonant-like signs as vowels. These vowel+sign
ligatures are in fact treated like unique vowels in sorting.
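
Point (2) above can be illustrated with a toy round-trip transliteration in
Python. Everything here is an assumption made for the sketch: the three
consonants and three vowel signs, the Latin spellings, the hyphen used as a
syllable separator, and the function names to_latin() and to_tamil(). It
also assumes the simple rule that a consonant with no vowel sign and no
virama carries the inherent "a", which point (3) says does not hold for
Khmer final clusters.

```python
# Toy tables only; a real scheme would cover the whole script.
CONSONANTS = {"\u0b95": "k", "\u0bae": "m", "\u0ba4": "t"}    # KA, MA, TA
VOWEL_SIGNS = {"\u0bbe": "aa", "\u0bbf": "i", "\u0bc1": "u"}  # signs AA, I, U
VIRAMA = "\u0bcd"

LATIN_CONSONANTS = {v: k for k, v in CONSONANTS.items()}
LATIN_VOWELS = {v: k for k, v in VOWEL_SIGNS.items()}

def to_latin(text: str) -> str:
    """Transliterate, writing the inherent vowel explicitly as 'a'."""
    out, i = [], 0
    while i < len(text):
        ch = text[i]
        if ch not in CONSONANTS:
            raise ValueError(f"unsupported character U+{ord(ch):04X}")
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if nxt == VIRAMA:              # vowel killed: bare consonant
            out.append(CONSONANTS[ch])
            i += 2
        elif nxt in VOWEL_SIGNS:       # explicit dependent vowel
            out.append(CONSONANTS[ch] + VOWEL_SIGNS[nxt])
            i += 2
        else:                          # unmarked: inherent 'a'
            out.append(CONSONANTS[ch] + "a")
            i += 1
    return "-".join(out)

def to_tamil(latin: str) -> str:
    """Reverse of to_latin(); an explicit 'a' becomes unwritten again."""
    if not latin:
        return ""
    out = []
    for syllable in latin.split("-"):
        consonant, vowel = syllable[0], syllable[1:]
        out.append(LATIN_CONSONANTS[consonant])
        if vowel == "":
            out.append(VIRAMA)
        elif vowel != "a":
            out.append(LATIN_VOWELS[vowel])
    return "".join(out)

word = "\u0b95\u0bae\u0bcd\u0ba4\u0bbf"  # KA, MA + virama, TA + sign I
print(to_latin(word))                    # ka-m-ti
print(to_tamil(to_latin(word)) == word)  # True
```

The round trip works here only because the presence of the inherent vowel is
fully predictable from the spelling. Where it is not, the Latin side carries
information that the Indic side has no code to store.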

There are arguments against using 'inherent' vowels.

(a) Unwritten characters tend to not be typed! And if they were, the data
stream length would grow remarkably.
(b) Binary comparison of words with and words without 'inherent' vowels
would be problematic
(c) The average user would probably not gain advantage from the inclusion of
'inherent' vowels in the text stream
(d) I could not find more than one instance in the authoritative Chuon Nath
Khmer dictionary where two words otherwise spelled the same were
distinguished by the length of their inherent vowels. It is hard to write a
sorting rule on one data point;-)
(e) Rendering mechanisms may not recognise the (rarely used) inherent code
and cause problems when it is used.

Hence, it would be preferred that the use of inherent vowels be sharply
circumscribed...but not eliminated altogether.
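
Arguments (a) and (b) can be made concrete with a short Python sketch. It
assumes, purely for illustration, that an explicit inherent vowel is stored
as the private-use code point U+E7C0 suggested elsewhere in this thread; the
constant INHERENT_A and the helper strip_inherent() are made-up names.

```python
INHERENT_A = "\ue7c0"  # hypothetical private-use marker for inherent 'a'

def strip_inherent(text: str) -> str:
    """Remove explicit inherent-vowel markers before comparing strings."""
    return text.replace(INHERENT_A, "")

plain = "\u0b95\u0bae"                                  # KA MA, vowels implicit
marked = "\u0b95" + INHERENT_A + "\u0bae" + INHERENT_A  # same word, vowels explicit

print(plain == marked)                  # False: binary comparison fails
print(strip_inherent(marked) == plain)  # True once the markers are removed
print(len(plain), len(marked))          # 2 4: the stream doubles in length
```

Any application that compares or searches such text would need a step like
strip_inherent() applied consistently on both sides.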

In summary, inherent vowels:

(1) Are characters in their own right
(2) Are needed for round trip script conversion (transliteration)
(3) Are not a trivial case: They are not contained in every consonant
cluster even when that cluster does not contain a visual dependent vowel
(4) Are useful for preserving phonetic value in dictionaries or
text-to-speech applications

Interested,

Maurice Bauhahn

-----Original Message-----
From:
Sent: 29 March 2002 19:38
Subject: Inherant "a"

I need to allocate a U+codepoint for inherent "a", to be used for Tamil
research. Can anyone suggest a temporary location or is it possible to find
such code point within the existing code point for Tamil.

Maurice Bauhahn
>>>>>>>>



This archive was generated by hypermail 2.1.2 : Thu Apr 11 2002 - 14:54:57 EDT