Re: Romanized Singhala - Think about it again

From: Philippe Verdy <>
Date: Mon, 9 Jul 2012 03:14:53 +0200

2012/7/9 Naena Guru <>:
>> Using Latin letters for a transliteration of Sinhala is not a hack, but
>> making fonts said to be Latin-1 with Sinhalese letters instead of the Latin
>> letters is a hack.

Your hack is a hack, simply because you have absolutely not understood
anything about what Unicode is, and you keep confusing concepts.
It's true that Unicode and ISO/IEC 10646 need to use a terminology
that may not be understood the way you mean it or use it. That's why
they include definitions of these terms. Don't interpret the
terminology in a way different from what is defined.

> Well, you can characterize the smartfont solution anyway you like. The
> problem for you is that it works!

No, it does not work, because you seem to assume that we can always
select the font. In most cases we cannot. Letters are encoded and
given unique code points, but the font used to render them is
determined externally by the renderer. In most cases users won't want
to have to guess which font to use, notably when these fonts are also
not available on their platform.

There's a huge body of text outside HTML and rich-text document
formats, and you absolutely want to ignore it. The UCS exists to allow
exactly this separation between the presentation (fonts, for example)
and the semantics of the encoded text.

The UCS is also designed to avoid any dependency on languages.
Only the scripts are encoded (see the description of what is defined
as "abstract characters").

An encoding is not just a collection of bits in fixed-width numbers.
Otherwise we would only see numbers on screen. The code points in the
UCS are given semantics via character properties.

- The "representative glyph" seen in Charts is only a very tiny part
of these properties, and in fact the least used of all of them. They
are only useful for producing visual charts.
- What is more imporant is how each distinctive code is behaving
within various mappings to support various algorithms. Including the
possibility to switch fonts transparenlty without breaking the text
completely (for example displaying a Greek Theta when a Latin Z awith
acute was encoded, or even a Latin X when a Latin R was encoded). The
encoding is what allows words and orthographies to be recognized,
still independantly of the font styles and other optional typographic
effects (because all scripts are made of an almost infinite number of
possible styles, that users will still read as part of the script
while still also recognizing the orthography used and the language).
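The point about properties can be made concrete with Python's standard
unicodedata module, which exposes a slice of the Unicode Character
Database (a minimal sketch; the two sample letters are illustrative):

```python
import unicodedata

# Each code point carries machine-readable properties well beyond its
# representative chart glyph: a formal name, a general category, etc.
sinhala = "\u0d85"  # a Sinhala letter
latin = "\u00e9"    # a Latin letter

print(unicodedata.name(sinhala))      # SINHALA LETTER AYANNA
print(unicodedata.category(sinhala))  # Lo (Letter, other)

print(unicodedata.name(latin))        # LATIN SMALL LETTER E WITH ACUTE
print(unicodedata.category(latin))    # Ll (Letter, lowercase)
```

Algorithms that read these properties (case mapping, line breaking,
collation) can never confuse the two letters, whatever fonts are
installed.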

Unicode and ISO/IEC 10646 do not encode glyphs directly in the UCS.
They do not encode orthographies, and they do not encode languages.
What is encoded is a set of correlated properties. One of these
properties is a numeric property named "code point", which is itself
independent of the final binary encoding (that could be one of the
standard UTFs or even a legacy 8-bit encoding with a mapping to/from
the UCS)!

> Sorry for this Kindergarten lesson, but you should understand the role of
> the font. A font is a support application at the User Interface level.

Yes. But Unicode does not really care which font you will use,
provided that fonts map glyphs coherently, in such a way that Sinhalese
letters will not be rendered instead of the intended Latin letters
EVEN if a Sinhalese font has been selected.

> When text moves between
> applications and between computers, they travel as numeric codes
> representing the text in the form of digital bytes. The computer can't say
> French from Singhala.

Not relevant to our discussions on this Unicode mailing list. We
don't care about that and SHOULD not even care. Unicode supports
a wide range of possible binary encodings. They don't change, however,
the code point assignments, which are the central point from which all
other properties are mapped in all applications, including for
rendering (but not limited to it).

>> Oh, thank you for the generosity of allowing me use of the entire Latin
>> repertoire. You don't have to tell that to me.

We need to tell it to you again because you absolutely want to
restrict the repertoire to an 8-bit subset, while you ALSO
contradictorily say that you want to support thousands of aksharas.

Unicode supports over a million code points and tens of millions of
glyphs (possibly more) using a 21-bit encoding space (actually less
than 20 bits if we leave aside the PUA, which is also supported
separately but with an extremely free encoding and almost no standard
properties). This space is still representable with various encodings
(some are part of the Unicode and ISO/IEC 10646 standards, some are
supported in the references, and there are tons of others, including
many legacy 7-bit or 8-bit SBCS encodings, from ISO or from
proprietary platforms, or from national standards not part of ISO,
e.g. those developed in the PR of China such as GB18030, or in India
such as ISCII, plus many older standards that have since been
deprecated and are no longer recommended).

But ISO 8859-1 is DEFINITELY NOT a mere set of 8-bit numeric values.
You are confusing anonymous "bytes" with a text encoding. The
ISO 8859-1 encoding has a strong and unambiguous mapping to/from the
UCS, such that when it is used, ONLY the Latin letters will be
displayed (and not every possible one).
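That strong mapping is easy to verify in Python (a sketch; the codec
name follows common IANA usage):

```python
# ISO 8859-1 is not anonymous bytes: each byte value 0x00..0xFF maps
# unambiguously to the UCS code point with the same numeric value.
for b in range(256):
    assert ord(bytes([b]).decode("iso-8859-1")) == b

# And nothing outside U+0000..U+00FF is reachable: a Sinhala letter
# simply has no ISO 8859-1 representation.
try:
    "\u0d85".encode("iso-8859-1")  # SINHALA LETTER AYANNA
except UnicodeEncodeError as e:
    print("not representable:", e.reason)
```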

It is strictly IMPOSSIBLE to encode any Sinhalese letter (or diacritic,
or digit, or ligatured conjunct, or any more complex akshara, or
simpler mora) using ISO 8859-1. Trying to do so clearly violates
the ISO 8859-1 standard (and ultimately the UCS standards, due to the
strong mapping from ISO 8859-1 to the UCS).

Your attempt to do that is therefore clearly a hack (which also
violates the separation of the encoding from the rendering style, and
limits you to ONLY some specific rich-text formats). It just
complicates things everywhere, because nothing in your proposal will
properly allow identifying the language, and authors will be restricted
to rich-text formats that allow them to use very specific fonts that
are not portable to all platforms. Your proposal also requires them to
tag these specific fonts everywhere, and to switch fonts constantly in
documents. Creating composite documents that mix content from various
sources will be almost impossible. In addition, it will severely limit
them in their design choices.

Users will be required to use only visual renderers. (Aural or Braille
renderers won't work at all; they do not base their transforms on
font styles found in rich-text documents.) It will be impossible to
name files in a filesystem (filenames don't have any capability to
specify a required font name). Your system is therefore NOT accessible.

-- if you don't remember what "mojibake" is, look the term up in a
search engine...

Mojibake will immediately reappear. This problem, which exploded in
the 1980s, was largely due to the need for interoperability across
operating systems. With the development of the UCS and its now almost
universal adoption, mojibake is a disappearing problem (but it's
a long process, because there are still tons of legacy applications
based on legacy SBCS/DBCS encodings, some of them not even endorsed by
a national or international published standard, or defined in
proprietary systems and not documented publicly at all).
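What mojibake looks like mechanically, in a short Python sketch (the
sample string is illustrative):

```python
# Bytes written under one encoding, read under another: classic mojibake.
text = "café"
garbled = text.encode("utf-8").decode("iso-8859-1")
print(garbled)  # cafÃ© -- the UTF-8 pair 0xC3 0xA9 read as two Latin-1 characters

# With the correct labels the damage is reversible here, but only
# because ISO 8859-1 happens to accept every byte value; most
# mislabelings lose data outright.
assert garbled.encode("iso-8859-1").decode("utf-8") == text
```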

> I have traveled quite a bit
> in the IT world. Don't be surprised if it is more than what you've seen.

Visibly not. First, you don't even know what the ISO 8859-1 standard
is that you want to break, when it has strictly no problem in relation
to the newer UCS system.

>> My solution is supported by two standards: ISO-8859-1 and Open Type.
>> ISO-8859-1 is Basic Latin plus Latin-1 Extension part of Unicode standard.
>> It is not supported by ISO-8859-1. ISO-8859-1 is for Latin letters, not
>> Sinhalese ones.

NO. NO, NO. ISO 8859-1 is definitely NOT usable for encoding any
Sinhalese letters or script. Maybe it's usable in a romanization
of the Sinhalese language, but this is an orthogonal problem on which
Unicode does not mandate or standardize anything.

>> Bottom line is this: If Latin-1 is good enough for English and French, it
>> is good enough for Singhala too.
>> No, because Sinhala is not written with Latin letters.
> Declarations like that won't work in a technical discussion. You need to
> explain. Singhala is a language. Singhala native SCRIPT is the traditional
> way it is written.

And everywhere you are attempting to mix the concept of the language
with the concept of the script and writing system. This stops your
argument immediately there.

> When I write Jean I really entered the four code points:
> 74 101 97 and 110. When you write naena, you enter 110 97 101 110 and 97. We
> think the former is a name of a pretty girl and the latter is a name I made
> up not in a particular language.

These are not code points. Start reading the standard (notably Chapter
3) to understand the concepts.
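The distinction the standard draws can be sketched in Python
(illustrative; "Jean" is the example from the quoted message):

```python
# The decimal values 74 101 97 110 are bytes of one particular encoding
# of "Jean", not the identity of its characters. Unicode identifies
# code points in U+XXXX notation, independent of any byte serialization.
word = "Jean"
print([f"U+{ord(c):04X}" for c in word])  # ['U+004A', 'U+0065', 'U+0061', 'U+006E']

# The same four code points serialize to different byte sequences:
print(list(word.encode("ascii")))      # [74, 101, 97, 110]
print(list(word.encode("utf-16-be")))  # [0, 74, 0, 101, 0, 97, 0, 110]
```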

>> And if Open Type is good for English and French, it is good for Singhala
>> too.
>> Of course.
> Thank you for that.

But this does not prove anything about all the rest you said.
OpenType can work for the Sinhalese script, provided that you don't
use it to infer ligatures of the Sinhalese script or Sinhalese writing
system from the encoding of Latin letters (in ISO 8859-1 or any other
binary encoding mapped to the UCS). OpenType already works with the
Sinhalese script using the UCS-encoded characters of the Sinhalese
script, and without even using any intermediate romanization
system (which is always lossy...).
Received on Sun Jul 08 2012 - 20:16:43 CDT

This archive was generated by hypermail 2.2.0 : Sun Jul 08 2012 - 20:16:43 CDT