Re: Romanized Singhala - Think about it again

From: Jean-François Colson <>
Date: Mon, 09 Jul 2012 05:20:45 +0200

Le 09/07/12 01:29, Naena Guru a écrit :
> Jean-François,
> Let me approximate it in Romanized Singhala: 'jyó-frósvaa'. Just trying...
I don’t know how that transcription should be pronounced but in IPA,
Jean-François is /ʒɑ̃.fʁɑ̃.swa/.

> Thank you for your interest. See inline responses.
> On Thu, Jul 5, 2012 at 7:35 AM, Jean-François Colson <
> <>> wrote:
> Le 05/07/12 10:02, Naena Guru a écrit :
>> On Wed, Jul 4, 2012 at 11:33 PM, Philippe Verdy
>> < <>> wrote:
>> Anyway, consider the solutions already proposed in Sinhalese
>> Wikipedia. There are various solutions proposed, including
>> several
>> input methods supported there. But the purpose of these
>> solutions is
>> always to generate Sinhalese texts perfectly encoded with
>> Unicode and
>> nothing else.
>> Thank you for the kind suggestion. The problem is Unicode Sinhala
>> does not perfectly support Singhala!
> What’s wrong? Are there missing letters?
> Many, many.
>> *The solution is for Sinhala not for Unicode!*
> Or rather for Sinhala by Unicode.
> Sure, if you want to do it with proper deliberation.
>> **I am not saying Unicode has a bad intention but an
>> ill-conceived product.
> What precisely is ill-conceived?
> Anglo-centric thinking is what is wrong.

> Let me take you on the scenic route:
> Number of letters in Singhala is only theoretical. In the case of
> Singhala orthography, the actually used number depends on the Sanskrit
> vocabulary.
Do you mean there are many conjunct consonants, sometimes with a
separate glyph?
In Devanagari, they’re made by typing two or more consonants separated
by halants. Isn’t that possible with Sinhala?

> My test font has about 1500.
IIRC there are only 191 code-points in Latin-1. Or 96? I’m not sure
whether ASCII is part of Latin-1.

> I need much more. Pali orthography enforces touch-letter rule (see
> later). Modern Singhala, meaning 1st century onwards, is an admixture
> of Singhala and Sanskrit. Pali is not mixed into Singhala text (except
> to quote like a foreign language).
> Generally, the Unicode approach is to treat the consonants as base
> shapes. Then the vowel signs are added around them. The vowel signs
> have their own codepoints. We hit upon the first problem here because
> there are two possible codepoints for each single-mora vowel,
> double-mora vowel and each diphthong. We lose all hopes of traditional
> search, replace etc. It complicates collation too.
In my first name, I usually type the ç as U+00E7 because I have a Ç key
on my keyboard, but using the two code-points U+0063 U+0327 (c +
combining cedilla) would be correct too, and in much software the search
function won’t find that ç if I’m looking for the precomposed ç.
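
The two spellings of ç can be compared with a small Python sketch. It uses only the standard library; the helper name same_text is made up for illustration. It suggests that a naive comparison treats the two forms as different, while normalizing both sides first (here to NFC) lets a search succeed.

```python
import unicodedata

precomposed = "Fran\u00e7ois"   # ç as the single code point U+00E7
decomposed = "Franc\u0327ois"   # c (U+0063) followed by combining cedilla (U+0327)

def same_text(a: str, b: str) -> bool:
    """Compare two strings after canonical normalization (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# A plain comparison sees two different code point sequences.
naive_match = precomposed == decomposed

# After normalization the two spellings are the same text.
normalized_match = same_text(precomposed, decomposed)

# The same holds for searching for the precomposed ç.
found_naive = "\u00e7" in decomposed
found_normalized = "\u00e7" in unicodedata.normalize("NFC", decomposed)

print(naive_match, normalized_match, found_naive, found_normalized)
```

Software that normalizes both the document and the search string before comparing should not have this problem.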

> Then they went up the gum tree of the notion that the Singhala
> consonant is actually a consonant with a vowel inside it -- an
> absurdity -- the Abugida theory. They then added two ligatures without
> normalizing them. Singhala has 15 ligatures in that category of
> ligatures. They included upadhmAnIya and left out jihvAmUlIya.
Can’t those ligatures be typed as separate signs with a zero-width
joiner? I’ve seen there’s one on Sinhalese keyboards.

> That concludes that Unicode Sinhala is not grammar compliant.
What do you mean?

> This is the first requirement. It is not Unicode compliant because it
> has canonicals that are not normalized.
Can’t you describe in detail those missing “canonicals” to make a proposal?

> It has a jumble of Singhala letters and signs and duplicates the same
> phoneme.
Duplicates the same phoneme…
Do you mean there are characters which are encoded twice? In this case,
you can ignore one of them.
Do you mean vowels and diphthongs can be written both as separate letters
and as diacritics after a consonant? In this case, that works like in
Indic scripts. I don’t see anything wrong in that.
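
As a rough illustration of that distinction, the Python sketch below looks up one independent vowel letter and the matching dependent vowel sign in the Unicode character database that ships with Python. The helper name describe is an assumption made for this example.

```python
import unicodedata

independent_aa = "\u0D86"  # independent vowel letter (SINHALA LETTER AAYANNA)
sign_aa = "\u0DCF"         # dependent vowel sign (SINHALA VOWEL SIGN AELA-PILLA)
ka = "\u0D9A"              # a consonant letter to attach the sign to

def describe(ch: str):
    """Return (code point, general category, character name) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))

for ch in (independent_aa, sign_aa):
    print(describe(ch))

# A consonant followed by the vowel sign is stored as two code points.
syllable_kaa = ka + sign_aa
print([f"U+{ord(c):04X}" for c in syllable_kaa])
```

The letter is reported as a letter and the sign as a combining mark, which is the same arrangement the other Indic blocks use.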

> It is good only for the trash can.
Isn’t that an exaggeration?

> Here are the considerations for a successful encoding:
> A consonant is called a 'hal akSara'. Check the Sanskrit dictionary:
> In the pre-printing tradition, adjoining consonants either had
> standard ligatures or they were written touching each other to
> indicate they are digraphs or trigraphs. The vowel signs surround
> these. When a consonant occurred at the end of a word, that was
> flagged by the halant sign. 'halant' means hal at end.
> With the advent of letterpress printing, touch letter technique became
> difficult or impossible to implement. So, now we have a new concept of
> 'antara' hal -- interior consonant. The modern orthography first
> honors the Sanskrit ligature rules and then drops the touch-letter
> rule. The ligatures are described in the following books:
> A.M. Gunasekara - A Comprehensive Grammar of the Sinhalese Language
> (1891) - pp 16-18, Rev Theodore G. Perera - The Sinhala Language
> (1932) -- 57 - 58.
> There is only one way that Singhala could be digitized to do justice
> to the continuation of its writing system and to smoothly support the
> past. That is to define the Vowels, diphthongs, consonants and
> prenasals as individual codepoints. You can give a codepoint each for
> the anusvara (ng sound) and visarga (the guttural postfix on vowels)
That’s what has been done in Unicode.

> as the earlier Sanskrit transliterations did or you could provide one
> codepoint each for the modified vowels.
That’s the Unicode dependent vowel signs.

> I did this latter in romanized Singhala.
What exactly did you do?

>> The fault is with Lankan technocrats that took the proposal as it
>> was given and ever since prevented public participation. My
>> solution is 'perfectly encoded with Unicode'.
> No. It’s an 8-bit character set independent from Unicode.
> To think 8-bit is outside Unicode is wrong.
Could you explain it?

> 8-bit character set is the core of Unicode.
I’m very surprised, speechless.

> It is the best place to park any language because it is the most
> stable part of the Unicode character database.
I wouldn’t like to type in Japanese with 8-bit fonts, changing the font
for every new glyph. I tried it once on Windows 3.1… It was very

>> Yes there may remain some issues with older OSes that have limited
>> support for standard OpenType layout tables. But there's now no
>> problem at all since Windows XP SP2. Windows 7 has the full
>> support,
>> and for those users that have still not upgraded from Windows XP,
>> Windows 8 will be ready in next August with an upgrade cost
>> of about
>> US$ 40 in US (valid offer currently advertized for all users
>> upgrading
>> from XP or later), and certainly even less for users in India
>> and Sri
>> Lanka.
>> The above are not any of my complaints.
>> Per Capita Income in Sri Lanka $2400. They are content with cell
>> phones. The practical place for computers is the Internet Cafe.
>> Linux is what the vast majority needs.
>> And standard Unicode fonts with free licences are already
>> available
>> for all systems (not just Linux for which they were initially
>> developed);
>> Yes, only 4 rickety ones. Who is going to buy them anyway?
> Why would you buy them if they’re free?
> Brilliant!
>> Still Iskoola Pota made by Microsoft by copying a printed font
>> is the best. You check the Plain Text by mixing Singhala and
>> Latin in the Arial Unicode MS font to see how pretty Plain text
>> looks. They spent $2 or 20 million for someone to come and teach
>> them how to make fonts. (Search Staying friendly with
>> them is profitable. World bank backs you up too.
>> Sometime in 1990s when I was in Lanka, I tried to select a PC for
>> my printer brother. We wanted to buy Adobe, Quark Express etc.
>> The store keeper gave a list and asked us to select the programs.
>> Knowing that they are expensive, I asked him first to tell me how
>> much they cost. He said that he will install anything we wanted
>> for free! The same trip coming back, in Zurich, the guys tried to
>> give me an illicit copy of Windows OS in appreciation for
>> installing German and Italian (or French?) code pages on their
>> computers.
>> there even exists solutions for older versions of iPhone
>> 4. OR on Android smartphones and tablets.
>> Mine works in them with no special solution. It works anywhere
>> that supports Open Type -- no platform discrimination
> Is there any platform discrimination with Unicode Sinhala?
> You mean Apple / Windows / Linux?
I mean what you meant one line above.

> Not really, but Microsoft was ahead of others. They all just support
> the crippled system.
>> No one wants to get back to the situation that existed in the
>> 1980's
>> when there was a proliferation of non-interoperable 8 bit
>> encodings
>> for each specific platform.
>> I agree. Today, 14 languages, including English, French, German
>> and Italian all share the same character space called ISO-8859-1.
> In fact, ISO-8859-1 is not well suited for French (my native
> language): it lacks a few letters which were added to ISO-8859-15.
> However, I always use Unicode today, even for French-only texts.
> Jean, you are lucky because you use Latin letters. Latin letters are
> always bare individual letters. Sinhala is not so. It has all these
> other shaping complications and special rules are applied per Complex
> language.
> I think you could appreciate my dilemma. This is how I see it. Going
> outside ISO-8859-1 is lot of trouble.
Which ones?

> Should I enumerate them to you?
Why not?

> Just see the daily questions and dedicated section for Indic at
>, and think why ordinary people Anglicize instead of using
> Unicode Sinhala. (e.g. <>).
Some also use the Sinhalese script.
I’ve sometimes seen people type in Arabic with Latin letters in a French
Library, because the computers they used only had French keyboards and
they didn’t know an Arabic keyboard well enough to touch type in Arabic with
Arabic letters.

> It's a colossal failure!

> The people Anglicize rather than using Unicode Sinhala.
What do you mean? If they transliterate, that’s not really anglicization.

> To be fair, the Lankan technocrats did not have a clue when they were
> asked to approve the standard.
I know that problem. The same occurred for French with Latin-1. That’s
why some French letters are missing in Latin-1.
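
A quick way to see the gap, sketched in Python with the standard codecs (the helper name encodable and the sample word are chosen only for this example): the French ligature œ cannot be encoded in ISO-8859-1, but it can in ISO-8859-15 and in UTF-8.

```python
def encodable(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

word = "c\u0153ur"  # "cœur", written with the œ ligature (U+0153)

for enc in ("iso8859_1", "iso8859_15", "utf-8"):
    print(enc, encodable(word, enc))
```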

> It was a time when there was (perhaps even now) a typist in the corner
> of the office of the bureaucrat. The big guys do not know touch-typing
> even now. Proof: A university professor wrote me a harangue using
> cyber-sex orthography (no capitals) accusing me of working for
> Americans. I had suggested that Unicode is a conspiracy to confuse us.
> (That is a bit way over, no such motive, nevertheless the effect is
> the same)
>> Romanized Singhala uses the same. So, what's the fuss about? The
>> font?
The fact that your encoding won’t be supported on many computers worldwide.

> The problem is that only your transliteration scheme, with Latin
> letters, is supported by ISO-8859-1, not the Sinhalese letters
> themselves.
> You are right partially. I do not need permission from anyone to use
> any font.
> Jean, the computer thinks it is ISO-8859-1. ISO 8859-1 is only a set
> of numbers!
No. It also is a table which makes relations between those numbers and a
well defined set of characters, mainly Latin letters.

> [128-255]. Don't get stuck with the names of the codepoints. The
> stupid computer cannot read their names. What travels the network are
> the bytes in their bare form.
OK. But your 8-bit encoding won’t be ISO-8859-1 a.k.a. Latin-1.
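
To make the point concrete, here is a minimal Python sketch. The byte values are simply the ISO-8859-1 bytes of the first word with a special letter in the romanized sample, "ðeruva"; nothing here is taken from the actual font. Decoded according to the ISO-8859-1 table, those bytes are Latin letters, whatever glyphs a font later draws for them.

```python
import unicodedata

# Six bytes as they would travel over the network.
raw = bytes([0xF0, 0x65, 0x72, 0x75, 0x76, 0x61])

# ISO-8859-1 maps each byte to one well-defined character.
text = raw.decode("iso8859_1")
names = [unicodedata.name(ch) for ch in text]

print(text)
for byte, name in zip(raw, names):
    print(f"0x{byte:02X} -> {name}")
```

Every name the table returns is a Latin letter name, which is what any software that is told "this is ISO-8859-1" will assume for searching, sorting, spell checking and so on.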

> When they are viewed, the user has the choice (theoretically) to
> select the font.
On most browsers it is easy to change the encoding. Changing the font is
less obvious.

> Stop your imagination and do this. Go to this site:
> <>
> On the right-hand-side column (directly below the lion), there is a
> link in a light-blue box that says, "Latin Script". Click on that and
> get rid of the dreaded Singhala script and be happy. What you see is
> not Icelandic. It is romanized Singhala. And if you want to really
> read it, click on the next link below and see the pronunciation key.
> I have requested a fellow to translate at least the page on Unicode to
> English. Hopefully, he does it quick.
>> Consider that as the oft suggested IME. Haha!
>> And your solution also does not work in multilingual contexts;
>> If mine does not work in some multilingual context, none of the
>> 14 languages I mentioned above, including English and French, do
>> either.
> They do because they use Latin letters, not Sinhalese letters.
> English, French and romanized Singhala do not work on multilingual
> contexts. You are confusing letters and codepoints. Letters are
> provided by FONTS in the user interface in the LOCAL device.
> By the way, how do you localize in France?
On the Linux computer I’m using right now, I use a UTF-8 locale
(fr_BE.UTF-8). And I’m not in France. Other common locales are Latin-1,
and Latin-9. They are disappearing, but that’s a long process.

> Do you know that the English writing was romanized when the English
> people were forced into Christianity?

> The only truly surviving English letter is þorn (þ).
It comes from ᚦ. But there’s also some similarity between ᚠ and F, ᚱ and
R, ᚳ and K, ᚻ and H, ᛁ and I, ᛋ and S (with a rotation), ᛏ and T, ᛒ and
B, ᛗ and M, so Þ is not really the only surviving Futhorc letter.

>> it does
>> not work with many protocols or i18n libraries for applications.
>> i18n is for multi-byte characters. Mine are single-byte characters.
> OK. Do it as you want, but it won’t be Unicode compliant.
> Thank you for your generosity, sire. I waited all this long for it. (I
> am kidding).
>> As you see, the safest place is SBCS.
> I don’t see. Why is it safer?
> Just compare romanized Singhala and Unicode Sinhala.
> First, the display of the script is not guaranteed. You get
> Character-not-found rows if you do not have the font.
And you get Latin letters if you don’t have your special font.

> Then you see garbage with letters and signs mixed up if you did not
> update your font renderer (e.g. uniscribe).
Is your font independent from Uniscribe for Windows users?
Where can I download your font?

> (Only Windows 7 comes with latest Uniscribe). Different fonts have
> different levels of letter construction, and some have wrong letters
> for wrong codepoints.
Is that a problem with Unicode or a problem with the font perhaps made
by an incompetent person?

> This is how it is in iPhone.
> When you transport Romanized Singhala, you do not need to re-encode it
> (e.g. UTF-8) for the purpose and bloat it. There is not even an HTML
> editor for it. You need to re-write all well established and seasoned
> applications using updated compilers that added wide-character functions.
> Here is a test for you. The following is a Unicode Sinhala paragraph
> (a random copy from the web site news web site
> (Sunday issue). Your computer must be Plain Text ready for this. I bet
> it is not.
> ළමයින්ගෙ අධ්‍යාපනය කඩාකප්පල් වෙනව තමයි. ඒත් ඉතිං මොකද කරන්නෙ? රටේ ආණ්‌ඩුවට
> පණිවිඩයක්‌ දෙන්න ස්‌ට්‍රයික්‌ නැතුව බැරිවීම අවාසනාවන්ත තත්ත්වයක්‌. මේ ප්‍රශ්න සමූහය දැන්
> තීරණාත්මක තැනකට ඇවිත් තියෙනව.
> 1 Copy it to Notepad
That’s a Windows software. Can I use Gedit instead?

> 2. From Notepad, copy it to a new MS Word page
That’s a Windows software. Can I use LibreOffice Writer instead?

> 3. Copy what you pasted into Word back to Notepad below the original
> 4 Copy that second one from Notepad back to Word below the one it
> already has
> Observe that MS Word altered the codepoints in the underlying text runs.
Could you make a few screenshots?
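
Failing screenshots, a dump of the code points before and after each copy would settle it on any platform. The Python sketch below is only an illustration: the helper names are made up, and the "after" string is a hypothetical result in which a zero-width joiner was dropped, not something observed in Word.

```python
def codepoints(text: str):
    """List the code points of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]

def first_difference(a: str, b: str):
    """Return the index of the first differing code point, or None if identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        return min(len(a), len(b))
    return None

# Sample: consonant, al-lakuna, zero-width joiner, consonant.
before = "\u0DB0\u0DCA\u200D\u0DBA"
# Hypothetical altered copy with the joiner removed (an assumption for the demo).
after = "\u0DB0\u0DCA\u0DBA"

print(codepoints(before))
print(codepoints(after))
print(first_difference(before, after))
```

Pasting the original paragraph and the round-tripped one into the two variables would show whether, and where, the underlying code points really changed.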

>> Or it
>> requires specific constraints on web pages requiring complex
>> styling
>> everywhere to switch fonts.
>> Did you see
>> <>? May be you are confusing Unicode
>> Sinhala and romanized Singhala. Unicode Sinhala has a myriad such
>> problems.
> Which problems?
> See above including the test.
A test I can’t reproduce because I don’t have a copy of M$ Word.

>> That is why it should be abandoned!
> Why wouldn’t you try to solve the problems, whatever they could
> be, instead of proposing an entirely new character set nobody will
> support?
> There are only two solutions. ONE: Completely redefine the Singhala
> code block .
Impossible. What is already encoded cannot be changed. But new
characters could be added.

> TWO: Just abandon it and use the transliteration. Why go through the
> trouble to satisfy fellows like you who do not use Singhala anyway?
Do it if you like. I’m not sure your fellow countrymen will follow you.

> If the rendering engines don’t work as you expect they should, how
> could a new encoding scheme solve the problem?
> The rendering engine works just fine! It is the code block that is sick.
It is not impossible that it needs treatment.

> You are way off base, buddy.

>> Please look at the web site and say it more coherently, if I
>> misunderstood you.
>> Plain text searches in multilingual pages
>> won't work. Usability tools won't work.
>> Have you tried to search a vowel in Unicode Sinhala? Romanized
>> Singhala has no search problem. Try it in the my web site.
> Well, perhaps there’re problems with search engines.
> Haha! I am not talking about search engines. I am talking about text
> processing.
What’s the difference? Search engines like Google look for a string of
characters on the WWW, to say it simply, while your favorite word
processor’s search function looks for it in a single document, but
that’s still a piece of software to update.

> I am sorry but talking to you is like the Singhala saying, "biiri
> aliyaata veenaa gahanavaa vagee." -- Like playing the violin for the
> deaf elephant.
Isn’t the player deaf too?
> Wouldn’t it be possible to correct search engines instead of
> inventing a new character set?
> You need to go back to school. There is no new character set. A
> Unicode character is just a numeric code. Unicode character database
> goes from zero to some very big number. There are no holes in it to
> define character sets for somebody's fancy. Well, Doug Ewell did one
> for Esperanto expanding fuþorc. We need to do something practical, and
> I did it already.
When you use a new 8-bit encoding for your Sinhalese font, that is a new
character set. And it has nothing to do with Unicode. No need to go back
to school to understand it.

>> Really consider abandoning the hacked encoding of the Sinhalese
>> script itself.
>> There is no re-encoding of Singhala. Singhala is transcribed into
>> Latin! When I say Singhala, I don't mean Unicode Sinhala. It is
>> the Singhala phoneme inventory that was transliterated.
> Using Latin letters for a transliteration of Sinhala is not a
> hack, but making fonts said to be Latin-1 with Sinhalese letters
> instead of the Latin letters is a hack.
> Well, you can characterize the smartfont solution anyway you like. The
> problem for you is that it works!
That’s not a problem for me. Note that I can’t copy and paste it to a
text editor.

> Sorry for this Kindergarten lesson, but you should understand the role
> of the font. A font is a support application at the User Interface
> level. It is what the user decides to use to see underlying text runs
> in an application's view port. The same text one person reads at the
> computer in Arial others read in Helvetica. In the same manner, if I
> did not deliver the font with the web page, you will see it in some
> sans-serif font your computer has.
But I’ll see Latin letters instead of Sinhalese ones.

> It is something that happens locally in the device. When text moves
> between applications and between computers, they travel as numeric
> codes representing the text in the form of digital bytes. The computer
> can't say French from Singhala.
>> It will however be more valuable if you just
>> concentrate on creating a simpler romanization system. that
>> will use
>> standard Unicode encoding of Latin
>> This is exactly what I did. Have I been talking to someone who
>> did not know what he was evaluating?
> I think he was speaking of the transliteration, not of your hack.
> I hope the fellow reads the above response. I wish you guys lived
> close by here in US so that I could hold a special class to teach you
> how computers function.
I live in Belgium. What would you teach me?

>> (note that you are absolutely not
>> limited to the reduced ISO 8859-1 subset for Latin and that
>> there's
>> already a much richer set of letters, symbols and diacritics
>> for all
>> needs ; but here again this requires using Unicode and not
>> just ISO
>> 8859-1).
>> Oh, thank you for the generosity of allowing me use of the entire
>> Latin repertoire. You don't have to tell that to me. I have
>> traveled quite a bit in the IT world. Don't be surprised if it is
>> more than what you've seen. (Did you forget that earlier you
>> accused me of using characters outside ISO-8859-1 while claiming
>> I am within it? That is because you saw IAST and PTS displayed.
>> They use those wonderful letters symbols and diacritics you are
>> trying to tout. Is there a problem with Asians using ISO-8859-1
>> code space even for transliteration?
>> The bonus will be that you can still write the Sinhalese
>> language with a romanisation like yours,
>> Bonus?
>> but there's no need to
>> reinvent the Sinhalese script
>> Singhala script existed many, many years since before the English
>> and French adopted Latin.
> Did any body say it didn’t?
> He said reinvent the Singhala SCRIPT. The script is the script. I use
> the same script in a more complete and correct manner than any Unicode
> font even with my incomplete, rough design, proof-of-concept font.
>> What I did was saving it from the massacre going on with Unicode
>> Sinhala.
> Which massacre? What’s wrong with the Unicode support of Sinhala?
> Could you give details, please?
> I gave the details earlier in this response
Not enough details.

>> itself that your encoding is not even
>> capable of completely supporting in all its aspects (your system
>> only
>> supports a reduced subset of the script).
>> What is the basis for this nonsense?. (Little birds whispering in
>> the background. Watch out. They are laughing).
>> My solution supports the entire script, Singhala, Pali and
>> Sanskrit plus two rare allophones of Sanskrit as well. Tell me
>> what it lacks and I will add it, haha! One time you said I
>> assigned Unicode Sinhala characters to the 'hack' font. What I do
>> is assigning Latin characters to Singhala phonemes. That is
>> called transliteration. There are no 'contextual versions' of the
>> same Singhala letters like you said earlier.
>> Ask your friends what they have more than mine in the Singhala
>> script. Ask them why they included only two ligatures when there
>> are 15 such.
> Can’t you make a proposal or describe the missing letters?
> Let it rot in place. (Lankan government might need it to get loans
> from WB to feed the IT guys over there). I proved that it is not
> necessary. Romanizing takes care of it and the native readers can use
> the orthographic font if they want. Otherwise, they can use Latin
> script just like you and I do here. Remember that the font is a local
> decision. It need not go out of your computer and cause heart ache
> among people like you. The following is the first sentence at:
> oba kiyavana ðeruva heøa kramaya viðyaaþmakava haa
> vyaakaraµaanukuulava saðaa æþi nisaa, eya batahira yuroopiiya bhaaxaa
> parigaµakaya þula labana varaprasaaða elesama síhalataþ labaa ðeyi.
> I suggest you get with it and move on.
>> Ask them how many Singhala letters there are.
>> Even the legacy ISCII system (used in India) is better,
>> because it is
>> supported by a published open standard, for which there's a
>> clear and
>> stable conversion from/to Unicode.
>> My solution is supported by two standards: ISO-8859-1 and Open
>> Type. ISO-8859-1 is Basic Latin plus Latin-1 Extension part of
>> Unicode standard.
> It is not supported by ISO-8859-1. ISO-8859-1 is for Latin letters,
> not Sinhalese ones.
> It is worth your traveling to America to learn what is a character
> encoding. A character set is not anything you go and ask permission to
> use it. If you use it, you have used it.
I’ve just said your font is not supported by Latin-1 which defines an
encoding for some LATIN letters.

>> Bottom line is this: If Latin-1 is good enough for English and
>> French, it is good enough for Singhala too.
> No, because Sinhala is not written with Latin letters.
> Declarations like that won't work in a technical discussion. You need
> to explain. Singhala is a language. Singhala native SCRIPT is the
> traditional way it is written. When I write Jean I really entered the
> four code points: 74 101 97 and 110. When you write naena, you enter
> 110 97 101 110 and 97. We think the former is a name of a pretty girl

Which one? Jean? That’s a male name and only the first part of my first

> and the latter is a name I made up not in a particular language.
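
The numbers quoted above can be checked with a few lines of Python. This is only a sketch of the arithmetic; it also shows that, for these letters, the ISO-8859-1 bytes are the same numbers as the code points.

```python
# Code points of the two names as typed.
jean = [ord(ch) for ch in "Jean"]
naena = [ord(ch) for ch in "naena"]

print(jean)
print(naena)

# In ISO-8859-1 each of these characters is one byte with the same value.
jean_bytes = list("Jean".encode("iso8859_1"))
print(jean_bytes == jean)
```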
>> And if Open Type is good for English and French, it is good for
>> Singhala too.
> Of course.
> Thank you for that.
Received on Sun Jul 08 2012 - 22:24:47 CDT