Re: Unicode, SMS and year 2012 from Naena Guru on 2012-04-28 (Unicode Mail List Archive)

From: Naena Guru <naenaguru_at_gmail.com>
Date: Sat, 28 Apr 2012 20:22:45 -0500

Hi Cristian,

This is a bit of a deviation from the issues you raise, but it relates to
the subject in a different way.

The SMS char set does not seem to follow Unicode. How I see Unicode is as a
set of character groups: 7-bit, 8-bit (extends and replaces 7-bit), 16-bit,
and CJKV that uses some sort of 16-bit pairing. As Unicode says, they are
just numeric codes assigned to letters or whatever other ideas. It is the
task of the devices to decide what they are and show them.

You say that there are only two character sets in GSM: 7-bit, which is a
reassignment of codes to a select set of Latin letter shapes, and 16-bit for
the rest. It appears as if they decided that a certain set of letters is
common to some preferred markets, and that it is efficient to reassign the
established Unicode characters to these newly selected letter shapes. Had
they simply used the 8-bit ISO-8859-1 set, the number of characters per SMS
would be limited to 140 instead of 160. (Is that why Twitter limits the # of
chars to 140?) Of course, that would not have included some users whose
letters are 16-bit characters under Unicode.
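
As a rough check on those figures: assuming the usual 140-octet SMS payload
(1120 bits), the per-message limits follow directly from the width of one
character code. A minimal sketch in Python; the name chars_per_sms is only
illustrative and not taken from any standard or library.

```python
# Assumption: one SMS carries 140 octets of user data, with no
# concatenation header taking space away from the text.
SMS_PAYLOAD_OCTETS = 140
PAYLOAD_BITS = SMS_PAYLOAD_OCTETS * 8  # 1120 bits


def chars_per_sms(bits_per_char: int) -> int:
    """Number of fixed-width character codes that fit in one message."""
    return PAYLOAD_BITS // bits_per_char


if __name__ == "__main__":
    for name, bits in [
        ("GSM 7-bit", 7),
        ("8-bit (e.g. ISO-8859-1)", 8),
        ("UCS-2 16-bit", 16),
    ]:
        print(f"{name}: {chars_per_sms(bits)} characters")
    # GSM 7-bit: 160 characters
    # 8-bit (e.g. ISO-8859-1): 140 characters
    # UCS-2 16-bit: 70 characters
```

So the 160/140/70 figures are the same 1120 bits divided three ways.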

I made a comprehensive transliteration for the Singhala script
(Singhala+Sanskrit+Pali). It shows perfectly when 'dressed' with a
smartfont. The following are two web sites that illustrate this solution.
(Every character is ISO-8859-1, except for the occasional ZWNJ, which
actually should be the 8-bit NBH that somebody decided to leave undefined.
Use any browser except IE; IE does not understand OpenType.)
http://www.lovatasinhala.com (hand coded)
http://www.ahangama.com/ (WordPress blog)

All Indic languages could be transliterated this way. It makes Indic
similar to Latin-based European languages, with intuitive typing and
orthographic results, which Unicode Sinhala can't do. It takes about half
the bandwidth of the double-byte set to transmit. I just noticed that
transliterated Singhala would not be fully covered with SMS 7-bit because
some Unicode 8-bit characters are not in this set.
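
Both points can be checked with a small Python sketch. The tables below are
my transcription of the GSM 7-bit default alphabet and its extension table;
the helper names and the sample string are hypothetical (the sample is just
a few ISO-8859-1 letters, not actual transliterated Singhala).

```python
# GSM 7-bit default alphabet, in code order 0x00..0x7F (0x1B is the
# escape to the extension table), transcribed by hand.
GSM7_BASIC = (
    "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞ\x1bÆæßÉ"
    " !\"#¤%&'()*+,-./0123456789:;<=>?"
    "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§"
    "¿abcdefghijklmnopqrstuvwxyzäöñüà"
)
# Characters reached through the escape code (each costs two septets).
GSM7_EXTENSION = "\f^{}\\[~]|€"


def outside_gsm7(text: str) -> list[str]:
    """Return the distinct characters of text that GSM 7-bit cannot carry."""
    allowed = set(GSM7_BASIC) | set(GSM7_EXTENSION)
    return sorted({ch for ch in text if ch not in allowed})


def encoded_sizes(text: str) -> tuple[int, int]:
    """Byte lengths of text as ISO-8859-1 and as 16-bit code units."""
    return len(text.encode("iso-8859-1")), len(text.encode("utf-16-be"))


if __name__ == "__main__":
    sample = "aðá"  # hypothetical sample of ISO-8859-1 letters
    print(outside_gsm7(sample))   # ['á', 'ð']
    print(encoded_sizes(sample))  # (3, 6)
```

Any character reported by outside_gsm7 would force the whole message into
the 16-bit encoding, which is the coverage gap mentioned above.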

Looking at my iPhone, I see that the International icon brings up
key-layout plus font pairs. I think what they should do is to separate
fonts and key-layouts. This way, the user could select the key layout for
input and whatever font they want to use to show it. The next thing I am
going to say made many readers here very angry, but may I say it again?

The idea of a Last Resort Font that makes basic editors Plain Text is a ploy
to brag that the computer can show all the world's languages, most of which
you cannot read anyway. The text runs of foreign languages should show as a
series of Glyph Not Found characters or the specific hint glyph of a
language. The user of a foreign language would know where to download fonts
of their native language. In the small market of Singhala, no font is
present that goes typographically well with Arial Unicode. There is no
incentive or money to make beautiful fonts for a minority language like
Singhala. The plain text result for Singhala is ugly. The OS makers
unnecessarily made hodge-podge Last Resort Fonts.

I hope both the mobile device industry and the PC side separate fonts and
characters and allow the users to decide the default font sets in their
devices. This is eminently rational because the rendering of the font
happens locally, whereas the characters travel across the network. This
will also help those who, like me, understand that their language is
better served by a transliteration solution than a convoluted double-byte
solution that discourages native speakers from using their script.

Actually, this is causing bilingual Singhalese to abandon their native
language. The government is placing special emphasis on English, as Singhala
is terribly difficult to use in the modern setting. This is a grave problem
for a society with a near-100% literacy rate and just a few million people.

On Fri, Apr 27, 2012 at 3:06 AM, Cristian Secară <orice_at_secarica.ro> wrote:

> A few years ago there was a discussion here about Unicode and SMS
> (Subject: Unicode, SMS, PDA/cellphones). Then and now the situation is
> the same, i.e. an SMS text message that uses characters from the GSM
> character set can include 160 characters per message (stream of 7 bit ×
> 160), whereas a message that uses everything else can include only 70
> characters per message (stream of UCS2 16 bit × 70).
>
> Although my language (Romanian) was and is affected by this
> discrepancy, at the time I was skeptical about the possibility of
> improving anything in this area, mostly because back then both the PC and
> mobile markets suffered from other language problems that were critical
> for me (like missing glyphs in fonts, or improper keyboard implementation).
>
> Things evolved and now the prospects are much better. Regarding
> SMS, at that time Richard Wordingham pointed out that SCSU might be a
> proper solution for the SMS encoding [when it comes to non-GSM
> characters].
>
> Recently I studied as many aspects as I could of SMS
> standardization, as part of an effort I started approx. a year ago
> regarding SMS language discrimination, which arises purely from the
> difference in message length and cost for the same sentence written with
> diacritical marks (written correctly for that language) or without
> diacritical marks (written incorrectly for that language). Or, for the
> same reason, language discrimination between (say) a French message and
> (say) a Romanian message, both written correctly.
>
> It turned out that they (ETSI & its groups) created a way to solve the
> 70-character limitation, namely the “National Language Single Shift” and
> “National Language Locking Shift” mechanisms. This is described in the
> 3GPP TS 23.038 standard and was introduced in Release 8. In short, it
> is a character substitution table, applied per character or per message
> and defined per language.
>
> Personally I find this to be a stone-age-like approach, which in my
> opinion does not work at all if I enter the message from my PC keyboard
> via the phone's PC application (because the language cannot always be
> predicted, mainly if I am using dead keys). It is true that the actual
> SMS stream limit is not very generous, but I wonder if SCSU would
> have been a better approach in terms of i18n. I also don't know if
> SCSU requires a language to be declared beforehand, or whether it simply
> works out by itself the required window for each character.
>
> SCSU seems to be OK for my language, or Hungarian, or
> Bulgarian, etc., but is it also OK for non-Latin and non-Cyrillic
> scripts? This versus the language shift mechanism, which is still 7
> bit. Release 10 of that standard includes language locking shift tables
> for Turkish, Portuguese, Bengali, Gujarati, Hindi, Kannada, Malayalam,
> Oriya, Punjabi, Tamil, Telugu and Urdu.
>
> Is there someone with more experience on this?
>
> Thank you,
> Cristi
>
> --
> Cristian Secară
> http://www.secarica.ro
>
>
Received on Sat Apr 28 2012 - 20:27:51 CDT
