Re: Unicode, SMS and year 2012 from Mark Davis ☕ on 2012-04-27 (Unicode Mail List Archive)

From: Mark Davis ☕ <mark_at_macchiato.com>
Date: Fri, 27 Apr 2012 12:26:25 -0700

Actually, if the goal is to get as many characters in as possible, Punycode
might be the best solution. That is the encoding used for internationalized
domain names. In that form it restricts itself to a small set of byte values
(ASCII letters, digits, and hyphen), but a parameterization allows use of all
byte values.
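
For a rough feel of the sizes involved, here is a minimal sketch using
Python's built-in "punycode" codec. It implements the standard RFC 3492
parameters (ASCII letters, digits, and hyphen only), not an all-byte-values
parameterization, which the codec does not offer. The helper name
encoded_sizes and the sample string are illustrative assumptions.

```python
# Sketch: compare the byte length of a short string under Punycode
# (standard RFC 3492 parameters, as used for domain names), UTF-8 and UTF-16.

def encoded_sizes(text: str) -> dict:
    """Return the encoded length of `text`, in bytes, per encoding."""
    return {
        "punycode": len(text.encode("punycode")),
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-be")),  # no byte-order mark
    }

sample = "bücher"
print(sample.encode("punycode"))  # b'bcher-kva'
print(encoded_sizes(sample))
```

With the domain-name parameters the output is limited to 36 digit values, so
for text this short it is not necessarily smaller than UTF-8. Any gain would
have to come from a parameterization with a larger digit set.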

------------------------------
Mark <https://plus.google.com/114199149796022210033>
*— Il meglio è l’inimico del bene —*

On Fri, Apr 27, 2012 at 11:21, Doug Ewell <doug_at_ewellic.org> wrote:

> Cristian Secară <orice at secarica dot ro> wrote:
>
> > It turned out that they (ETSI and its groups) created a way around the
> > 70-character limitation, namely the “National Language Single Shift” and
> > “National Language Locking Shift” mechanisms. These are described in the
> > 3GPP TS 23.038 standard and were introduced in Release 8. In short, they
> > are character substitution tables, defined per language and applied per
> > character or per message.
> >
> > Personally I find this to be a stone-age approach, which in my
> > opinion does not work at all if I enter the message from my PC
> > keyboard via the phone's PC application (because the language cannot
> > always be predicted, especially if I am using dead keys). It is true
> > that the actual SMS stream limit is not very generous, but I wonder
> > whether SCSU would have been a better approach in terms of i18n. I also
> > don't know whether SCSU requires a language to be declared in advance,
> > or simply works out by itself the window required for each character.
>
> I agree that treating character repertoire as simply a matter of
> language selection, and creating language-specific code pages, is a
> backward-looking solution. Not only is language tagging not always an
> option, as Cristian points out, but people don't want to be tied to the
> absolute minimum character repertoire that someone decided was necessary
> to write a given language, even in a text message. Just look at the rise
> of emoji in text messages.
>
> And, of course, I agree that SCSU would have been a much better
> solution. Most of the current arguments against SCSU wouldn't apply to
> SMS: the cross-site scripting argument wouldn't apply if SCSU were the
> only "extended" encoding, or if the protocol tagged it, and the
> complex-encoder argument wouldn't apply to any phone from the last 5
> years that can take pictures and shoot videos and scan bar codes and run
> numerous apps simultaneously. (SCSU doesn't require a complex encoder
> anyway, although it can benefit incrementally from one.)
>
> Interestingly, one of the first mentions I can find on the Unicode list
> of SCSU-like compression — actually a description of RCSU, the
> predecessor to SCSU — was in the context of SMS message compression:
>
> http://www.unicode.org/mail-arch/unicode-ml/Archives-Old/UML001/0242.html
>
> Neither RCSU nor SCSU quite fits the original bill, which was to
> represent Unicode in 7 bits per character (with some overhead) and thus
> achieve 160 characters per message. Both schemes use 8-bit code units.
> Still, 140 characters is much better than 70.
>
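
The 160 / 140 / 70 figures above all follow from the 140-octet payload of a
single SMS. A small sketch of the arithmetic (the function name and the
optional overhead parameter are illustrative assumptions):

```python
# Sketch: how many fixed-width code units fit in one 140-octet SMS payload.

PAYLOAD_OCTETS = 140  # user-data size of a single SMS, in octets

def max_code_units(bits_per_unit: int, overhead_octets: int = 0) -> int:
    """Code units that fit after reserving `overhead_octets` for tags/headers."""
    usable_bits = (PAYLOAD_OCTETS - overhead_octets) * 8
    return usable_bits // bits_per_unit

print(max_code_units(7))                      # 160 (GSM 7-bit default alphabet)
print(max_code_units(8))                      # 140 (8-bit code units, e.g. SCSU)
print(max_code_units(16))                     # 70  (UCS-2)
print(max_code_units(8, overhead_octets=2))   # 138 (SCSU after a 2-byte window tag)
```

These are counts of code units, not characters: an SCSU character outside
the active window costs more than one byte.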
> > SCSU seems to be OK for my language, or Hungarian, or
> > Bulgarian, etc., but is it also OK for non-Latin and non-Cyrillic
> > scripts? Compare this with the language shift mechanism, which is still
> > 7-bit. Release 10 of that standard includes language locking shift
> > tables for Turkish, Portuguese, Bengali, Gujarati, Hindi, Kannada,
> > Malayalam, Oriya, Punjabi, Tamil, Telugu and Urdu.
>
> SCSU works equally well, or almost so, with any text sample where the
> non-ASCII characters fit into a single block of 128 code points. For
> anything other than Latin-1 you need one byte of overhead, to switch to
> another window, and for many scripts you need two, to define a window
> and switch to it. But again, two bytes is not what's holding anyone up.
>
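
The overhead described above (zero, one, or two bytes) can be illustrated
with a deliberately minimal encoder. This is a sketch, not a full SCSU
implementation: it only handles text whose non-ASCII characters all fall in
one 128-code-point window aligned on a multiple of 0x80, it never uses
Unicode mode or quoting, and the function name scsu_single_window is an
assumption. The tag values and default window positions are those of
UTS #6.

```python
# Sketch: single-window SCSU encoding (single-byte mode only).

# Default positions of dynamic windows 0..7; window 0 is active initially.
DEFAULT_WINDOWS = [0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00]
SC0 = 0x10  # SCn = 0x10 + n: switch to predefined window n (1 byte)
SD0 = 0x18  # SD0 + offset byte: define window 0 and switch to it (2 bytes)
PASS_THROUGH_CONTROLS = {0x00, 0x09, 0x0A, 0x0D}  # NUL, TAB, LF, CR

def scsu_single_window(text: str) -> bytes:
    non_ascii = [ord(c) for c in text if ord(c) > 0x7F]
    out = bytearray()
    base = None
    if non_ascii:
        base = non_ascii[0] & ~0x7F  # 0x80-aligned start of the window
        if any((cp & ~0x7F) != base for cp in non_ascii):
            raise ValueError("non-ASCII characters span more than one window")
        if base == DEFAULT_WINDOWS[0]:
            pass  # Latin-1 Supplement: already active, no overhead
        elif base in DEFAULT_WINDOWS:
            out.append(SC0 + DEFAULT_WINDOWS.index(base))  # 1 byte of overhead
        elif 0x01 <= base // 0x80 <= 0x67:
            out += bytes([SD0, base // 0x80])  # 2 bytes of overhead
        else:
            raise ValueError("window position not handled by this sketch")
    for ch in text:
        cp = ord(ch)
        if 0x20 <= cp <= 0x7F or cp in PASS_THROUGH_CONTROLS:
            out.append(cp)                  # ASCII passes through unchanged
        elif cp > 0x7F:
            out.append(0x80 + (cp - base))  # offset into the active window
        else:
            raise ValueError("control character needs quoting, not handled here")
    return bytes(out)

for sample in ("café", "Привет", "αβγ"):
    print(sample, len(sample), scsu_single_window(sample).hex(" "))
```

Latin-1 text comes out byte-for-byte as Latin-1, Cyrillic costs one extra
byte because its window is predefined, and Greek costs two because the
window has to be defined first.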
> --
> Doug Ewell | Thornton, Colorado, USA
> http://www.ewellic.org | @DougEwell 
Received on Fri Apr 27 2012 - 14:29:41 CDT
