Re: Tajik alphabet code

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Mar 01 2004 - 20:22:38 EST


    From: "Peter Kirk" <peterkirk@qaya.org>
    > Windows 2000/XP and Office need no adaptation, just fonts and keyboards.
    > Well, the menus do need localisation, and obviously that is a
    > significant issue (although I guess most Tajiks know or can easily learn
    > the Russian for "File", "View", "Help" etc).
    >
    > Issues of localisation of non-Unicode software are off topic for this
    > list, surely.

    Here again, localization of the interface is not the main issue. I agree
    that a program interface in Russian would work for most Tajik people. The main
    problem is the documents that people create themselves in their own language,
    for their own use and for interchange with others.

    This covers all the various tools used to create personal web pages, send
    email, and exchange instant messages, but also to create printed documents for
    snail mail, publish books, write papers, feed databases...

    It also covers the various databases that have been created with encodings
    well adapted to Russian but not necessarily to Tajik, and the difficulty of
    interchanging this legacy data and using it with the tools people have. The
    main problem here is not in the standard office programs but in
    business-specific software, which may have been developed to Russian standards,
    or with legacy tools written by lazy US programmers who only considered
    English and a few Western European languages and forgot about the variants of
    the Cyrillic alphabet.

    To keep using this software, which is still needed but difficult or expensive
    to adapt, data must be merged from various sources that may have used several
    "personal" 8-bit encodings, each usable only in some limited domain, and
    transcoded into a common, well-accepted 8-bit encoding. Suppose this common
    encoding is the ISO-8859 Cyrillic charset (ISO-8859-5): some Tajik characters
    present in the legacy data will then have no mapping, and the data may be
    altered. This may be a serious issue if the data has some legal value, or is
    used to identify persons, services or trademarks.
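
    A minimal sketch of that data loss, in Python, assuming the merged data has
    already been decoded to Unicode and that the common target is ISO-8859-5. The
    function name unmappable_chars and the sample word are illustrative only.

```python
def unmappable_chars(text, encoding="iso8859_5"):
    """Return the distinct characters of text that the target encoding lacks."""
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            # The target 8-bit charset has no code for this character.
            if ch not in missing:
                missing.append(ch)
    return missing


# "Tajikistan" in Tajik: U+04B7 (che with descender) is outside ISO-8859-5.
word = "Тоҷикистон"
lost = unmappable_chars(word)

# A lenient transcoder silently alters the data instead of failing.
altered = word.encode("iso8859_5", errors="replace").decode("iso8859_5")
```

    Here lost holds the single letter that cannot be represented, and altered
    shows what a lenient conversion would store in its place: a question mark,
    with no trace of the original letter.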

    Going to Unicode is of course the longer-term target, but for now a lot of
    8-bit processing will remain in software and devices until they are replaced
    with more modern ones. In fact I think Western programmers will stay lazy
    for a very long time, until classic C or C++ development is completely
    deprecated, and will continue to produce software that processes only
    single-byte coded characters, simply because the OSes they use themselves
    process only 8-bit coded chars in their APIs, notably in POSIX services and
    Linux/Unix kernels where a "char" is a byte, as well as in many open Internet
    protocols. Using UTF-8 is a solution, but not the simplest one for
    programmers: they are lazy in the code they produce and test, and will too
    often forget the code needed to handle multibyte sequences correctly,
    notably where security issues such as possible buffer overruns are involved.
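
    As one small example of the extra care multibyte sequences need, here is a
    sketch in Python of cutting a UTF-8 byte string to fit a fixed-size buffer
    without splitting a character. It assumes the input is valid UTF-8; the name
    truncate_utf8 is hypothetical.

```python
def truncate_utf8(data: bytes, limit: int) -> bytes:
    """Cut data to at most limit bytes, ending on a character boundary."""
    if len(data) <= limit:
        return data
    cut = limit
    # data[cut] is the first byte that would be dropped. If it is a
    # continuation byte (bit pattern 10xxxxxx), the cut falls inside a
    # multibyte sequence, so back up to the start of that sequence.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut]


# Every Cyrillic letter takes two bytes in UTF-8, so a 5-byte limit
# falls in the middle of the third letter.
encoded = "Тоҷик".encode("utf-8")
safe = truncate_utf8(encoded, 5)
naive = encoded[:5]
```

    The naive slice ends with a dangling lead byte and no longer decodes as
    UTF-8, while safe keeps only the first two complete letters.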

    I took the case of Tajik, but the same may be true of every language that needs
    more than the ISO-8859-1 repertoire. In many cases a standardized ISO-8859
    variant may help solve the immediate problem found in many countries, with the
    notable exception of China, Korea and Japan, which always need large character
    subsets and where programmers are used to processing MBCS sequences (including
    UTF-8 for Unicode) correctly rather than lazily.


