Re: Numerals across code pages

From: Ed Trager (ed.trager@gmail.com)
Date: Fri Feb 19 2010 - 09:41:16 CST

    On Fri, Feb 19, 2010 at 6:57 AM, vainateya <vainateya@cdac.in> wrote:
    >
    > We are facing problems with operating with numerals from various code pages.
    >
    > Are there any available standards, to which applications, databases,
    > programming environments, etc. must comply,
    > while handling Numbers in different scripts,
    > including - processing Arithmetic operations, Logical operations with such
    > data.
    >

    I don't know of any standards that define how things "must" or
    "should" work, but
    based on your description below, what you are trying to do is
    certainly not difficult to achieve
    and can be quite useful.

    In a previous project written in C++, I implemented a "DigitConverter"
    class which functioned to
    normalize all digit characters present in strings to their ASCII
    equivalents which could then be
    processed by other classes such as the Number and Date classes. As
    you might expect, the
    class used static table lookup for efficiency.
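
    A minimal sketch of that kind of table-driven normalization, written in
    Python rather than the C++ of the original project. The names
    build_digit_table, DIGIT_TABLE and to_ascii_digits are hypothetical, and
    the table here covers only characters that Unicode classifies as decimal
    digits (general category Nd); CJK ideographic numerals such as "二" are
    not in that category and would need a separate table.

```python
import sys
import unicodedata

def build_digit_table():
    """Map every Unicode decimal digit (category Nd) to its ASCII equivalent."""
    table = {}
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if unicodedata.category(ch) == "Nd":
            table[cp] = str(unicodedata.decimal(ch))
    return table

# Built once at import time, so each conversion is a plain table lookup.
DIGIT_TABLE = build_digit_table()

def to_ascii_digits(text):
    """Replace decimal digits from any script with ASCII 0-9; leave the rest alone."""
    return text.translate(DIGIT_TABLE)
```

    With this table, to_ascii_digits("२०१०") should give "2010", and
    non-digit characters such as date delimiters pass through unchanged.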

    Because the project used Unicode strings, all the historical nonsense
    of "code pages" disappeared. I recall
    that we explicitly tested to ensure that date expressions in any
    combination of digits (Arabic-Indic, Arabic, Hindi, Thai, Chinese,
    etc.) were interpreted correctly.

    So the "input" data could be in any format (known to Unicode v.3 at that time).
    Internally, the digits in the input string were converted to ASCII prior
    to parsing by a given class (Number, Date, etc.), and
    the output from a Number or Date object could be displayed according
    to whatever the current locale was.

    I think we also had options on whether to use Western (ASCII) or local
    digits in formatted output from classes such as Number or Date -- it
    is certainly very common in places like Thailand and I assume also in
    India to use local digits in certain
    contexts (such as in dates or page numbers in books) but Western
    digits in other contexts (such as math problems or financial
    statements).
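
    The output side could look roughly like the following Python sketch,
    where the caller chooses between ASCII digits and a script's own digits.
    The ZERO table and format_digits are hypothetical names, and the table
    is only an illustrative subset of scripts.

```python
# Code point of the digit zero in a few scripts (an illustrative subset).
ZERO = {
    "ascii": 0x0030,
    "devanagari": 0x0966,
    "thai": 0x0E50,
}

def format_digits(value, script="ascii"):
    """Render an integer using the decimal digits of the given script.

    Non-digit characters such as a leading minus sign are kept as-is.
    """
    zero = ZERO[script]
    return "".join(
        chr(zero + int(ch)) if ch.isdigit() else ch
        for ch in str(value)
    )
```

    This relies on each script's decimal digits occupying ten consecutive
    code points starting at its zero, which holds for the Unicode decimal
    digit ranges listed above.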

    I remember I also had to add some extra intelligence into the Date
    class especially so that it would properly recognize various date
    formats. For example, a Chinese date like "二〇〇九年五月十日" might be
    correctly converted by the DigitConverter class to "2009年5月10日" but
    the originally naïve version of the Date class could only parse
    "2009-05-10" or "2009/5/10" as input. So I added some additional
    "intelligence" so that the Date class could parse almost any date in
    "YYYYdMMdDDd" format where the delimiter "d" might be anything from
    "-" to "年" or "月".
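
    As a rough illustration in Python (not the original C++ code), the
    delimiter-tolerant parsing might be sketched as below. CJK_DIGITS,
    normalize_cjk and parse_date are hypothetical names. The CJK step only
    handles simple numerals below 100, which is enough for months and days,
    and parse_date assumes any other digits have already been converted to
    ASCII.

```python
import re
from datetime import date

# Hypothetical table for CJK ideographic numerals, which are not category Nd.
CJK_DIGITS = {ord(c): str(i) for i, c in enumerate("〇一二三四五六七八九")}

def _tens(match):
    # "十" -> "一〇", "十五" -> "一五", "二十" -> "二〇", "二十五" -> "二五"
    tens = match.group(1) or "一"
    ones = match.group(2) or "〇"
    return tens + ones

def normalize_cjk(text):
    """Rewrite simple CJK numerals (below 100) as ASCII digits."""
    text = re.sub(r"([一二三四五六七八九]?)十([一二三四五六七八九]?)", _tens, text)
    return text.translate(CJK_DIGITS)

# Year, month and day as ASCII digit runs, each separated by one non-digit
# delimiter, with an optional trailing delimiter (e.g. the final "日").
DATE_RE = re.compile(r"^\s*(\d{4})\D(\d{1,2})\D(\d{1,2})\D?\s*$", re.ASCII)

def parse_date(text):
    """Parse a date in "YYYYdMMdDDd" form, where d is any non-digit character."""
    m = DATE_RE.match(text)
    if m is None:
        raise ValueError(f"unrecognized date: {text!r}")
    year, month, day = (int(g) for g in m.groups())
    return date(year, month, day)
```

    For example, parse_date(normalize_cjk("二〇〇九年五月十日")) and
    parse_date("2009-05-10") should both yield date(2009, 5, 10).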

    - Ed

    > case 1 - Basic math operations (+, - , / , * ) when both params are from non
    > ascii numerals
    >               -  is such processing allowed or not.
    >                - if digits across scripts are to be treated as
    > programmatically equivalent,  then result will be in which script?
    >
    > case 2 - logical comparison operators across scripts :
    >
    >                eg: if n < m do something
    >
    >                       //  (for simplifying, lets assume both as character
    > variables / single digits)
    >                       with n = 4 m = 5 should succeed
    >                       however it might fail if n is hindi numeral 4 and m is
    > english 5 (ASCII = 53), simply because of ascii / rather unicode value of
    > hindi numeral 4 is greater.
    >
    >               similarly -
    >                   input n,m
    >                       if m == 0
    >                           return (cannot divide by zero)
    >                       else
    >                           output n /m
    >
    >               This gets handled by environment where ascii value of  'm'
    > ASCII = 48 is compared not ASCII = 0 - but using which standard / rule does
    > the programming environment behave this way.
    >               What's the behaviour when m is a 0 from some other script /
    > code page.
    >
    >               Similar problems arise while sorting such data.
    >
    > case 3 - things get more complex if we are dealing with complex datatypes
    > such as date and time,
    >        eg : routine checks gets system time and checks if it past 5:00 pm
    >           but if the system / locale settings or related service, return the
    > time in local language, then condition may fail.
    >
    > Different programming environments and applications seem to treat such
    > scenarios differently, most seem to expect an application level code /
    > routine which maps to equivalent ASCII before further processing or database
    > select query.
    >
    > Are there any standards / locale settings that help in determining the
    > expected behaviour.
    >
    > regards,
    > Mr. Vainateya Koratkar
    > Team Coordinator,
    > C-DAC GIST Pune.
    >
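
    The comparison, division and sorting cases quoted above can be sketched
    in Python as follows. This is an illustration of comparing numeric
    values instead of code points, not a statement of any standard;
    digit_value and safe_divide are hypothetical names. It relies on
    Python's int() accepting decimal digits from any script.

```python
import unicodedata

def digit_value(ch):
    """Numeric value of a single decimal digit character from any script."""
    return unicodedata.decimal(ch)

n = "\u096A"  # DEVANAGARI DIGIT FOUR
m = "5"       # ASCII digit five (code point 53)

# Comparing the characters compares code points: U+096A > U+0035.
assert not (n < m)
# Comparing the numeric values gives the intended answer.
assert digit_value(n) < digit_value(m)

def safe_divide(n_text, m_text):
    """Divide two integers written in any script's decimal digits."""
    n_val, m_val = int(n_text), int(m_text)
    if m_val == 0:
        raise ZeroDivisionError("cannot divide by zero")
    return n_val / m_val

# Sorting by numeric value rather than by code point.
ordered = sorted(["\u0969", "2", "\u0967"], key=int)
```

    Here the zero check works for a zero from any script because the test
    is made on the converted integer, not on the character itself.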


