RE: NCR encode/decode, vs Unicode approach

From: Addison Phillips
Date: Mon Jun 19 2006 - 15:01:41 CDT


    If you switch to a Unicode implementation, the issues you encounter are
    likely to be far fewer than the problems you'll have to deal with when
    processing legacy (i.e., non-Unicode-encoded) strings with additional layers
    of encoding in them. Using a native Unicode encoding in the database, for
    example, lets you use the actual characters you are storing in your
    SQL queries and when indexing entries. Strings containing NCRs have to be
    processed in various ways, requiring quite complex code to detect and handle
    the NCRs. And you've already encountered variations and problems in encoding
    support going down this path.
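
    To illustrate the extra layer that NCRs add, here is a minimal Python
    sketch (not the poster's actual code; the helper names ncr_encode and
    ncr_decode are made up for illustration). It shows that an NCR-encoded
    string no longer contains the characters it represents, so a plain
    substring search has to go through an encode or decode step first.

```python
import html


def ncr_encode(text: str) -> str:
    """Replace every non-ASCII character with a decimal NCR (&#nnnnn;)."""
    # "xmlcharrefreplace" is a standard Python error handler for encoding.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


def ncr_decode(text: str) -> str:
    """Turn character references back into characters.

    Caveat: html.unescape also converts named references such as &amp;,
    and it cannot tell an NCR added by the front end from one a user typed.
    """
    return html.unescape(text)


original = "\u65e5\u672c\u8a9e"        # the three characters of "Japanese language"
stored = ncr_encode(original)

print(stored)                          # &#26085;&#26412;&#35486;
print(len(stored), len(original))      # 24 3
print("\u672c" in stored)              # False: the raw character is not in the stored form
print("\u672c" in ncr_decode(stored))  # True: it is found only after decoding
```

    The length difference also matters for column sizing: in this sketch each
    character occupies eight characters of storage once it is NCR-encoded.
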
    By contrast, changing your Web server to host pages using the UTF-8
    encoding, recoding the pages, and possibly including a UTF-8 <meta> tag in
    the header is the work of an afternoon. Migrating your database and fixing
    server-side code might still be an appreciable project (it depends on how
    internationalized your code is). However, when you are done, you'll actually
    be done. And nearly any problem you encounter switching to UTF-8 would
    apply equally to a combination of legacy encodings and NCRs; the difference
    is that the UTF-8 code is much easier to write.
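
    As a rough sketch of what recoding existing data might look like, the
    Python below decodes a legacy CP850 byte string, expands any decimal NCRs
    embedded in it, and re-encodes the result as UTF-8. The function name
    migrate_value and the sample data are assumptions made for illustration;
    a real migration would also need to handle hexadecimal NCRs and any text
    where "&#...;" was meant literally.

```python
import re

# Decimal NCRs only, in the &#nnnnn; form described in this thread.
_DECIMAL_NCR = re.compile(r"&#(\d+);")


def _expand(match: "re.Match[str]") -> str:
    code_point = int(match.group(1))
    # Leave the reference untouched if it is not a valid Unicode code point.
    if code_point > 0x10FFFF:
        return match.group(0)
    return chr(code_point)


def migrate_value(raw: bytes, legacy_encoding: str = "cp850") -> bytes:
    """Decode legacy bytes, expand decimal NCRs, and return UTF-8 bytes."""
    text = raw.decode(legacy_encoding)
    text = _DECIMAL_NCR.sub(_expand, text)
    return text.encode("utf-8")


# Hypothetical stored value: German text in CP850 plus NCR-encoded Japanese.
legacy = "Gr\u00fc\u00dfe &#26085;&#26412;".encode("cp850")
migrated = migrate_value(legacy)
print(migrated.decode("utf-8") == "Gr\u00fc\u00dfe \u65e5\u672c")  # True
```

    After such a pass the stored value holds the actual characters, and pages
    serving it only need to declare UTF-8 as their encoding.
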

    Addison Phillips
    Internationalization Architect - Yahoo! Inc.

    Internationalization is an architecture.
    It is not a feature.



    From: [] On
    Behalf Of Huo, Henry
    Sent: Monday, June 19, 2006 06:26
    To: ''
    Subject: NCR encode/decode, vs Unicode approach

    We are evaluating our legacy systems and would like to get you gurus' advice
    on the best approach to supporting multilingual web products.


    Currently, the legacy web applications run on Websphere5 and Sybase
    12.5, which is set up with CP850 for varchar and char.

    The Web front end does NCR encoding/decoding (&#nnnnn;) for double-byte
    characters, e.g. Japanese and Chinese characters, and no encoding/decoding
    for US-ASCII input.

    We are currently working on a plan to support all kinds of languages,
    including English, German (umlauts), Korean, Chinese, Japanese, etc. Could
    you please advise on the best approach? If we convert the Sybase
    database to use unichar/univarchar, then we need to change all of the legacy
    apps to use UTF-8 encoding/decoding, and the effort is huge. If we would like
    to keep the current CP850 char/varchar on the Sybase database side, should we
    encode/decode with NCRs (&#nnnnn;) for all Web applications handling
    different languages across different countries? Will NCR encoding/decoding
    support all languages without issues? We have already noticed some issues,
    such as invalid characters ("??") in the database.
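
    One plausible source of the "??" symptom, shown here as a Python sketch
    under the assumption that some layer converts text to CP850 without NCR
    encoding it first (the thread does not say where the conversion happens,
    and store_in_cp850 is a made-up name): characters outside the code page
    are replaced by "?" and cannot be recovered afterwards.

```python
def store_in_cp850(text: str) -> bytes:
    """Simulate a lossy conversion into a CP850 char/varchar column."""
    # errors="replace" substitutes "?" for every character CP850 cannot represent.
    return text.encode("cp850", errors="replace")


# German umlauts exist in CP850, so they survive the round trip.
german = store_in_cp850("Gr\u00fc\u00dfe")
print(german.decode("cp850") == "Gr\u00fc\u00dfe")  # True

# Two Japanese characters have no CP850 equivalent and become two "?".
print(store_in_cp850("\u65e5\u672c"))  # b'??'
```
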


    Thank you so much for your help; any input is highly appreciated.


    With best regards,

    - Henry
