Re: XML Parser for Unicode Big Indian font MSWord document

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Mon Jan 19 2004 - 14:06:07 EST

    N. Ganesh Babu wrote:
    > I having XML file in Unicode-Big Indian font created in MS Word. Please

    I believe you mean that you have chosen to save a document in the "Unicode Big Endian" encoding
    scheme, formally known as UTF-16BE. An encoding is different from a font.
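
To make the encoding/font distinction concrete, here is a minimal sketch in Python (chosen only for brevity; it is not tied to Word or to any parser named in this message). The helper name encode_forms is hypothetical. It shows that the same characters become different byte sequences under different encoding schemes, regardless of the font used to display them.

```python
def encode_forms(text):
    """Return the bytes of `text` under three Unicode encoding schemes.

    None of these calls adds a Byte Order Mark; the bytes are the bare
    encoded characters only.
    """
    return {
        "UTF-16BE": text.encode("utf-16-be"),  # most significant byte first
        "UTF-16LE": text.encode("utf-16-le"),  # least significant byte first
        "UTF-8": text.encode("utf-8"),
    }


if __name__ == "__main__":
    # "<" is U+003C, the first character of most XML documents.
    for name, data in encode_forms("<").items():
        print(name, data.hex())
```

In the big-endian form the zero byte comes first ("big end" first), which is where the name comes from.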

    > let me know whether we can parse the XML file as it is with the MS Word?
    > If yes please let me know the parser name.

    Every XML parser that conforms to XML 1.0 must be able to read both UTF-8 and UTF-16. UTF-16 is
    detected most reliably when the document begins with a Byte Order Mark (BOM). I believe that Word
    writes the BOM when you save as "Unicode" or "Unicode Big-Endian".
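
If you want to check whether a saved file really starts with a BOM, a rough sketch follows. It uses the BOM constants from Python's standard codecs module; the function name sniff_bom is hypothetical, and the check deliberately ignores UTF-32 (whose little-endian BOM starts with the same two bytes as the UTF-16LE one), so treat it as an illustration rather than a complete detector.

```python
import codecs


def sniff_bom(data):
    """Guess the encoding scheme from a leading Byte Order Mark.

    Returns "UTF-8", "UTF-16BE", "UTF-16LE", or None when no BOM is found.
    Assumption: UTF-32 input is out of scope and is not distinguished here.
    """
    if data.startswith(codecs.BOM_UTF8):        # EF BB BF
        return "UTF-8"
    if data.startswith(codecs.BOM_UTF16_BE):    # FE FF
        return "UTF-16BE"
    if data.startswith(codecs.BOM_UTF16_LE):    # FF FE
        return "UTF-16LE"
    return None


if __name__ == "__main__":
    sample = codecs.BOM_UTF16_BE + "<doc/>".encode("utf-16-be")
    print(sniff_bom(sample))
```

To test a real file, read its first few bytes in binary mode and pass them to the function.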

    Java 1.4 contains an XML parser.
    The Apache project provides the Xerces parser.
    There are many others.
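
As an illustration that such a file needs no conversion before parsing, here is a small sketch using the XML parser in Python's standard library (xml.etree.ElementTree) rather than the Java or Xerces parsers mentioned above. The sample document and the helper name build_utf16be_document are made up for the example; the bytes are constructed in memory as a stand-in for a file saved as "Unicode Big-Endian", which I assume begins with a BOM.

```python
import codecs
import xml.etree.ElementTree as ET


def build_utf16be_document(xml_text):
    """Encode `xml_text` as UTF-16BE and prepend the big-endian BOM."""
    return codecs.BOM_UTF16_BE + xml_text.encode("utf-16-be")


# Hypothetical sample content; any well-formed XML would do.
SAMPLE = '<?xml version="1.0" encoding="UTF-16"?><doc><p>नमस्ते</p></doc>'

if __name__ == "__main__":
    data = build_utf16be_document(SAMPLE)
    # The parser receives raw bytes and works out the byte order from the BOM.
    root = ET.fromstring(data)
    print(root.tag, root.find("p").text)
```

The same idea should carry over to other conforming parsers: hand them the bytes (or the file) and let them detect the encoding, instead of decoding the text yourself first.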

    Spelling tip: big-endian, not "indian". From "end".
    See http://www.unicode.org/faq/utf_bom.html

    Encoding etc.:
    http://oss.software.ibm.com/icu/docs/papers/forms_of_unicode/
    http://www.unicode.org/reports/tr17/

    I hope this helps,
    markus

    -- 
    Opinions expressed here may not reflect my company's positions unless otherwise noted.
    

