RE: Does OpenOffice 3.0 handle unicode?

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Mar 21 2009 - 13:45:14 CST


    > From: Petr Tomasek [mailto:tomasek@etf.cuni.cz]
    > Sent: Saturday, March 21, 2009 19:59
    > To: Philippe Verdy
    > Cc: 'Petr Tomasek'; Unicode@unicode.org
    > Subject: Re: Does OpenOffice 3.0 handle unicode?
    >
    >
    > On Sat, Mar 21, 2009 at 07:10:32PM +0100, Philippe Verdy wrote:
    > > > [mailto:unicode-bounce@unicode.org] On behalf of Petr Tomasek
    > > > Sent: Saturday, March 21, 2009 17:42
    > > > To: Unicode@unicode.org
    > > > Subject: Does OpenOffice 3.0 handle unicode?
    > > >
    > > >
    > > > Can someone please confirm whether the new version of OpenOffice
    > > > can handle Unicode? OpenOffice 2.0 unfortunately can handle only
    > > > the BMP, while I need characters from the SMP.
    > >
    > > That's quite a stupid question: if OpenOffice can "handle" the BMP
    > > characters, it means that it "handles" Unicode.
    >
    > OK, it was a bit of a provocation on my part, but hey,
    > supporting only the BMP nowadays should be considered buggy behaviour.
    >
    > > Apparently you are unaware that OpenOffice was designed with
    > > Unicode as a goal, and with file formats that require correct
    > > support of Unicode.
    > > This support has always been part of the file format specifications
    > > (that are based on XML files compressed within a zipped archive).
    > >
    > > I can perfectly open Chinese documents containing characters from the
    > > SIP, with OpenOffice (all versions, including those before 2.0).
    >
    > "My" OpenOffice 2.0.4 (on Linux) cannot handle anything but the BMP.
    >
    > If I copy text containing SMP characters into OOo, all I get
    > are two "boxes".
    > (Which makes me think OpenOffice handles UTF-16 as if it were
    > UCS-2 internally, or something like that...)
    >
    > > This is not a problem of OpenOffice version but of support of the
    > > display of the characters and scripts (for complex scripts) in the
    > > system's or application's renderer.
    >
    > AFAIK OpenOffice uses the ICU library on Linux. Other
    > programs built upon ICU (such as XeTeX) work without any
    > problem with SMP characters.
    >
    > > But if you don't have any font for those scripts you want to render
    > > and that are part of the SMP, all you'll get is a set of empty boxes.
    >
    > Of course I have fonts installed and other programs on my
    > system (such as those based on Pango - www.pango.org) show
    > them as expected.
    >
    > > So, on the same system, if I can open a document containing non-BMP
    > > characters with MS Office, I can as well open it with OpenOffice (or
    > > Sun StarOffice).
    >
    > On the same system I can open a document containing non-BMP
    > characters (in gedit, e.g.) but cannot open it using OpenOffice.
    >
    > So the conclusion: OpenOffice is broken and what you
    > wrote is quite stupid :)

    But you've demonstrated above (through your own contradictions) that your
    issue is system-specific and not related to the application itself. You
    did not specify which system you are using, but from what you just wrote,
    I think it is some Linux distribution: check your system for the relevant
    updates or support. You admitted that OOo uses ICU, and ICU is FULLY
    compatible with Unicode (in all planes).

    If your system displays two boxes, it's not a problem of OpenOffice but of
    the renderers installed on your system and the way they are installed (and
    made accessible to your OpenOffice installation).

    As for me, I have absolutely no problem with SIP characters in OOo, because
    I FIRST met the system requirements and installed the necessary support.
    Look at the installation instructions for your office installation and make
    sure that your system has the relevant fonts and that they are correctly
    accessible to your rendering libraries. Pango is not the only thing to
    check: you also need to check your X11 settings and some parameters of your
    locale, because the character encoding support on your graphical console
    partly depends on them and affects the way your system fonts are loaded and
    handled by your renderer.
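
    As a quick way to inspect the locale parameters mentioned above, here is a
    minimal sketch. It assumes the usual POSIX precedence (LC_ALL, then
    LC_CTYPE, then LANG) and only reports whether the effective locale name
    declares a UTF-8 codeset. The function names are hypothetical helpers, and
    the check says nothing about fonts or about how a given application reacts
    to the locale.

```python
import os


def effective_ctype(env):
    """Return (variable, value) for the locale setting that decides the
    character encoding, following POSIX precedence. Empty values count
    as unset. Falls back to the "C" locale when nothing is set."""
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:
            return name, value
    return None, "C"


def declares_utf8(locale_name):
    """True if a locale name such as 'cs_CZ.UTF-8' names a UTF-8 codeset.
    The codeset sits after '.' and before an optional '@modifier'."""
    if "." not in locale_name:
        return False
    codeset = locale_name.split(".", 1)[1].split("@", 1)[0]
    return codeset.lower().replace("-", "") == "utf8"


if __name__ == "__main__":
    variable, value = effective_ctype(os.environ)
    print(variable, value, declares_utf8(value))
```

    If this reports a non-UTF-8 codeset (or plain "C"), that is one of the
    settings worth fixing before blaming the application.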

    If your display settings do not report Unicode capability, all that
    OpenOffice or other apps can do is try to adapt to your display locale
    and map some characters to it, but there is no way for them to go beyond
    this limitation. The version of your X11 implementation (XFree86?) and its
    built-in support for Unicode fonts may also matter: if that support is not
    enabled, your X11 installation will just expose sets of fonts for several
    specific encodings, and your application will just try to adapt to one or a
    few of them when it converts a single internal code point (UTF-8 encoded in
    the XML document) into the target encoding used by your display.

    Seeing two boxes does not mean that your office app handles a single code
    point as two separate characters: see for yourself whether you can edit the
    text and remove only one of the two surrogates that make up a single code
    point. In OpenOffice, documents are UTF-8 encoded in the archived XML
    documents, so supplementary characters are encoded as a whole, without any
    surrogates. I am not able to select surrogates in isolation in the GUI,
    even for characters for which I don't have any suitable font, or for
    characters that are not yet encoded in Unicode but are accepted anyway,
    displayed as an empty box glyph, and properly kept unchanged when saving.
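
    To make the difference between the two encodings concrete, here is a
    minimal sketch that encodes one supplementary character both ways. The
    choice of U+10330 (GOTHIC LETTER AHSA, in the SMP) is only an example;
    nothing here inspects what OpenOffice does in memory.

```python
# U+10330 GOTHIC LETTER AHSA, a character from the SMP (plane 1).
ch = "\U00010330"

# UTF-8 encodes the code point directly, as one 4-byte sequence.
utf8 = ch.encode("utf-8")

# UTF-16 needs two 16-bit code units (a surrogate pair) for the same character.
utf16 = ch.encode("utf-16-be")
units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]

print(len(ch))                      # 1 (one code point)
print(utf8.hex(" "))                # f0 90 8c b0
print([f"{u:04X}" for u in units])  # ['D800', 'DF30']
```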

    If you can break a single character into one of its two surrogates, it just
    means that internally the application reads the UTF-8 encoded documents
    into memory by first converting them to UTF-16, but then ignores the UTF-16
    requirements (this is what happened in old applications that were just
    recompiled in C after changing "char" into "wchar_t"). But ICU was written
    to support all Unicode requirements (including not breaking a character in
    the middle), to correctly support all the classic text algorithms, and to
    support the new ones needed for internationalization (including complex
    scripts).
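
    The failure mode described above can be sketched as follows. This is only
    a model, assuming a text buffer held as a list of UTF-16 code units; the
    helper names are hypothetical, and it is neither ICU's nor OpenOffice's
    actual code. It contrasts a "delete last character" that treats every
    16-bit unit as a character with one that keeps surrogate pairs together.

```python
def to_utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def from_utf16_units(units):
    """Rebuild a string from code units. "surrogatepass" lets a lone
    surrogate survive decoding so that the damage stays visible."""
    data = b"".join(u.to_bytes(2, "little") for u in units)
    return data.decode("utf-16-le", errors="surrogatepass")


def delete_last_naive(units):
    """UCS-2 style: drop one code unit, possibly half of a character."""
    return units[:-1]


def delete_last_aware(units):
    """UTF-16 aware: drop both units when the buffer ends in a surrogate pair."""
    ends_in_pair = (
        len(units) >= 2
        and 0xD800 <= units[-2] <= 0xDBFF   # high (lead) surrogate
        and 0xDC00 <= units[-1] <= 0xDFFF   # low (trail) surrogate
    )
    return units[:-2] if ends_in_pair else units[:-1]


units = to_utf16_units("A\U00010330")          # code units 0x41, 0xD800, 0xDF30
broken = delete_last_naive(units)              # leaves a lone 0xD800 behind
clean = from_utf16_units(delete_last_aware(units))
print([f"{u:04X}" for u in broken], repr(clean))
```

    An editor that lets you delete only "half" of the character behaves like
    delete_last_naive; one that removes it in a single step behaves like
    delete_last_aware.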

    OpenOffice documents using the ODT format are zipped archives containing
    several XML files. These XML files are (and have to be) fully conforming
    XML, so you can even patch one manually in a plain-text editor and
    experiment with it. Not only can you encode these characters using UTF-8,
    but you can also represent them as numeric character references (in decimal
    or hexadecimal) using the Unicode code point: it works equally well, and
    OpenOffice accepts the document without any difference. This means that ODT
    documents can be generated or modified by various tools and applications
    that know how to handle conforming XML documents, not just by saving them
    from OpenOffice itself.
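
    The equivalence of the two notations can be checked with any conforming
    XML parser. The sketch below builds a tiny zip archive in memory with a
    content.xml member (the name ODT uses for the document body) and parses
    it back. The one-element <p> document is a stand-in for illustration and
    is not a valid ODT file; the helper names are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

HEADER = b'<?xml version="1.0" encoding="UTF-8"?>'


def make_archive(xml_bytes):
    """Return the bytes of a zip archive holding xml_bytes as content.xml."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("content.xml", xml_bytes)
    return buf.getvalue()


def read_text(archive_bytes):
    """Parse content.xml from the archive and return the root element's text."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    return root.text


# The same SMP character written three ways: literal UTF-8 bytes,
# hexadecimal character reference, decimal character reference.
literal = HEADER + "<p>\U00010330</p>".encode("utf-8")
hex_ref = HEADER + b"<p>&#x10330;</p>"
dec_ref = HEADER + b"<p>&#66352;</p>"

texts = [read_text(make_archive(doc)) for doc in (literal, hex_ref, dec_ref)]
print(texts[0] == texts[1] == texts[2])  # True
```

    For a real document, the same zipfile calls can read and rewrite the
    content.xml of an existing .odt file, provided the other members of the
    archive are copied over unchanged.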


