Unicode Support in Mobile Devices
Daniel Chen - IBM Corporation & Bei Shu - IBM China Software Development Lab
The ICU (International Components for Unicode) open source project is gaining popularity these days. Adapting ICU for Unicode support in mobile applications is challenging because of the memory and performance constraints of mobile devices.
This paper discusses the current architecture of ICU and how we created a lite version of ICU for mobile devices. The session will also cover other tips and best practices for globalizing mobile applications. The intended audience is developers and system analysts who are interested in preparing their mobile applications for the global market.
Bring i18n Policy Repository to SOA
Chun Jie Tong - IBM Globalization Certification Laboratory
This paper discusses using a policy repository for i18n to resolve locale negotiation and locale-sensitive operation issues in Service-Oriented Architecture (SOA).
It is meant to convey the following information:
1. A brief introduction to SOA. This includes basic concepts of SOA, such as what a service is, how services interact, and what service choreography is.
2. Current solutions for Web services i18n. It first gives an overview of Web services i18n, the scope of i18n, and locale patterns for Web services. It also covers implementation methodology for both Web services and their clients.
3. Internationalizing a policy repository in SOA. It describes a usage scenario in which SOA is used to integrate business systems and enable business processes. It summarizes the possible i18n requirements in this scenario, and introduces a solution that creates and organizes a policy repository for Web services i18n. Each Web service in a given business process that has locale-sensitive operations performs them according to the policies defined in the repository. The hierarchical structure of the policy repository and how it is enabled are also discussed.
Internationalization Coding Convention Enforcement via AOP
Huapin Shen & Pu Yang - IBM China Company Limited Shanghai Branch
Building an internationalization solution requires developers to pay careful attention to extra considerations, such as presenting data in a culturally correct way and processing multilingual data in Unicode. But many SDKs that we are familiar with do not provide these functions sufficiently. For example, JDK 1.4 cannot handle Unicode supplementary characters, and some of its objects cannot parse and present data in a culturally correct way. We call these environment insufficiencies the i18n pitfalls, and developers have to use other libraries to handle these pitfalls by themselves.
In practice, developers do not necessarily know all the i18n pitfalls in the environment, and keeping all of them in mind while programming is tedious. Aspect Oriented Programming can help to provide an internationalization-safe environment in a largely transparent way. We have implemented a library using AspectJ that provides multiple levels of support for dealing with those i18n pitfalls: build-time warnings, runtime warnings, and automatic corrections. By using this library, developers can concentrate on the functional internationalization features only, leave most of the i18n pitfalls in the environment to the library, and still get the corresponding internationalization features. The target audience of this paper is developers concerned with internationalization features and tool makers who are interested in the internationalization field.
Keywords: Internationalization, Unicode, i18n Pitfalls, AOP, AspectJ
Generic Transliteration Rule Framework for Indian Languages
Nagarajan Krishnamurthy - Hewlett Packard India Software Operations, Bangalore
Statement of purpose
We propose a generic transliteration framework for input and display of non-Latin languages, Indian languages in particular. The framework can be used for any language with no coding on the part of the application developer; only transliteration rules have to be specified, in a form that is approachable for people who are not programmers.
Need for such a generic rules framework: The Unicode pages for various Indian languages just define code points for various elements of a given script. They do not specify any 'input' methods, nor do they have any guidelines on how input keystrokes (using any specific scheme) should be mapped to do 'glyph composition' for display of a word of the given script using a given font. So, different applications (such as web browsers, content editors, etc.) develop and implement their own input and rendering methods, causing tremendous duplication and, often, differences in the way the Indian language scripts are interpreted and displayed.
This rule-based transliteration framework for mapping input key sequences to do context-sensitive complex glyph composition achieves that purpose in one go.
The transliteration rules have various facilities to define all legal combinations of members of defined classes in the form of a unique regular expression format. This regexp has features to specify members of a given class, start and end of word, zero-width separator, context etc. The action part of a rule has various facilities to find out the appropriate glyphs, to re-order them, to specify 'split dependent vowel signs' etc.
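As a rough illustration of rule-driven transliteration, the sketch below maps Latin key sequences to a tiny subset of Devanagari. The rule tables, the regular expression, and the function name are all hypothetical; they do not reproduce the rule syntax or the library described in this abstract, and they ignore conjunct shaping, reordering, and split vowel signs.

```python
import re

# Hypothetical rule tables covering only four consonants and five vowels.
CONSONANTS = {"k": "\u0915", "m": "\u092E", "l": "\u0932", "n": "\u0928"}
VOWEL_SIGNS = {"a": "", "aa": "\u093E", "i": "\u093F", "ii": "\u0940", "u": "\u0941"}
INDEPENDENT_VOWELS = {"a": "\u0905", "aa": "\u0906", "i": "\u0907",
                      "ii": "\u0908", "u": "\u0909"}
VIRAMA = "\u094D"  # suppresses the inherent vowel of a consonant

# One rule: a member of the consonant class, optionally followed by a member
# of the vowel class, or else a standalone vowel. Longer keys come first so
# that "aa" is preferred over "a".
RULE = re.compile(r"(?P<c>[kmln])(?P<v>aa|ii|a|i|u)?|(?P<iv>aa|ii|a|i|u)")


def transliterate(keys: str) -> str:
    """Map a Latin key sequence to Devanagari using the rule tables above."""
    out = []
    pos = 0
    while pos < len(keys):
        match = RULE.match(keys, pos)
        if match is None:
            out.append(keys[pos])  # pass unknown keys through unchanged
            pos += 1
            continue
        if match.group("c"):
            out.append(CONSONANTS[match.group("c")])
            vowel = match.group("v")
            # No vowel key typed: add a virama; "a" is the inherent vowel.
            out.append(VIRAMA if vowel is None else VOWEL_SIGNS[vowel])
        else:
            out.append(INDEPENDENT_VOWELS[match.group("iv")])
        pos = match.end()
    return "".join(out)
```

With these tables, `transliterate("maalaa")` yields the four code points U+092E U+093E U+0932 U+093E. Supporting another script in this sketch would mean swapping the tables, not the function.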
The library that implements this framework for the Indian languages consists of routines for implementing the language-independent actions specified in the rules for transliteration and glyph composition. For each language, one needs to just specify the transliteration rules, and no coding to take care of special cases is needed. So, this helps the app developer to completely delegate input handling and glyph composition to this framework for all Indian languages.
In addition to phonetic input, UTF-8 input to phonetic input mappings have been developed so that a given stream of input 'words' in UTF-8 format can be 'transliterated' and rendered in a given Indian language.
Implementation: This framework has been implemented for four Indian languages/scripts (Devanagari (Hindi), Telugu, Kannada and Tamil) and has been found to work successfully.
This generic rule framework, which is particularly suited for phonetic languages like Indian ones, has the potential to provide a common framework for keyboard input and display of any given Indian language, by just changing the transliteration rules to point to the chosen language.
• Who will benefit from attending: all those who are developing apps that are to be internationalized and localized for Indian languages (and probably others), content developers and web admins.
• Description of the (business) benefit: this framework provides an OS-independent, platform-independent and vendor-independent method for developing input and display methods for phonetic, especially Indian, languages, thus enabling rapid development of apps and quick time-to-market.
Advanced Java Globalization
Charles Hornig - IBM Corporation
This presentation describes how to improve the level of global support of Java code from "works most of the time" to "works all of the time". The focus is on specific problem areas we have identified in reviews of existing IBM product code. Some of the topics covered include:
• How and when should code deal with individual "characters" in user data? Should the code deal with code units, Unicode characters, or logical characters? What about combining characters and supplementary characters?
• Is it ever appropriate to do string comparison and searching using String class methods, as opposed to the Collator class? If so, when?
• How should applications handle data intended to be read by both users and programs?
• When is case conversion appropriate?
• How should character encoding conversion errors be handled?
• How should character encoding be handled in JNI interfaces?
The presentation includes examples of how to code to avoid these problems and how to test for them.
The presentation is intended for anyone responsible for producing Java applications to be used in a multinational environment. Attendees should be familiar with the Java programming language, basic globalization concepts, and basic Unicode concepts. Writing applications with these points in mind will allow them to run in less common environments without costly customer support activity to repair failing code.
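The distinction between code units, Unicode characters, and logical characters can be made concrete with a short sketch. It is written in Python for brevity and is not taken from the presentation; the helper names are hypothetical, and the last helper is only an approximation of user-perceived characters, not an implementation of Unicode text segmentation.

```python
import unicodedata


def utf16_code_units(text: str) -> int:
    """Count 16-bit code units, which is what Java's String.length() reports."""
    return len(text.encode("utf-16-le")) // 2


def code_points(text: str) -> int:
    """Count Unicode code points (Python strings are code point sequences)."""
    return len(text)


def approx_logical_characters(text: str) -> int:
    """Rough count of logical characters: skip combining marks.

    This is a simplification for illustration, not full grapheme
    cluster segmentation.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


combining = "e\u0301"          # "e" followed by COMBINING ACUTE ACCENT
supplementary = "\U00020000"   # an ideograph outside the Basic Multilingual Plane

# Two code units and two code points, but one logical character.
print(utf16_code_units(combining), code_points(combining),
      approx_logical_characters(combining))
# Two code units, but one code point and one logical character.
print(utf16_code_units(supplementary), code_points(supplementary),
      approx_logical_characters(supplementary))
# A code-point-wise comparison only succeeds after normalization.
print(combining == "\u00e9", unicodedata.normalize("NFC", combining) == "\u00e9")
```

The same three counts diverge for any text that mixes combining marks and supplementary characters, which is why the choice of unit matters when truncating, indexing, or comparing user data.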
Transitioning a Vastly Multilingual Corporation to Unicode
Lorna Priest - SIL International
As an organization with 2,000+ linguists, coming from over 35 countries, and working in over 1,200 languages, SIL International had a strong incentive to switch over to using Unicode. Document sharing can be a nightmare when you have up to 100 different custom-encoded fonts in use in one country! And as industry began to force the use of Unicode (by not supporting legacy solutions) the incentive grew stronger. In 2001 SIL International began a concerted effort to transition the organization to using Unicode.
Many steps were involved in our transition to Unicode. These included getting computer support people on board; helping upper management to understand the complexities and the need; and developing tools for converting legacy data to Unicode, fonts which cover Latin and Cyrillic repertoires, software to use Unicode and training in using the tools of Unicode. We have discovered that training is an ongoing process. We continue to find areas where we need to focus our training efforts.
This paper addresses the steps taken, the problems that arose, and the solutions that were found.
Creating a Globalized Text Search Engine: A Case Study
Thomas Hampp-Bahnmueller - IBM Germany
Creating an engine for searching the variety of documents available in the World Wide Web is a tough challenge in terms of globalization. The task becomes even more difficult if other data sources like databases, email or content management systems have to be supported.
Of course there is the usual globalization/localization task for the application and its GUIs. But beyond that every kind of transport, storage, manipulation and display has to be able to deal with the broadest range of natural language data presented in nearly every existing language and encoding. Unicode everywhere is an absolute must-have requirement for dealing with this.
And beyond properly handling global text data, some more linguistic challenges have to be faced: Document encodings and languages have to be detected for all sorts of document formats. Word and sentence boundaries have to be determined for all kinds of scripts. Base forms of words have to be computed for all types of languages. Word normalizations (case, accent, umlaut, etc.) have to be performed, some of them not covered by any Unicode standard. Categories have to be assigned, summaries computed, synonyms expanded, spelling mistakes corrected and much, much more.
IBM's upcoming search product called Masala includes all of the above functionality and provides linguistic support for 20 languages (plus variants) and basic support for more than 50. This presentation will describe the task of integrating various kinds of globalization and linguistic technologies to provide a truly global search experience.
This talk should be interesting for users of search engines as well as people involved in creating software in the search, retrieval and text analysis area.
Unicode, Keyboards, and the Microsoft Keyboard Layout Creator
Cathy Wissink & Michael Kaplan - Microsoft Corporation
Getting data into applications by keyboards should be one of the simplest features on Windows. However, once fonts or rendering engines are added to the equation, it is anything but simple. Adding many different keyboard layouts on top of over 100 languages contributes to the chaos. Once you add the ability to define your own keyboard layouts where all of Unicode can be supported, it becomes downright complicated!
This presentation will discuss the following:
• the interaction between input, fonts, and rendering engines;
• the many features that keyboard layouts support such as dead keys and ligatures;
• code pages vs. Unicode and input;
• when IMEs are preferred and when they are not;
• the collation issues that enter into the equation;
• the Microsoft Keyboard Layout Creator, which simplifies all of the above.
Windows for the Rest of the World: Customizing Windows for Emerging Markets
Cathy Wissink & Michael Kaplan - Microsoft Corporation
In prior versions of Windows, internationalization support was focused on a complete and closed set of features. This has changed over the last two years, and is continuing to change in response to demands for a more flexible architecture. In addition, the need to localize into many new languages more quickly and at a cheaper cost has resulted in a more responsive collaboration and build process. This talk will focus on the new technologies developed by the Windows International team to meet the demands of the rapidly changing international market. Technologies discussed include:
• Language Interface Packs (LIPs);
• Microsoft Keyboard Layout Creator (MSKLC);
• out-of-band language enabling shipping with Windows XP Service Pack 2;
• other new technologies being developed for the .NET Framework and Windows "Longhorn".
Collation in ICU
by Mark Davis, Vladimir Weinstein, Andy Heninger
Collation is the general term for the process of determining the sorting order of strings of characters for a given language. It is a key function in computer systems; whenever a list of strings is presented to users, they are likely to want it in a sorted order so that they can easily and reliably find individual strings. It is also crucial for the operation of databases, not only in sorting records but also in selecting sets of records with fields within given bounds.
It is quite tricky to get collation to work correctly for many languages, and even more difficult to do it with the speed demanded by customers. Luckily, the ICU library provides a high-performance, full-functioned implementation of international collation, one that is used in IBM products and can be freely used in any other product. This presentation will review the capabilities of ICU collation and illustrate what can be done with it.
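To show why plain code point order is not enough, here is a minimal sketch of a multi-level sort key. It is an assumption-laden toy written in Python: it is not ICU's collator and not the Unicode Collation Algorithm, it has no language-specific tailoring, and the function name is hypothetical. It only mimics the idea that base letters are compared first, accents second, and case last.

```python
import unicodedata


def sketch_sort_key(s: str):
    """Toy three-level key: base letters, then accents, then original case."""
    decomposed = unicodedata.normalize("NFD", s)
    # Primary level: drop combining marks and ignore case.
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    primary = base.casefold()
    # Secondary level: accents count, case still ignored.
    secondary = decomposed.casefold()
    # Tertiary level: the original string breaks any remaining ties.
    return (primary, secondary, s)


words = ["zebra", "\u00c4pfel", "apple", "Zoo", "\u00e1gua"]

# Code point order puts "Zoo" first and the accented words last.
print(sorted(words))
# The toy key interleaves accented and unaccented words as a reader expects.
print(sorted(words, key=sketch_sort_key))
```

A production system would use a real collation implementation instead, since correct ordering differs between languages and this sketch handles none of those differences.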
10 Years of Technology Development at the American University of Armenia
Richard Youatt - American University of Armenia Corporation
The purpose of this session is to condense 10 years of experience to a readily comprehensible level for those interested in working with the FSU or Central European states. It will stress the importance of cultural and organizational factors and their interrelationship to technical issues.
The paper will review the broader context of international relations and technology, starting with the fact that prior to 1991, Cold War relationships between the USSR and the US prohibited the export of US technology beyond an Intel 386 CPU. It will look at the subsequent transformations in international relations that permitted the creation of the American University of Armenia (AUA), the export of more advanced US technologies, and the establishment of a program in Computer Science and Information Technology oriented to Western techniques.
It will focus specifically on the role of Unicode, ISO and technical aspects of the reconciliation of international standards...with some focus on linguistics and communications issues.
It aims to condense the essence of "lessons learned" in 10 years of experience for the benefit of those interested in similar endeavours in the FSU or Central Europe.
Attendees should have an interest in the synergy of technical, cultural, educational and organizational issues, but are not expected to be technical specialists.
The business benefit is that attendees can benefit from pertinent knowledge and the experience of others.
Analyzing Unicode Text: Regular Expressions, Boundaries, Sets and More
Andrew Heninger - IBM Corporation
Regular expressions have been widely used for many years to analyze, parse or extract desired information from text data. They are used in applications large and small, and everywhere in between, from simple search operations in word processors to scripting languages such as Perl to queries on large databases.
Traditional regular expressions cannot easily deal with a character set of the size and complexity of Unicode. To address this shortcoming, the Unicode Consortium has published Technical Report #18, a set of guidelines for extending regular expressions to handle Unicode data. Following this allows organizations to correctly deal with data in different languages and scripts.
This paper will review the issues and techniques involved in writing Regular Expressions for Unicode data. The guidelines from TR 18 will be reviewed, including a discussion of Unicode encoding forms, character properties and classes, text boundaries, case sensitivity and normalization, and the implications of all of these for handling different languages in regular expressions. The paper will also survey the capabilities and limitations of those regular expression implementations known to provide significant support for Unicode.
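A small sketch can illustrate two of the issues named above, character classes and normalization. It uses Python's standard `re` module, which covers only part of what TR 18 describes; the helper names are hypothetical, and `normalized_search` assumes the pattern is literal text rather than arbitrary regular expression syntax.

```python
import re
import unicodedata


def find_words(text: str):
    """Find runs of word characters; for str patterns, \\w is Unicode-aware."""
    return re.findall(r"\w+", text)


def find_ascii_words(text: str):
    """The traditional ASCII-only class, shown for contrast."""
    return re.findall(r"[A-Za-z]+", text)


def normalized_search(literal: str, text: str, flags: int = 0):
    """Search after bringing both literal pattern and text to NFC."""
    nfc_pattern = unicodedata.normalize("NFC", literal)
    nfc_text = unicodedata.normalize("NFC", text)
    return re.search(nfc_pattern, nfc_text, flags)


sample = "Caf\u00e9 na\u00efve \u6771\u4eac 42"
print(find_words(sample))        # accented and CJK words stay whole
print(find_ascii_words(sample))  # the ASCII class splits them apart

decomposed = "cafe\u0301"        # "e" + combining acute accent
print(re.search("caf\u00e9", decomposed))         # None: code points differ
print(normalized_search("caf\u00e9", decomposed)) # a match after NFC
```

The contrast between the two word finders is the kind of silent data loss that ASCII-era patterns cause on multilingual text.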
The presentation is intended primarily for users of regular expressions rather than implementers of regular expression engines.
Reworking the Graphite API
Sharon Correll - SIL International
Graphite is a package that has been developed by SIL International to provide extensible rendering for complex writing systems. It includes a programming language called GDL (Graphite Description Language) that uses rules to specify smart font behaviors, such as stacking diacritics, contextual glyph selection, and reordering. This sort of extensibility and flexibility makes Graphite particularly suitable for minority languages that have special character needs, and those whose scripts are not yet included in Unicode or do not have operating system support.
Graphite's API was originally developed to coordinate with FieldWorks, SIL's new suite of linguistic tools. When we began efforts at integrating Graphite support into third-party projects, such as Mozilla and OpenOffice, we began to see how some reorganization of the API and high-level classes is needed to facilitate this kind of interoperability. In addition some extensions are needed to allow flexibility in how applications choose to support complex drawing and editing operations.
This talk will give an overview of the Graphite system capabilities, and discuss the current API, its limitations, and the changes that are in progress.
Getting Started With ICU - Part I - Beginner
Vladimir Weinstein - IBM Corporation
ICU is a mature, widely used set of C/C++ and Java libraries for Unicode support and software internationalization and globalization (i18n/g11n). It grew out of the JDK 1.1 internationalization APIs (which the ICU team contributed) and continues to be developed for the most advanced Unicode/i18n support. ICU is widely portable and gives applications the same results on all platforms and between C/C++ and Java software.
This presentation walks the audience through the core concepts of using the ICU library, providing an introduction to how to set up and use ICU in practice. The concepts are presented using two internationalization tasks, illustrating the use of ICU for character conversion (with examples in C) and collation (with examples in Java). The presentation will walk through code snippets to solve these tasks, followed by a discussion of core features and conventions.
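The character conversion task can be sketched as a pivot through Unicode: decode the source bytes to Unicode text, then encode that text in the target charset. The example below uses Python's built-in codecs as a stand-in and does not show ICU's own converter API; the `convert` helper and its `errors` parameter are assumptions made for illustration.

```python
def convert(data: bytes, source: str, target: str, errors: str = "strict") -> bytes:
    """Convert bytes between charsets by pivoting through Unicode text.

    With errors="strict", malformed input or unmappable characters raise
    an exception; with errors="replace", they are substituted instead.
    """
    text = data.decode(source, errors=errors)   # legacy bytes -> Unicode
    return text.encode(target, errors=errors)   # Unicode -> target bytes


japanese = "\u65e5\u672c\u8a9e"
sjis_bytes = japanese.encode("shift_jis")
utf8_bytes = convert(sjis_bytes, "shift_jis", "utf-8")
print(utf8_bytes.decode("utf-8") == japanese)

# Malformed input: strict mode fails loudly, replace mode substitutes U+FFFD.
try:
    convert(b"\xff", "utf-8", "utf-8")
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)
print(convert(b"\xff", "utf-8", "utf-8", errors="replace"))
```

Whether to fail or substitute is an application decision; substituting silently can hide data corruption, so strict handling is usually the safer default during development.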
Getting Started With ICU Part II - Intermediate
Eric Mader - IBM Corporation
ICU is a mature, widely used set of C/C++ and Java libraries for Unicode support and software internationalization and globalization (i18n/g11n). It grew out of the JDK 1.1 internationalization APIs (which the ICU team contributed) and continues to provide the most advanced Unicode/i18n support. ICU is widely portable and gives applications the same results on all platforms and between C/C++ and Java software.
This presentation walks the audience through the core concepts of using the ICU library to build internationalized software. The concepts are presented using two internationalization tasks: message formatting (with examples in Java) and text boundary analysis (with examples in C). The presentation will walk through code snippets for accomplishing these tasks, followed by a discussion of advanced techniques, related features, and where to find more information.
Localizing with XLIFF & ICU
Ram Viswanadha - IBM Corporation
This talk discusses the file formats and processes that are involved in software localization. Platform-specific formats are contrasted with an emerging industry standard (XLIFF) that is designed for efficient translation processing.
A globalized application does not have any user interface elements that differ by language or culture (text, icons, etc) in the source code. Instead, these elements are stored as separate elements, called resources. The process of translating these resources is called localization. Many source formats exist for representing and interchanging resources, according to different platforms and technologies: VC++ RC files, Java ResourceBundles, POSIX message catalogs, ICU resource bundles, etc. Translators, who usually are non-programmers, have to deal with this large variety of formats for translating the content of these resources. Tools are available for assisting translators in dealing with these formats, but in many cases the formats don't permit tools to support the most efficient process.
XLIFF, a format designed by localization industry experts for solving problems faced by translators, is an emerging industry standard for authoring and exchanging content for localization. After discussing the general issues, this talk will present an overview of how ICU facilitates the localization of a product using XLIFF, describe a process for managing the localization, and then walk through a case study of product localization.
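To make the idea of exchanging resources as XLIFF more concrete, the sketch below turns a dictionary of key/text resources into a minimal XLIFF document. It is not ICU's tooling: the function name and the sample file name are hypothetical, and the XLIFF 1.2 element layout and namespace are an assumption, since the abstract does not say which XLIFF version the talk uses.

```python
import xml.etree.ElementTree as ET

# Namespace of XLIFF 1.2 (assumed version for this sketch).
XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"


def resources_to_xliff(resources: dict, original: str,
                       source_lang: str, target_lang: str) -> str:
    """Serialize key/text resources as a minimal XLIFF document string."""
    ET.register_namespace("", XLIFF_NS)
    root = ET.Element(f"{{{XLIFF_NS}}}xliff", {"version": "1.2"})
    file_el = ET.SubElement(root, f"{{{XLIFF_NS}}}file", {
        "original": original,            # the resource file this came from
        "source-language": source_lang,
        "target-language": target_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(file_el, f"{{{XLIFF_NS}}}body")
    for key, text in resources.items():
        # One trans-unit per resource; the translator later adds a target.
        unit = ET.SubElement(body, f"{{{XLIFF_NS}}}trans-unit", {"id": key})
        ET.SubElement(unit, f"{{{XLIFF_NS}}}source").text = text
    return ET.tostring(root, encoding="unicode")


xliff_text = resources_to_xliff(
    {"greeting": "Hello", "farewell": "Goodbye"},
    original="messages.properties",   # hypothetical source file name
    source_lang="en",
    target_lang="de",
)
print(xliff_text)
```

Because every platform format can be mapped to the same trans-unit structure, translation tools only need to understand one format instead of one per platform.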
Developing International Applications with Visual Studio 2005
Marin Millar - Microsoft Corporation
The purpose of this talk is to provide education about new features in the .NET Framework Runtime and Visual Studio 2005 which provide richer functionality for creating globalized and localized applications. The talk will cover new programming interfaces in the .NET Runtime for handling new cultures and facilitating international text manipulation. The talk will also demo and discuss the features in Visual Studio 2005 for creating localized Windows Forms and Web applications. People who are developing multilingual applications or need to handle international text will learn how to utilize the .NET Framework and use Visual Studio 2005 for creating world-ready Windows and web-based applications. The target audience is intermediate-level developers, program managers or designers.
UNIcode Table based Input Method (UNIT)
Sriram Swaminathan - Sun Microsystems, Inc.
Statement of purpose
UNIcode Table based Input Method that enables multilingual text input using the next generation IM Framework (IIIMF).
As part of the IIIMF (Internet/Intranet Input Method Framework), there is a server side generic, multilingual composition engine called UNIT (UNIcode Table based IM).
1. The current UNIT engine supports 8 Indic languages (Hindi, Bengali, Gujarati, Malayalam, Gurmukhi, Kannada, Tamil and Telugu), European, Cyrillic, Hebrew, Arabic, Unicode code point based input methods (Unicode-Hex & Unicode-Oct), Vietnamese and two African languages (Tigrigna-Eritrean and Amharic).
2. The UNIT engine has the capability to support multiple keyboard layouts or input methods for each of the supported languages.
3. It has the capability to add/remove supported languages as well as keyboard layouts using configuration files.
Description of the business benefit :
UNIT enables multilingual composition capability for a wide range of applications.
Unicode Support in ICU for Java
Doug Felt - IBM Corporation
ICU is an open-source software library for Unicode support and software internationalization, provided in both C/C++ and Java. Many people ask, why use ICU4J (ICU for Java) when Java already provides internationalization support? This talk discusses the functionality ICU4J provides that is not in the Java runtime, and why you might want to use it. Some features of ICU4J are APIs for normalization, Unicode transforms, and full Unicode character property support. The talk also covers how you can package and deliver products that use ICU4J.
SING: Adobe's New Gaiji Architecture
Jim DeLaHunt - Adobe Systems Incorporated
High-quality Chinese, Japanese, and Korean (CJK) typography requires an open-ended class of Chinese-derived characters. Different CJK fonts provide a core selection of these characters as part of their standard character set. However, publishers need supplemental glyphs or characters that are known as "gaiji." This presentation describes a new Adobe initiative to address the gaiji requirement: the SING architecture. SING answers the need for a flexible gaiji workflow on the desktop. SING enables you to extend your CJK fonts with individual new OpenType-based "glyphlets," representing variant glyph shapes or symbols. These glyphlets are embedded in documents and move through the workflow. Adobe intends to include SING in future Adobe products. Building on work presented at the 22nd IUC in September 2002, this presentation reviews why gaiji are important; it describes the SING architecture; and it looks at some implications for the Unicode character-glyph model. We will demonstrate SING Technology Preview software. The presentation assumes basic knowledge of the Japanese, Chinese, or Korean writing systems, and of text formatting and the Unicode character-glyph model. Anyone entering, processing, or displaying text that contains person or place names in CJK languages, be it for publishing, for corporate databases, or for the web and cell phones, will encounter the gaiji requirement, and would benefit from being aware of the SING approach.
Introduction to Java Desktop System Globalization
Hideki Hiura - Sun Microsystems, Inc.
Java Desktop System is a fully integrated Unicode desktop environment, based on open source components and standards. This session will cover its globalization architecture and user experience, from the office suite components to popular desktop components such as mail, calendar, Web browser, and instant messaging, and explain how it affects desktop system users worldwide.
Unicode Nearly Plain-Text Encoding of Mathematics
Murray Sargent III - Microsoft Corporation
With a few conventions, Unicode 4.0 can encode most mathematical expressions in readable nearly-plain text. This format is substantially more compact and easier to read compared to [La]TeX or MathML and it looks very much like a legitimate mathematical notation. Most mathematical expressions up through calculus can be represented unambiguously in this format, from which they can be easily built up or exported to [La]TeX, MathML, C++, and symbolic manipulation programs. The format is useful for 1) inputting technical text, 2) searching for math expressions, 3) displaying by text engines that don't support built-up math display, and 4) computer programs. Use of this format will be demonstrated as an input method for technical documents.
Who is Running this Project, anyway? Managing Distributed Development Teams in China
Jacob Hsu - Symbio Group
In today's "ROI or Die" market, it is a business reality that software is often designed in the U.S. or Europe and then partially or fully developed in China, which has significantly lower engineering rates, to lower costs and decrease project delivery time. Distributed development is often used wherein a company's internal development team works in conjunction with the offshore lab. However, without building certain cultural, methodological and technological foundations into an organization, it can be difficult to effectively manage projects being completed by dispersed teams. This session looks at the issues that must be recognized and resolved when managing successful distributed development projects in China.
Although the methodological and technical framework issues need to be addressed, the bulk of the remaining challenges are related to "soft issues", including cultural incompatibility, leadership problems, trust issues and negative competitiveness. These are actually the major obstacles to successful completion of distributed projects. However, there are concrete ways to alleviate these problems, including re-defining your corporate culture, improving project planning, and adjusting project staffing and technical infrastructure. In addition, this model takes advantage of the face time of the on-shore team and the cost effectiveness of the offshore team while maintaining quality and decreasing delivery time, since the working "day" is expanded to 18-24 hours.
This session identifies the changes in corporate culture and the tools needed to successfully manage teams distributed between the US and Asia.
Trends in Internationalization Outsourcing
Jacob Hsu - Symbio Group
The internationalization outsourcing market is changing rapidly. Large projects have been moving "offshore" to Ireland and India for years now, but relatively new players including China and formerly "Iron Curtain" countries such as Romania are driving prices down further. But which regions have the skills necessary to handle internationalization?
At the same time, less end-to-end outsourcing is occurring, and smaller, more specific I18N tasks are being outsourced. This is partly because the software development industry is maturing and I18N is becoming more integrated into the core development cycle. Also, understanding of I18N issues is becoming more widespread. So what is next? In this presentation, we will discuss new ways to cut I18N costs, new outsourcing centers and new I18N testing techniques that will help keep you on the cutting edge.
Internationalized Resource Identifiers (IRIs) - An Update
Martin Dürst - W3C/Keio University
Internationalized Resource Identifiers (IRIs) extend URIs (Web addresses) to include non-ASCII characters and integrate Internationalized Domain Names (IDN). Their conversion to URIs is based on UTF-8.
This talk will explain general IRI concepts, from a content creator, webmaster, and end user perspective. The focus of the talk will be recent progress. By the time of the talk, the IRI specification should have advanced from an Internet-Draft towards an RFC. Based on this, implementations and testing are also expected to advance.
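The UTF-8 based conversion from an IRI to a URI can be sketched as follows: the host name goes through the IDN encoding, and non-ASCII characters elsewhere are encoded as UTF-8 and percent-escaped. This is a simplified illustration, not the full mapping defined by the IRI specification; the function name is hypothetical, userinfo is ignored, and Python's built-in `idna` codec (which follows the original IDNA rules) is assumed to be acceptable for the host.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched: URI reserved characters and existing escapes.
_SAFE = "/?:@!$&'()*+,;=%"


def iri_to_uri(iri: str) -> str:
    """Convert an IRI to a URI (simplified: no userinfo, no normalization)."""
    parts = urlsplit(iri)
    host = parts.hostname or ""
    if host:
        # Internationalized domain name -> ASCII-compatible "xn--" form.
        host = host.encode("idna").decode("ascii")
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    # quote() encodes non-ASCII characters as UTF-8 and percent-escapes them.
    return urlunsplit((
        parts.scheme,
        netloc,
        quote(parts.path, safe=_SAFE),
        quote(parts.query, safe=_SAFE),
        quote(parts.fragment, safe=_SAFE),
    ))


print(iri_to_uri("http://b\u00fccher.example/wiki/D\u00fcrst?q=caf\u00e9"))
```

An IRI that is already pure ASCII passes through this sketch unchanged, which matches the intent that every URI is also a valid IRI.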
Building Localized Applications in Cocoa on Mac OS X
Deborah Goldsmith - Apple Computer, Inc.
In addition to improved Unicode support, Mac OS X offers an improved localization model over its predecessors. A single binary can readily support multiple localizations, and users can switch between localizations without needing to restart their machine. A user can switch the locale used by applications, then simply quit and restart an application for it to begin using the new locale.
In this tutorial, we will discuss the ins and outs of writing a localized application for Mac OS X. Particular attention will be given to the Cocoa framework, which is a powerful class library providing the ability to quickly write applications with Unicode support. We will also discuss the OS X bundle mechanism, which provides a flexible way to include data (including localizations of text strings and UI elements) within an application.
We will demonstrate how the tools Apple provides can be used with the advanced multilingual capacities of Mac OS X to quickly and easily write a fully localized application.
Adding Language Support to Mac OS X
Deborah Goldsmith - Apple Computer, Inc.
Mac OS X comes localized into 16 different languages, any of which may be selected by users. In addition, users can create and view content in many more languages by using the included fonts and keyboard layouts. Languages supported include those using Roman, Japanese, Chinese, Korean, Cyrillic, Greek, Arabic, Hebrew, Thai, Indic, Armenian, Canadian Syllabics, and Cherokee scripts.
Mac OS X allows customers to augment this set themselves, by adding fonts, keyboard layouts, and (via the Unicode Common Locale Data Repository, or CLDR) locale data.
Attendees will learn everything they need to know to add support for new languages to Mac OS X. We will cover how Mac OS X works with fonts, and how you can use the available tools to add extra features to a font. We will also show how to construct a new keyboard layout and add it to Mac OS X. Finally, we will discuss how Mac OS X uses ICU and CLDR, and how you can participate in the CLDR process to add support for new locales.
Fun with Regular Expressions: An Implementation of the Unicode Bidi Algorithm
Martin Dürst - W3C/Keio University
The Unicode Bidirectional (Bidi) Algorithm is often thought to be very complicated and difficult to implement. However, using the right abstraction, it is actually very easy to implement. This abstraction is substitution based on regular expressions. The various rules of the Bidi algorithm can be mapped almost directly to such substitutions, leading to a straightforward implementation that is easy to understand and can aid understanding of the Bidi algorithm.
The talk will start slowly, introducing the necessary concepts from bidirectional rendering and from regular expressions one-by-one. A full implementation of the algorithm will be built up, followed by some discussion of variants (e.g. for older versions of the Bidi algorithm or to take higher-level protocols into account). We will use Perl for the examples, mostly because of its compact syntax, but will also show the corresponding constructs in other programming languages with good regular expression support (e.g. Python).
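As a rough illustration of the substitution idea (not the speaker's implementation), the Python sketch below maps each character to a one-letter code for its bidi class and then applies three of the weak-type rules of the algorithm (W4, W5, W6) as regular expression substitutions. The one-letter codes and function names are assumptions made for this example; earlier rules, explicit embeddings, and boundary-neutral characters are not handled, and every class not needed by these rules is lumped into "O".

```python
import re
import unicodedata

# Hypothetical one-letter codes for the bidi classes used by rules W4-W6.
# Every other class is lumped into "O" (a simplification for this sketch).
_CODES = {"L": "L", "R": "R", "AL": "A", "EN": "E", "AN": "N",
          "ES": "S", "ET": "T", "CS": "C"}


def bidi_codes(text: str) -> str:
    """Return one code letter per character, based on its bidi class."""
    return "".join(_CODES.get(unicodedata.bidirectional(ch), "O")
                   for ch in text)


def resolve_w4_to_w6(codes: str) -> str:
    """Apply weak-type rules W4, W5, and W6 as regex substitutions."""
    # W4: a single European separator or common separator between two
    # European numbers becomes a European number; a single common
    # separator between two Arabic numbers becomes an Arabic number.
    codes = re.sub(r"(?<=E)[SC](?=E)", "E", codes)
    codes = re.sub(r"(?<=N)C(?=N)", "N", codes)
    # W5: a run of European terminators adjacent to a European number
    # becomes European numbers.
    codes = re.sub(r"T+(?=E)|(?<=E)T+",
                   lambda m: "E" * len(m.group()), codes)
    # W6: remaining separators and terminators become other neutrals.
    codes = re.sub(r"[STC]", "O", codes)
    return codes


for sample in ["1,234", "$1.50", "a, b"]:
    print(sample, bidi_codes(sample), resolve_w4_to_w6(bidi_codes(sample)))
# 1,234 ECEEE EEEEE
# $1.50 TECEE EEEEE
# a, b LCOL LOOL
```

Because each rule works on a string of class codes rather than on the text itself, the remaining rules can be added as further substitutions in the same style.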
Federated Metasearch Strategies and Issues: Searching, Finding, and Archiving Data of the World from Disparate Public Systems
Michael McKenna - California Digital Library, University of California
Libraries and museums have been cataloging and archiving creative, scholarly, and political works since before the time of the Greeks. In the Digital Age, institutions have been storing metadata about information for the past thirty years or more. Even though standards exist, and have existed for some time, each institution may have chosen to store its information in different formats or encodings, may use different subsets of metadata, or different protocols to access the information.
In order to allow federated metasearch (distributed searching) across multiple repositories physically owned and managed by different institutions, several problems must be overcome. This paper will discuss problems and solutions related to the California Digital Library (CDL) which links all libraries across all campuses of the University of California, several museums, California public libraries, and links to other institutions such as Stanford, MIT, and the Library of Congress.
In addition, we will take a look at emerging problems as the CDL takes on the issues of archiving antiquities, oral histories, historical web sites, and non-textual digital media.
The World of Localisation: Organisations, Resources and Contacts
Reinhard Schäler - The Institute of Localisation Professionals (TILP)
Are you responsible for the development of multilingual digital content and for the localisation of digital material? Are you looking for reliable, unbiased information on localisation service providers, standards in different countries, average industry figures on costs or return on investment, or just plain details on technical and cultural conventions in your target locales? There are resources available that can help you answer these and similar questions. However, locating and gaining access to them is not always easy; often it is difficult to see which event will provide you with the information you are looking for and which organisation will allow you access to independently compiled information. This session will help you overcome the information barrier and provide you with an independent overview of:
• The for-profit and not-for-profit organisations serving the localisation community
• The localisation service industry
• Information sources available in written form (journals, books, web sites)
Microsoft WinFX: Composite Font and Typography Extension
Michel Suignard - Microsoft Corporation
This talk will present new functionality being planned for WinFX (the future Windows platform). A new font association mechanism (composite font) will alleviate the need for large fonts by combining several physical fonts into a single logical font. The mechanism will also facilitate the support of CJK Extension B characters. In addition, new access is provided to OpenType features. A demonstration will be part of the talk.
ALI-BABA - and the 4.0 Unicode Characters - 2
Thomas Milo - DecoType
Researching the Indigenous Scripts of Africa: An Exercise in Utility
Charles Riley - Yale University
This session will attempt to summarize the ongoing questions raised in the uniquely challenging context of working with African non-Roman scripts: e.g., Bamum, Bassa Vah, Bete, Kpelle, Loma, Mende, Oromo, and Vai, among others. To a large extent, the user communities which make use of these scripts are loosely organized, and often lack the resources and legacy encoding standards from which to draw in developing proposals for their scripts to be represented. Even so, they are of considerable research interest to specialists in the academic community, including those in linguistics, history and the social sciences. A few discussion lists have been started to share information and ideas on how to ensure the representation of these scripts in Unicode, but more work needs to be done, particularly in the areas of identifying primary resources, generating appropriate mechanisms for approving standards, and developing usable input methods. The purpose of this session will be primarily to summarize and present the discussions in this area to a broader technical audience, and to generate more interest and energy in bringing these scripts into the Unicode Standard.
A Social Commentary on Tibetan Unicode and Chinese National Standards
Christopher Walker - University of Chicago
The Chinese proposed standards for Tibetan-encoded IT, such as last year's BrdaRten proposal seeking a secondary encoding for Tibetan Unicode, often have more at stake than bridging the digital divide for this minority population in China. Given its interest in creating 'digital libraries' that mimic those now active in the West, China wishes to rapidly distribute via the Internet existing Tibetan literature and representations produced and sanctioned internally. The Chinese government has a long-standing partnership with domestic commercial companies such as Founder, which provide the technology and fonts required for taking Tibetan texts to publishing houses. This proprietary Tibetan encoding and DOS editor system have posed challenges of affordability and implementation for the average Tibetan, even if the cost of adding Founder's specialized hardware to a computer has now dropped below a couple hundred dollars. With major operating systems in the West now supporting Tibetan Unicode and OpenType, not to mention other countries and locales which lend support to Tibet IT, China is now busy orienting itself to the new terrain of such internationalized scripts.
IDN and Applications
Michel Suignard - Microsoft Corporation
Internationalized Domain Names (IDN) is now an IETF standard and opens the possibility of creating internationalized names for domain and host names. This talk will describe in detail the IDN architecture and its different components. It will also cover the effects on applications and how IDN affects end user interaction concerning resource identification and location. It will also briefly mention the interaction between IDN and Internationalized Resource Identifiers (IRIs).
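To make the encoding component of IDN concrete, the following Python sketch converts a single non-ASCII label to its ASCII-compatible form using Punycode and the "xn--" prefix, and compares the result with Python's built-in `idna` codec (which implements the 2003 IDNA rules). The helper names are hypothetical, and the sketch deliberately omits the string preparation (Nameprep) step that a real implementation performs before encoding.

```python
ACE_PREFIX = "xn--"  # prefix marking an ASCII-compatible encoded label


def label_to_ace(label: str) -> str:
    """Hypothetical helper: encode one label with Punycode.

    ASCII-only labels are returned unchanged. Nameprep is skipped,
    so the input is assumed to be already lower-cased and normalized.
    """
    if label.isascii():
        return label
    return ACE_PREFIX + label.encode("punycode").decode("ascii")


def label_from_ace(label: str) -> str:
    """Hypothetical helper: decode one label back to Unicode."""
    if label.lower().startswith(ACE_PREFIX):
        return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    return label


ace = label_to_ace("bücher")
print(ace)                                    # xn--bcher-kva
print(label_from_ace(ace))                    # bücher

# The built-in codec applies the same encoding to every label of a name.
print("bücher.example".encode("idna"))        # b'xn--bcher-kva.example'
print(b"xn--bcher-kva.example".decode("idna"))  # bücher.example
```

An application would typically show the Unicode form to the user and pass the ASCII form to the resolver.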
What's New in Unicode 4.0?
Asmus Freytag - ASMUS, Inc.
Service Oriented Architecture & Web Services Globalization Support in Software Platforms: a Technical Overview
Adnan Masood - Next Estate Communications
In the emerging era of service orientation and web services orchestration, software architects and developers enjoy the choice, flexibility and strength of various software platforms available for developing robust interoperable applications. The objective of this paper is to examine the XML web services capabilities these platforms provide for globalization, internationalization, localization, and multi-cultural application development in three major target niches: web, desktop, and mobile applications. Both major development platforms, J2EE (Java 2 Enterprise Edition) and the Microsoft .NET Framework, support globalization inherently within their class architecture. This paper will examine the salient features of each class framework from the following perspectives.
XML web services internationalization in Desktop Applications: J2EE vs .NET framework
• Standardization: Character set / Unicode support in class framework.
• Ease of Implementation: Measuring class framework's inherent support for handling localization at the IL/JRE level.
• Deployment: Target Platform compatibility & support.
• Coupling: Degree of cohesion between Presentation & Business Logic.
• Source Granularity: Code Manageability & Function point analysis
• Performance & Optimization: Execution time-space tradeoff measurements
• Interoperability: Inter-process & Inter-application Communication support
XML web services internationalization in Web Applications: Struts (J2EE) vs ASP.NET (.NET framework)
• Standardization: Web client handling, Browser support for Character set / Unicode support in class framework.
• Ease of Implementation: Measuring class framework's inherent support for handling localization at the server side/ sandbox (applet) level.
• Performance & Optimization: Execution time-space tradeoff measurements from client and server perspective.
• Interoperability: Cross page posting, Unicode over HTTP, Unicode XML/SOAP, web services, Inter-process & Inter-application Communication support
XML web services internationalization in Mobile Applications: Java 2 Micro Edition (J2ME) vs. ASP.NET / Windows CE (.NET Framework)
• Standardization: Character set / Unicode support in class framework for mobile devices.
• Ease of Implementation: With diverse device interfaces, measuring class framework's inherent support for handling localization at the IL/JRE level.
• Profiling: Device information profile compatibility in cultural application.
• Performance & Optimization: Execution time-space tradeoff measurements
• Interoperability: Overview of SIP, MIDP, MMAPI, CDC, MMIT, WML and their localization capabilities.
Along with sample implementation & code samples, comparative matrices are part of the paper to depict the XML web services internationalization support in software platforms.
This paper will benefit organizations looking to invest in projects based on web services internationalization. It presents the pros and cons of the different development platforms and their support for handling XML web services, and will therefore help them evaluate the right solution for their target application.
My presentation at ASLIB's "Translating & the Computer 25" last year (http://www.aslib.com/conferences/programme.htm) was on the following topic.
The Design and Implementation of Generic, Machine Translation Vendor Neutral Web Services
A generic web services based communication mechanism for translation vendors is proposed in this paper. Considering the inter-communication between translation vendor, client and translator, either machine or manual, this simple framework's propositions are based on a detailed contemporary study of the implementation of web service architecture for real-time translation service vendors.
I am the author of various articles on XML Web Services.
International Unicode Conferences are organized by Global Meeting Services, Inc., (GMS). GMS is pleased to be able to offer the International Unicode Conferences under an exclusive license granted by the Unicode Consortium. All responsibility for conference finances and operations is borne by GMS. The independent conference board serves solely at the pleasure of GMS and is composed of volunteers active in Unicode and in international software development. All inquiries regarding International Unicode Conferences should be addressed to firstname.lastname@example.org.