 

Unicode: A Sea Change

Without much fanfare, Unicode has completely transformed the foundation of software and communications. Whenever you read or write anything on a computer, you’re using Unicode. Whenever you search on Google, Yahoo!, Bing, Wikipedia, or many other websites, you’re using Unicode. Unicode marks a major milestone in providing people everywhere the ability to use their own languages on computers.

We developed Unicode with a simple goal: to unify the many hundreds of conflicting ways to encode characters, replacing them with a single, universal standard. Those existing legacy character encodings were both incomplete and inconsistent: Two encodings could use the same internal codes for two different characters and use different internal codes for the same characters; none of the encodings handled any more than a small fraction of the world’s languages. Whenever textual data was converted between different programs or platforms, there was a substantial risk of corruption. Programs were hard-coded to support particular encodings, making development of international versions expensive, testing a nightmare, and support costs prohibitive. As a result, product launches in foreign markets were expensive and late—unsatisfactory both for companies and their customers. Developing countries were especially hard-hit; it was not feasible to support smaller markets. Technical fields such as mathematics were also disadvantaged; they were forced to use special fonts to represent arbitrary characters, but when those fonts were unavailable, the content became garbled.
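The inconsistency described above can be reproduced with the legacy codecs that ship with Python. The following is a minimal sketch; the byte value and the three code pages (Latin-1, Windows-1251, and CP437) are chosen only for illustration and are not taken from the original text.

```python
# One byte, read under two legacy code pages, yields two different characters.
raw = b"\xe9"
as_latin1 = raw.decode("latin-1")   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
as_cp1251 = raw.decode("cp1251")    # U+0439 CYRILLIC SMALL LETTER SHORT I

# One character, written under two legacy code pages, yields two different bytes.
e_acute = "\u00e9"
latin1_bytes = e_acute.encode("latin-1")  # b'\xe9'
cp437_bytes = e_acute.encode("cp437")     # b'\x82'

# In Unicode, each of the two characters has its own single code point,
# regardless of which legacy encoding the byte originally came from.
code_points = [f"U+{ord(c):04X}" for c in (as_latin1, as_cp1251)]
print(code_points)
```

A program that guessed the wrong code page for the byte above would silently display the wrong letter, which is the kind of corruption risk the paragraph refers to.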

The Unicode Standard changed that situation radically. Now, for all text, programs need only a single representation, one that supports all the world’s languages. Programs can be easily structured with all translatable material separated from the program code and put into a single representation, providing the basis for rapid deployment in multiple languages. Thus, multiple-language versions of a program can be developed almost simultaneously at a much smaller incremental cost, even for complex programs like Microsoft Office or OpenOffice.

The assignment of characters is only a small fraction of what the Unicode Standard and its associated specifications provide. They give programmers extensive descriptions and a vast amount of data about how characters function: how to form words and break lines; how to sort text in different languages; how to format numbers, dates, times, and other elements appropriate to different languages; how to display languages whose written form flows from right to left, such as Arabic and Hebrew, or whose written form splits, combines, and reorders, such as languages of South Asia; and how to deal with security concerns regarding the many “look-alike” characters from alphabets around the world. Without the properties, algorithms, and other specifications in the Unicode Standard and its associated specifications, interoperability between different implementations would be impossible.
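A small part of this character data is exposed by Python's standard `unicodedata` module. The sketch below looks up a few properties for sample characters; the helper name `describe` and the choice of characters are assumptions made for illustration, and the module covers only a subset of what the Standard and its associated specifications define.

```python
import unicodedata

# Sample characters, written as escapes so the look-alikes stay distinguishable.
latin_a = "\u0061"      # Latin small letter a
cyrillic_a = "\u0430"   # Cyrillic small letter a (visually similar to the Latin one)
hebrew_alef = "\u05d0"  # a right-to-left letter
arabic_ain = "\u0639"   # a right-to-left Arabic letter


def describe(ch):
    """Return a few Unicode Character Database properties for one character."""
    return {
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),
        "bidi": unicodedata.bidirectional(ch),
    }


# The two look-alikes are different characters with different names.
lookalikes_differ = latin_a != cyrillic_a

# Normalization is one of the standardized algorithms: "e" followed by a
# combining acute accent composes to the single precomposed character.
composed = unicodedata.normalize("NFC", "e\u0301")

for ch in (latin_a, cyrillic_a, hebrew_alef, arabic_ain):
    print(describe(ch))
```

The bidirectional class returned here is the per-character input to right-to-left layout, and the name comparison hints at why look-alike characters are treated as a security concern.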

With the rise of the web, a single representation for text became absolutely vital for seamless global communication. Thus the textual content of HTML and XML is defined in terms of Unicode—every program handling XML must use Unicode internally, and all major browsers handle the world's HTML pages using Unicode internally. Furthermore, if you are using a desktop or a mobile device with an operating system such as Windows, OS X, iOS or Android, your operating system also uses Unicode natively. Search engines all use Unicode, and for good reason: even if a web page is in a legacy character encoding, the only effective way to index that page for searching is to transform it into the lingua franca, Unicode. All of the text on the web thus can be stored, searched, and matched with the same program code. Since all of the search engines transform web pages into Unicode, the most reliable way to have pages searched is to have them be in Unicode in the first place.
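The indexing argument can be illustrated with a toy example. This is a sketch under stated assumptions, not a description of how any search engine is implemented: the page names, the sample text, the `search` helper, and the premise that each page's declared encoding is known and correct are all hypothetical.

```python
# Two hypothetical pages containing the same Russian word, each stored in a
# different legacy encoding, so the underlying bytes do not match.
pages = {
    "page-a": ("Война и мир".encode("cp1251"), "cp1251"),
    "page-b": ("мир и труд".encode("koi8_r"), "koi8_r"),
}

# Transform every page into Unicode text once, at indexing time.
index = {name: raw.decode(encoding) for name, (raw, encoding) in pages.items()}


def search(query):
    """Return the names of pages whose Unicode text contains the query."""
    return sorted(name for name, text in index.items() if query in text)


# Searching the Unicode text finds the word on both pages.
hits = search("мир")

# Searching the raw bytes with one encoding's byte pattern misses the other page.
needle = "мир".encode("cp1251")
raw_hits = [name for name, (raw, _) in pages.items() if needle in raw]

print(hits, raw_hits)
```

Once the text is in Unicode, a single comparison routine serves both pages. A page that is already published in a Unicode encoding such as UTF-8 skips the conversion step and the risk of a wrong encoding guess.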

—This material was adapted from Mark Davis' Foreword to The Unicode Standard, Version 5.0.

