Big Dots, Little Dots, and Circled Dots: How Unicode can help (and hurt) the process of converting documents to information

Benson Margulies - Basis Technology Corporation

Intended Audience: Managers, Software Engineers
Session Level: Intermediate

Government agencies, and the integrators and forensic software companies that serve them, face more and more documents in languages other than English and in scripts other than the Latin alphabet. Traditional approaches that discard or transcribe foreign-language text can't do the job. Unicode is an important component of software systems that work properly in other languages and scripts. Simply converting all the incoming data to Unicode, however, does not make the same old algorithms and approaches work. Foreign-language data in general, and Unicode text in particular, poses a set of novel challenges to analytical and forensic software. Addressing these issues head-on will allow you to find the information in the mass of data.