19th International Unicode Conference, San Jose, September 2001
A composite approach to language/encoding detection
Shanjian Li (shanjian@netscape.com)
Katsuhiko Momoi (momoi@netscape.com)
Netscape Communications Corp.
1. Summary:
This paper presents three types of auto-detection methods to determine encodings of documents without
explicit charset declaration. We discuss the merits and demerits of each method and propose a composite
approach in which all three types of detection methods are used in such a way as to maximize their strengths
and complement the other detection methods. We argue that auto-detection can play an important role in
helping transition browser users from frequent use of a character encoding menu to a more desirable
state where an encoding menu is rarely, if ever, used. We envision that the transition to Unicode would
have to be transparent to users. Users need not know how characters are displayed as long as they are
displayed correctly -- whether in a native encoding or one of the Unicode encodings. A good auto-detection
service would help significantly in this effort, as it takes most encoding issues out of the user's
concern.
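As an illustration of the composite idea described above, such a detector can be thought of as several probers run in parallel, with the answer taken from the most confident one. The following sketch is not from the paper; the class names and the simple highest-confidence rule are illustrative assumptions only.

```python
from typing import Protocol

class CharsetProber(Protocol):
    """One detection method: consumes raw bytes, reports a best guess and a confidence."""
    def feed(self, data: bytes) -> None: ...
    def charset(self) -> str: ...
    def confidence(self) -> float: ...   # 0.0 .. 1.0

class CompositeDetector:
    """Runs several probers on the same data and answers with the most confident one."""
    def __init__(self, probers: list[CharsetProber]):
        self.probers = probers

    def feed(self, data: bytes) -> None:
        for p in self.probers:
            p.feed(data)

    def guess(self) -> tuple[str, float]:
        best = max(self.probers, key=lambda p: p.confidence())
        return best.charset(), best.confidence()
```

How each type of prober arrives at its confidence value is the subject of the rest of this paper.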
2. Background:
Since the beginning of the computer age, many encoding schemes have been created to represent various
writing scripts/characters for computerized data. With the advent of globalization and the development of the
Internet, information exchanges crossing both language and regional boundaries are becoming ever more
important. But the existence of multiple coding schemes presents a significant barrier. The development of
Unicode has provided a universal coding scheme, but it has not so far replaced existing regional coding
schemes for a variety of reasons, in spite of the fact that many W3C and IETF recommendations list UTF-8 as
the default encoding, e.g. XML, XHTML, RDF, etc. Thus, today's global software applications are
required to handle multiple encodings in addition to supporting Unicode.
The current work has been conducted in the context of developing an Internet browser. To deal with the
variety of languages using different encodings on the web today, a lot of effort has been expended. In order
to get the correct display result, browsers should be able to utilize the encoding information provided by
http servers, web pages, or end users via a character encoding menu. Unfortunately, this type of information
is missing from many http servers and web pages. Moreover, most average users are unable to provide this
information via manual operation of a character encoding menu. Without this charset information, web pages
are sometimes displayed as 'garbage' characters, and users are unable to access the desired information. This
also leads many users to conclude that their browser is malfunctioning or buggy.
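To make the two declaration points mentioned above concrete, here is a minimal sketch (not part of the paper; the function name and regular expressions are illustrative assumptions) of how a client might look for a charset declared in the HTTP Content-Type header or in an HTML meta tag before falling back to auto-detection or the user's menu choice:

```python
import re

def declared_charset(content_type_header: str, html_head: bytes) -> str | None:
    """Return the charset declared by the server or the page author, if any."""
    # 1. HTTP server, e.g.: Content-Type: text/html; charset=Shift_JIS
    m = re.search(r"charset=([\w.:-]+)", content_type_header, re.I)
    if m:
        return m.group(1)
    # 2. Page author, e.g.: <meta http-equiv="Content-Type" content="text/html; charset=EUC-JP">
    m = re.search(rb'charset=["\']?([\w.:-]+)', html_head, re.I)
    if m:
        return m.group(1).decode("ascii")
    # 3. Neither source supplies it; an auto-detector (or the user's menu choice) must decide.
    return None
```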
As more Internet standard protocols designate Unicode as the default encoding, there will undoubtedly be a
significant shift toward the use of Unicode on web pages. Good universal auto-detection can make an