Delphi Unicode Migration for Mere Mortals:
Stories and Advice from the Front Lines
Cary Jensen, Jensen Data Systems, Inc.
December 2009
(updated October 2010)
Delphi Unicode Migration for Mere Mortals: Stories & Advice from the Front Lines
Embarcadero Technologies - 1 -
SUMMARY
With the release of Embarcadero® RAD Studio XE (and beginning with the release of RAD
Studio 2009), Embarcadero Technologies has empowered you, the Delphi® and C++Builder®
developer, to deliver first class, Unicode-enabled applications to your
customers. While this important development is opening new markets for your software, in
some cases it presents a challenge for existing applications and development techniques,
especially where code has included assumptions about the size of strings.
This paper aims to guide your Unicode migration efforts by sharing the experiences and
insights of numerous Delphi developers who have already made the journey. It begins with
a general introduction of the issues, followed by a brief overview of Unicode basics. This is
followed by a systematic look at the various aspects of your applications that may require
attention, with examples and suggestions based on real world experience. A list of
references that may aid your Unicode migration efforts can be found at the end of this
paper.
INTRODUCTION
Embarcadero introduced full Unicode support in RAD Studio for the first time in August of
2008. In doing so, they ensured that Delphi and C++Builder would remain at the forefront
of native application development on the Windows platform for a very long time to come.
However, unlike many of the other major enhancements that have been introduced in
Delphi over the years, such as variants and interfaces (Delphi 3), frames (Delphi 5), function
inlining and nested classes (Delphi 2005) and generics (Delphi 2009), enabling Unicode
didn't involve simply adding new features to what was already supported in Delphi.
Instead, it involved a radical change to several fundamental data types that appear in
nearly every Delphi application. Specifically, the definitions for the String, Char, and PChar
types changed.
These changes were not adopted lightly. They were introduced only after extensive
consideration of the impact they would have on existing applications, as well as how
they would affect future development. In addition,
Embarcadero sought the input and advice of many of its Technology Partners who support
and promote Delphi.
In reality, there was no way to implement the Unicode support without some
inconvenience. As one of the contributors to this paper, who requested that I refer to him
simply as Steve, noted, "I think PChars and Strings should never have changed meaning.
... Having said that, any choice the developers of Delphi made would have been criticized.
It was a bit of a no-win situation."
In the end, changing the meaning of String, Char, and PChar was determined to be the
least disruptive path, though not without consequences. On the plus side, Embarcadero
instantly enabled RAD Studio developers to build world class applications that treat both
the graphical interfaces and the data they help manipulate in a globally-conscious manner,
removing substantial barriers to building and deploying applications in an increasingly
global marketplace.
But there was a downside as well. The changes to String, Char, and PChar introduced
potential problems, significant or otherwise, for the migration of applications, libraries,
shared units, and time-tested techniques from earlier versions of Delphi/C++Builder.
Let's be realistic about this. Nearly every upgrade of an existing application can potentially
encounter migration issues that require changes to the existing code or require upgrades
to newer versions of third-party component sets or libraries. The same is true when
upgrading to Delphi 2009 or later. Some upgrades will be easier, and some will be more
challenging.
And now we get to the real point of this paper. Because of the changes to several
fundamental data types, data types that we have relied upon since Delphi 1 (Char and
PChar) or Delphi 2 (String), it is fair to say that migrating an existing application to Delphi
2009 or later requires more effort than any previous migration.
Contributor Roger Connell of Innova Solutions Pty Ltd offered this observation, "While
[the Delphi team has], in my view, done a sterling job [adding Unicode support, this] has
been the most challenging (in fact the only really challenging) Delphi migration."
Fortunately, there are solutions for every challenge you will encounter, and this paper is
here to help.
I began this project by asking the Delphi community for their input. Specifically, I asked
developers who successfully migrated their existing applications to Delphi 2009 and later
to share their insights, advice, and stories of Unicode migration. What I received in
response was fascinating.
The developers who responded represent nearly every category of developer you can
imagine. Some are independent developers while others are members of a development
team. Some produce vertical market products, some build in-house applications, and
some publish highly popular third-party component sets and tools used by application
developers. Yet others are highly respected authorities on Delphi, developers who speak
at conferences and write the books most of us have read.
Their stories, advice, and approaches were equally varied. While some described
migration projects that were rather straightforward, others found the migration process
difficult, especially in the cases of applications that have been around for a long time, and
included a wide variety of techniques and solutions.
Regardless of whether a particular migration was smooth or challenging, a set of common
approaches, practical solutions, and issues to consider emerged, and I look forward to
sharing those with you.
But the story does not end with the publication of this white paper. I hope to continue to
collect Unicode migration success stories, and update this paper sometime in the future.
As a result, if you are inspired by what you read, and have a story of your own that
complements or extends what you read here, consider becoming a contributor yourself. I'll
say more about this at the end of this paper.
In the next section, I provide a brief summary of basic Unicode definitions and
descriptions. If you are already familiar with Unicode, have a basic understanding of UTF-8
and UTF-16, and know the difference between code pages and code points, you should
either skip this section or quickly skim it for terms you are unfamiliar with.
But before we continue, there is one more point that I want to make. RAD Studio's support
for Unicode has two complementary, though distinct, implications for those applications
you build. The first is related to how strings are treated differently in code written in Delphi
2009 and later versus how they are treated in earlier versions of Delphi. The second relates
to localization, the process of adapting software to the language and culture of a market.
This paper is designed specifically to address the first of these two concerns.
Implementing support for multiple languages and character sets is beyond the scope of
this paper, and will not be discussed further.
WHAT IS UNICODE?
Unicode is a standard specification for encoding all of the characters and symbols of all of
the world's written languages for storage, retrieval, and display by digital computers.
Similar to the ANSI (American National Standards Institute) code standard character set,
which represents both control characters (such as tab, line feed, and form feed) and
printable characters of the 26-character Latin alphabet, Unicode assigns at least one
unique number to every character.
Also like the ANSI code standard, Unicode represents many types of symbols, such as
those for currency, scientific and mathematical notation, and other types of exotic
characters. In order to reference such a large number of symbols (the Unicode code space
has room for more than a million code points), Unicode characters can require up to
4 bytes (32 bits) of data. By
comparison, the ANSI code standard is based on 8-bit encoding, which limits it to 256
different characters at a time.
Each control character, character, or symbol in Unicode is assigned a numeric value, called
its code point. The code point for a given character, once assigned by the Unicode
Technical Committee, is immutable. For example, the code point for ‘A’ is 65 ($0041 hex,
which in Unicode notation is represented as U+0041). Each character is also assigned a
unique, immutable name, which in this case is ‘LATIN CAPITAL LETTER A.’ Both of these
can never be changed, ensuring that today’s encoding can be relied upon indefinitely.
Each code point can be represented in one, two, or four bytes, with the bulk of
common code points (64K worth) fitting in two bytes or fewer.
In Unicode terms, these first 64K code points are referred to as the Basic Multilingual Plane, or
BMP (you'll want to remember these initials, as they will come up a lot in this paper).
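As a rough illustration of code points and the BMP, the following console sketch (added here, not taken from the original paper) assumes Delphi 2009 or later, where Char is a two-byte UTF-16 element. It prints the code point of 'A' and shows that a character outside the BMP occupies two Char elements, known as a surrogate pair, in a Delphi string. The choice of U+1D11E (MUSICAL SYMBOL G CLEF) and the program and variable names are purely illustrative.

```delphi
program CodePointSketch;

{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  BmpChar: Char;
  NonBmp: string;
begin
  BmpChar := 'A';
  // Code point of 'A' in decimal and in hexadecimal (U+0041)
  Writeln(Ord(BmpChar));                   // 65
  Writeln(IntToHex(Ord(BmpChar), 4));      // 0041
  // In Delphi 2009 and later, Char is a 16-bit UTF-16 code unit
  Writeln(SizeOf(Char));                   // 2

  // U+1D11E lies outside the BMP, so UTF-16 stores it as a
  // surrogate pair: $D834 followed by $DD1E
  NonBmp := #$D834 + #$DD1E;
  // Length counts Char elements (code units), not code points
  Writeln(Length(NonBmp));                 // 2
  Writeln(Length(NonBmp) * SizeOf(Char));  // 4 bytes of character data
end.
```

Code that assumes one Char per character, or one byte per Char, is the kind of string-size assumption that tends to need attention during a migration.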
To make things somewhat more complicated, the Unicode standard allows some
characters to be represented by two or more consecutive code points. These characters
are referred to as composite, or decomposable, characters.
For example, the character ö can be represented as $00F6. This character is referred to as
a precomposed character. However, it can also be represented by the o character ($006F)
followed by the diaeresis (¨) character ($0308). The Unicode processing rules compose
these two characters together to make a single character.
This is demonstrated in the following code segment:
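var
  s: String;
begin
  ListBox1.Items.Clear;
  // Precomposed form: a single code point, U+00F6
  s := #$00F6;
  ListBox1.Items.Add('ö');
  ListBox1.Items.Add(s);
  // The code point of ö as a decimal number (246 for U+00F6)
  ListBox1.Items.Add(IntToStr(Ord('ö')));
  // Decomposed form: o (U+006F) followed by the combining diaeresis (U+0308)
  s := #$006F + #$0308;
  ListBox1.Items.Add(s);
end;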
The purpose of composite characters is to permit a finer-grained analysis of the contents of a
Unicode file. For example, a researcher who wanted to count how often the diaeresis (¨)
diacritic is used, regardless of which character it appears over, could decompose
all characters that use it, thereby making the counting process straightforward.
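One practical consequence is that the precomposed and decomposed forms of a character look the same on screen but are different strings. The following console sketch (an illustration added here, assuming Delphi 2009 or later; the program and variable names are hypothetical) shows that the two forms differ in Length, and that the = operator, which compares string contents element by element, does not treat them as equal.

```delphi
program ComposedVsDecomposed;

{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  Precomposed, Decomposed: string;
begin
  Precomposed := #$00F6;           // single code point U+00F6
  Decomposed := #$006F + #$0308;   // o plus combining diaeresis

  Writeln(Length(Precomposed));    // 1
  Writeln(Length(Decomposed));     // 2
  // The = operator performs an ordinal comparison, so the two
  // representations of the same visible character are not equal
  Writeln(Precomposed = Decomposed);  // FALSE
end.
```

Code that compares or searches user-visible text may therefore need to settle on one form before comparing.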
Although all currently assigned code points (as well as all imaginable future code points)
can be reliably represented by four bytes, it does not make sense in all cases to represent
each character with this much memory. Most English speakers, for example, use a rather
small set of characters (fewer than 100 or so).
As a result, Unicode also specifies a number of different encoding standards for
representing code points, each offering trade-offs in consistency, processing, and storage
requirements. Of these, the ones that you will run into most often in Delphi are UTF-8,
UTF-16, and UTF-32. (UTF stands for Unicode Transformation Format or UCS