This file is intended to clarify why and how we are using Unicode in gocr. It's
probably only interesting if you intend to do something similar in a project of
yours or to develop gocr.
History
0.1 initial version
---
Why use Unicode? In this early development stage gocr doesn't recognize much
more than the ASCII characters, but we hope that someday it will support many
different languages with different character sets, recognize mathematical
expressions, and so on. Even now we are trying to support other Latin
languages, that is, accented characters. Once we go beyond ASCII and use the
0x80-0xFF range, the meaning of each code depends on the character set loaded
on the machine, so we had to solve that problem.
Mindful of what Andrew Tanenbaum once said, "The good thing about standards is
that there are so many to choose from", we decided not to invent a new one and
to stick to an existing one; Unicode is the best known, so we chose it.
To my dismay, Unicode support is poor at this time. Contrary to what one would
expect, there are few libraries around to deal with it. The libraries I found,
though very good, did not provide the kind of support we needed in gocr:
working internally with hundreds of different characters. They were all focused
on handling external files and user interfaces (i18n, in short), something that
I'm sure is much more needed and used than what gocr needs.
That's why we wrote our own Unicode code. We implemented only what we needed,
in a way that is practical for the developer: composing characters, etc. No one
I know wants the output of a scanned and OCRed document in raw Unicode or UTF-8
format (though gocr can output in these formats too, and I hope that one format
will eventually be used in every OS and computer around, and ASCII will go to a
museum). So we had to output in some friendlier format; the choices are
existing character maps, TeX, SGML and HTML initially, and who knows what else
later. Once we can recognize the text and keep the formatting, these formats
will be desired even more.
How to implement it (careful: developer's view)? Fortunately, there is partial
support for it now. The wchar_t type defined in <stddef.h> is standard (though
its width varies: sometimes 16 bits, sometimes 32, perhaps even 64 somewhere).
Do we need libc's string functions? If we do, they also exist for wchar_t.
Some conversion functions were needed: ASCII -> Unicode, and Unicode ->
everything else.
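As a small illustration of the point about wchar_t, the sketch below checks the
width of wchar_t on the current platform and uses the standard wide-character
counterparts of strlen() and strcpy() from <wchar.h>. The helper names
wchar_bits() and copy_wide() are made up for this example and are not part of
gocr.

```c
#include <limits.h>   /* CHAR_BIT */
#include <stddef.h>   /* wchar_t, size_t */
#include <wchar.h>    /* wcslen, wcscpy */

/* Width of wchar_t in bits on this platform (commonly 16 or 32).
 * Hypothetical helper, not a gocr function. */
static size_t wchar_bits(void)
{
    return sizeof(wchar_t) * CHAR_BIT;
}

/* Copy the wide string src into dst, which holds cap wide characters.
 * Returns 0 on success, -1 if src (plus terminator) does not fit.
 * Hypothetical helper, not a gocr function. */
static int copy_wide(wchar_t *dst, size_t cap, const wchar_t *src)
{
    if (wcslen(src) + 1 > cap)
        return -1;
    wcscpy(dst, src);   /* wide-character counterpart of strcpy() */
    return 0;
}
```

Code that stores Unicode values above 0xFFFF should check the width first,
since a 16-bit wchar_t cannot hold them in a single element.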
The ASCII -> Unicode conversion (done by the compose() function) is written to
be called by the OCR engine when it recognizes a character. You can also use
the Unicode codes #defined in unicode.h, but the compose function is simpler
to use. For ASCII codes it's recommended to use the symbols themselves (no
need to write LATIN_CAPITAL_LETTER_A, use 'A').
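To make the idea of composing concrete, here is a minimal sketch of how a base
letter and an accent could be combined into one Unicode code point with a
lookup table. It is not gocr's actual compose(): the name sketch_compose(), its
signature, the SK_* macros and the tiny table are assumptions made only for
this example. The code points themselves are standard Unicode values.

```c
#include <stddef.h>   /* wchar_t, size_t */

/* Hypothetical modifier codes (spacing accents), not gocr's definitions. */
#define SK_ACUTE_ACCENT 0x00B4
#define SK_GRAVE_ACCENT 0x0060

struct sk_combo {
    wchar_t base;       /* plain ASCII letter */
    wchar_t modifier;   /* accent to apply */
    wchar_t result;     /* precomposed Unicode code point */
};

/* Deliberately incomplete table; a real one would cover many more pairs. */
static const struct sk_combo sk_table[] = {
    { L'a', SK_ACUTE_ACCENT, 0x00E1 },   /* a with acute */
    { L'e', SK_ACUTE_ACCENT, 0x00E9 },   /* e with acute */
    { L'E', SK_ACUTE_ACCENT, 0x00C9 },   /* E with acute */
    { L'a', SK_GRAVE_ACCENT, 0x00E0 },   /* a with grave */
    { L'e', SK_GRAVE_ACCENT, 0x00E8 },   /* e with grave */
};

/* Return the composed character, or base unchanged when the pair is not
 * in the table (including modifier == 0, meaning "no accent"). */
static wchar_t sketch_compose(wchar_t base, wchar_t modifier)
{
    size_t i;
    for (i = 0; i < sizeof sk_table / sizeof sk_table[0]; ++i) {
        if (sk_table[i].base == base && sk_table[i].modifier == modifier)
            return sk_table[i].result;
    }
    return base;
}
```

With such a function, the recognizer only needs to report the letter and the
accent it saw, for example sketch_compose('e', SK_ACUTE_ACCENT), and plain
ASCII characters pass through as themselves.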
The Unicode -> etc. conversion (done by decode()) is sometimes a bit more
difficult, since previous symbols may interact with the current one. For
example, if you're converting to TeX, two consecutive characters that belong in
math mode will each enter math mode separately, giving "\( \pi \) \( \iota \)"
instead of "\( \pi \iota \)". A wider conversion function, decode_text(), which
deals with the entire text at once, should possibly be provided; this function
would also create headers, etc.
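The math-mode problem can be sketched as follows: a whole-text converter keeps
a flag that remembers whether the previous symbol was already in math mode, and
only opens or closes "\( ... \)" when that state changes. This is not gocr's
decode_text(); the function names, the signature and the two-entry symbol table
are assumptions for illustration, and the sketch neither escapes TeX special
characters nor writes any headers.

```c
#include <stddef.h>   /* wchar_t, size_t, NULL */
#include <string.h>   /* strlen, memcpy */

/* TeX math-mode name for a code point, or NULL if it is not a math symbol.
 * Hypothetical two-entry table: U+03C0 (pi) and U+03B9 (iota). */
static const char *tex_math_name(wchar_t c)
{
    switch (c) {
    case 0x03C0: return "\\pi";
    case 0x03B9: return "\\iota";
    default:     return NULL;
    }
}

/* Append s to out (capacity cap, current length *len); -1 if it won't fit. */
static int append(char *out, size_t cap, size_t *len, const char *s)
{
    size_t n = strlen(s);
    if (*len + n + 1 > cap)
        return -1;
    memcpy(out + *len, s, n + 1);
    *len += n;
    return 0;
}

/* Convert a whole wide string to TeX, sharing one math group between
 * adjacent math symbols. Returns 0 on success, -1 if out is too small. */
static int sketch_decode_text_tex(const wchar_t *text, char *out, size_t cap)
{
    size_t len = 0;
    int in_math = 0;   /* 1 while a "\( " group is open */

    if (cap == 0)
        return -1;
    out[0] = '\0';
    for (; *text != 0; ++text) {
        const char *name = tex_math_name(*text);
        if (name != NULL) {
            if (!in_math && append(out, cap, &len, "\\( ") != 0)
                return -1;
            in_math = 1;
            if (append(out, cap, &len, name) != 0 ||
                append(out, cap, &len, " ") != 0)
                return -1;
        } else {
            char plain[2];
            if (in_math && append(out, cap, &len, "\\)") != 0)
                return -1;
            in_math = 0;
            /* ASCII passes through; anything else unknown becomes '?'. */
            plain[0] = (*text > 0 && *text < 0x80) ? (char)*text : '?';
            plain[1] = '\0';
            if (append(out, cap, &len, plain) != 0)
                return -1;
        }
    }
    if (in_math && append(out, cap, &len, "\\)") != 0)
        return -1;
    return 0;
}
```

Because the state lives in the loop rather than in each per-character call, the
two symbols pi and iota end up inside a single math group. A space between two
math symbols still closes the group in this sketch; a fuller version would have
to decide how to treat such separators.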