gocr-0.49.rar_GOCR_OCR_go_ocr open-source code resource - CSDN Library

116 files in total

c: 22

h: 17

txt: 11

ocr

ocr open-source code

Uploaded 2022-09-19 16:54:33 · 399 KB · RAR

gocr-0.49.rar_GOCR _OCR_go_ocr open-source code (116 files)

gocr.1 5KB

gocr.1~ 5KB

.#Makefile.1.22 5KB

.#Makefile.1.6 676B

AUTHORS 243B

make.bat 2KB

BUGS 2KB

ocr0.c 292KB

pgm2asc.c 113KB

ocr0n.c 67KB

barcode.c 52KB

unicode.c 48KB

detect.c 40KB

remove.c 28KB

pnm.c 18KB

database.c 17KB

pixel.c 17KB

box.c 14KB

lines.c 13KB

gocr.c 11KB

otsu.c 11KB

output.c 11KB

list.c 9KB

pcx.c 6KB

jconv.c 4KB

progress.c 3KB

tga.c 3KB

ocr1.c 3KB

job.c 2KB

configure 136KB

create_db 1KB

CREDITS 784B

.cvsignore 68B

.cvsignore 29B

.cvsignore 20B

.cvsignore 16B

example.dtd 2KB

font2.fig 2KB

inverse.fig 660B

color.fig 660B

ex.fig 624B

rotate45.fig 317B

unicode_defs.h 55KB

gocr.h 12KB

pgm2asc.h 3KB

list.h 3KB

ocr0.h 2KB

unicode.h 2KB

progress.h 2KB

output.h 1KB

otsu.h 1003B

config.h 980B

pnm.h 887B

amiga.h 841B

pcx.h 232B

barcode.h 165B

tga.h 160B

ocr1.h 99B

version.h 64B

HISTORY 13KB

gocr.html 21KB

gpl.html 20KB

config.h~ 1KB

version.h~ 64B

Makefile.in 6KB

Makefile.in 3KB

configure.in 3KB

config.h.in 980B

Makefile.in 690B

Makefile.in 463B

INSTALL 2KB

install-sh 68B

config.h.in~ 987B

Makefile.in~ 395B

handwrt1.jpg 42KB

matrix.jpg 5KB

Makefile 6KB

Makefile 5KB

Makefile 3KB

Makefile 676B

Makefile 480B

barcode.c.orig 25KB

ocr-a-subset.png 3KB

ocr-b.png 2KB

ocr-a.png 1KB

5x8.png 1KB

5x7.png 1KB

4x6.png 977B

README 7KB

handwrt1.readme 225B

REVIEW 21KB

score 987B

gocr_chk.sh 4KB

gocr_chk.sh~ 4KB

gocr.spec 4KB

gocr.tcl 16KB

font1.tex 2KB

text.tex 1KB

This file explains why and how gocr uses Unicode. It is probably only of interest if you intend to do something similar in a project of your own, or to develop gocr.

History

0.1 initial version

Why to use Unicode?

At this early stage of development gocr recognizes little more than the ASCII characters, but we hope that someday it will support many languages with different character sets, recognize mathematical expressions, and so on. Even now we are trying to support other Latin-script languages, that is, accented characters.

Once we go beyond ASCII and use the characters 0x80-0xFF, their meaning depends on the character set loaded on the machine, so we had to solve that problem. Despite Andrew Tanenbaum's remark that "The good thing about standards is that there are so many to choose from", we decided not to invent a new one and to stick with an existing standard. Unicode is the best known, so we chose it.

To my dismay, Unicode support is poor at this time. Contrary to what one would expect, there are few libraries around to deal with it. The libraries I found, though very good, did not provide the kind of support gocr needs: working internally with hundreds of different characters. They all focused on handling external files and user interfaces (i18n, in short), which I am sure is needed and used far more than what gocr needs.

That is why we wrote our own Unicode code. We implemented only what we needed, in a form that is practical for the developer: composing characters, and so on.
No one I know will want the output of their scanned and OCRed document in Unicode or UTF-8 format (though gocr can output in these formats too, and I hope that one format will eventually be used in every OS and computer around, and that ASCII will go to a museum). So we had to output in some friendlier format. The initial choices are existing character maps, TeX, SGML and HTML, and who knows what else later. Once we can recognize the text and keep the formatting, these formats will be desired even more.

How to implement it (careful: developer's view)?

Fortunately, there is partial support for it now. The wchar_t type defined in <stddef.h> is standard, although its width is not: sometimes 16 bits, sometimes 32, perhaps even 64 somewhere. Do we need the libc's string functions? If we do, they also exist for wchar_t.

Some conversion functions were needed: ASCII -> Unicode, and Unicode -> everything else.

The ASCII -> Unicode conversion (done by the compose() function) is written to be called by the OCR engine when it recognizes a character. You can also use the Unicode codes #defined in unicode.h, but compose() is simpler to use. For ASCII codes, use the character itself: there is no need to write LATIN_CAPITAL_LETTER_A, just use 'A'.

The Unicode -> etc. conversion (done by decode()) is sometimes a bit more difficult, since previous symbols may interact with the current one. For example, when converting to TeX, two consecutive math-mode characters will each open math mode separately, giving "\( \pi \) \( \iota \)" instead of "\( \pi \iota \)". A wider conversion function, decode_text(), that deals with the entire text at once should possibly be provided; it would also create headers, etc.
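The two conversion directions described above can be illustrated with a small sketch. This is not gocr's actual code: the names sk_compose(), sk_tex_symbol(), sk_append() and sk_decode_text_tex(), the SK_ACUTE and SK_GRAVE macros, and the tiny lookup tables are all hypothetical, chosen only to show the idea. The first function merges an ASCII base letter and a modifier into one Unicode code point; the last one converts a whole string to TeX and, because it sees the neighbouring symbols, opens math mode only once per run of math characters.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Real Unicode combining marks, used here as hypothetical modifier codes. */
#define SK_ACUTE 0x0301 /* COMBINING ACUTE ACCENT */
#define SK_GRAVE 0x0300 /* COMBINING GRAVE ACCENT */

/* ASCII -> Unicode: merge a base letter and a modifier into one precomposed
   code point. Pairs missing from this tiny table return the base unchanged. */
static wchar_t sk_compose(wchar_t base, wchar_t modifier)
{
    if (modifier == SK_ACUTE) {
        if (base == 'a') return 0x00E1;
        if (base == 'e') return 0x00E9;
    } else if (modifier == SK_GRAVE) {
        if (base == 'a') return 0x00E0;
        if (base == 'e') return 0x00E8;
    }
    return base;
}

/* TeX spelling of one code point, or NULL if it is not in the table.
   *math is set to 1 when the symbol is only valid in math mode. */
static const char *sk_tex_symbol(wchar_t c, int *math)
{
    *math = 0;
    switch (c) {
    case 0x00E0: return "\\`a";
    case 0x00E1: return "\\'a";
    case 0x00E8: return "\\`e";
    case 0x00E9: return "\\'e";
    case 0x03B9: *math = 1; return "\\iota";
    case 0x03C0: *math = 1; return "\\pi";
    default:     return NULL;
    }
}

/* Append s to out without writing past cap bytes (truncates if needed). */
static void sk_append(char *out, size_t cap, const char *s)
{
    size_t used = strlen(out);
    if (used + 1 < cap)
        strncat(out, s, cap - used - 1);
}

/* Unicode -> TeX for a whole string. Math mode is opened once per run of
   math symbols, so neighbours share a single "\( ... \)" pair. */
static void sk_decode_text_tex(const wchar_t *in, char *out, size_t cap)
{
    int in_math = 0;
    assert(cap > 0);
    out[0] = '\0';
    for (; *in != 0; in++) {
        int math;
        char plain[2] = { '?', '\0' };  /* fallback for unknown code points */
        const char *s = sk_tex_symbol(*in, &math);
        if (s == NULL) {
            if (*in > 0 && *in < 0x80)
                plain[0] = (char)*in;   /* ASCII passes through unchanged */
            s = plain;
        }
        if (math && !in_math) sk_append(out, cap, "\\( ");
        if (!math && in_math) sk_append(out, cap, "\\)");
        in_math = math;
        sk_append(out, cap, s);
        if (math) sk_append(out, cap, " ");
    }
    if (in_math) sk_append(out, cap, "\\)");
}
```

Calling sk_decode_text_tex() on the two code points U+03C0 and U+03B9 produces "\( \pi \iota \)", the merged form the text asks for, whereas a purely per-character converter has no way to know that the next symbol is also in math mode. The sketch does not escape characters that are special in TeX (such as % or \); a real converter would have to.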