Python Optical Character Recognition of CMC7 Codes on Bank Checks Using SVM Machine Learning from the Scikit-learn Library

35 files in total

npy: 11

py: 9

pyc: 7


scikit-learn

support vector machine

machine learning

python

169 views, uploaded 2024-02-05 11:31:51, 421 KB ZIP


OCR-CMC7-master.zip (35 files)

OCR-CMC7-master

CMC7training.py 1KB

backup

trainingData.txt 670KB

tests

test4.jpeg 80KB

test3.jpeg 78KB

test1.jpeg 78KB

test2.jpeg 82KB

utils

parameters.py 693B

Matrix01.py 2KB

CMC7.py 1KB

Matrix.py 664B

CMC7data.py 643B

__pycache__

Matrix.cpython-35.pyc 1KB

Picture.cpython-35.pyc 1KB

parameters.cpython-35.pyc 587B

__init__.cpython-35.pyc 157B

Matrix01.cpython-35.pyc 3KB

toTrainingData.cpython-35.pyc 801B

CMC7.cpython-35.pyc 1KB

Picture.py 932B

brain

brain.pkl_10.npy 704B

brain.pkl_07.npy 80B

brain.pkl_01.npy 132B

brain.pkl_08.npy 2KB

brain.pkl_11.npy 527KB

brain.pkl_09.npy 184B

brain.pkl_03.npy 38KB

brain.pkl_02.npy 38KB

brain.pkl_06.npy 184B

brain.pkl_05.npy 80B

brain.pkl_04.npy 704B

brain.pkl 1KB

toDataFile.py 634B

extractCM7.py 956B

README.md 5KB

trainingfiles

trainingData.txt 670KB

# OCR-CMC7

## Rafael CARTENET

Python Optical Character Recognition of CMC7 codes on bank checks, using SVM machine learning from the Scikit-learn library.

## Libraries used

- sklearn
- numpy
- PIL

## Purpose

From a JPEG scan of an A4 page containing a bank check, extract the CMC7 code. The goal was to automate the extraction of the CMC7 code of a bank check, which is used to check validity (mainly used in Europe).

More about CMC7: https://en.wikipedia.org/wiki/Magnetic_ink_character_recognition

## Method

### Modelling Data

I started with 60 sample bank checks, all with the same format and from the same bank. The goal was to extract the CMC7 code from a new scan of a bank check.

I first located the box where the CMC7 code is situated and chose the coordinates of its top-left and bottom-right corners. On all 60 bank checks, the CMC7 code is inside that box, with a white background and black digits/symbols.

I then transformed the cropped picture (according to the box) into a 01 matrix based on the gray scale. For each box I computed the lightest pixel and the darkest pixel; this way I could cancel out the contrast, which can differ from one bank check to another. I defined D as the midpoint between the darkest pixel and the lightest pixel, which then serves as my decision value. If a pixel is lower than D, it becomes 0; if it is above, it becomes 1. The result is a matrix of 0s and 1s where 1 represents black and 0 represents white, which makes the digits much easier to handle.

### Extract the digits

On the bank checks I used there were 35 digits, but that is just a parameter. Analysing the different 01 matrices I got, I found that the height and the width of the digits are always the same, since the font is always the same.
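The midpoint thresholding described under Modelling Data can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the function name `to_01_matrix` is hypothetical, and it assumes the usual 8-bit grayscale convention where 0 is black, so that pixels darker than D map to 1 (black) as in the final matrix described above. If the grayscale values are inverted in the actual pipeline, the comparison would flip.

```python
import numpy as np


def to_01_matrix(gray):
    """Binarize a 2-D grayscale crop with a per-image midpoint threshold.

    Hypothetical helper: the name and signature are illustrative only.
    """
    gray = np.asarray(gray, dtype=float)
    darkest, lightest = gray.min(), gray.max()
    d = (darkest + lightest) / 2.0  # decision value D
    # Assumption: low values are dark, so pixels below D are ink (1).
    return (gray < d).astype(np.uint8)


# Two synthetic crops with the same pattern but different contrast.
high_contrast = np.array([[250, 10, 250], [250, 20, 240]])
low_contrast = np.array([[180, 90, 180], [175, 95, 170]])

print(to_01_matrix(high_contrast))
print(to_01_matrix(low_contrast))
```

Because D is recomputed for every crop, both synthetic crops produce the same 01 matrix even though their absolute gray levels differ, which is the contrast-cancelling effect the text describes.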
I wrote two functions that allowed me to detect my digits:

- cutleft (deletes the all-zero columns on the left of a given 01 matrix)
- removebottom (deletes the all-zero rows at the bottom of a given 01 matrix)

I apply cutleft to my 01 matrix, so that the first column is the beginning of the first digit. Since I know how wide a digit is, I cut out of the "mother" 01 matrix a new 01 matrix made of the columns containing the first digit and only that digit. I then apply removebottom to this matrix, which ensures that the digit is always aligned with the bottom-left corner. As I know the height of a digit, I simply delete the rows above it, which are not useful. Having extracted the 01 matrix of the first digit, I crop it off the "mother" 01 matrix and repeat the process as many times as there are digits.

Once a digit is extracted, it can simply be converted into a list of 0s and 1s (of length digitH x digitW). This list is the input for the machine learning later on, which means that each bank check yields as many vectors as there are digits in its CMC7 code.

### Creation of Labeled data

Once I was able to extract nbDigits vectors representing my digits, I created labeled data, that is, the classification of each vector. I created a TrainingFile.txt in this format:

`0 1 0 1 ..... 0 1 0 0 1 X`
`1 1 0 0 ..... 1 1 1 0 0 Y`
`0 1 0 1 ..... 0 1 1 1 1 Z`

where X, Y and Z are the values of the digits associated with each 0 1 list.

### Training the model

I used the SVM method from scikit-learn to classify digits. I divide my labeled data into two groups, training data and testing data. I first use the training data to train the model, then measure the accuracy of the model on the remaining data.
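A rough sketch of this train/test workflow, using the labeled-line format shown above, might look like the following. It is an illustration under stated assumptions rather than the project's actual script: `parse_labeled_lines` is a hypothetical helper, the data is synthetic (three made-up 12-pixel prototypes with one random pixel flipped), and the linear kernel and the 0.7 split are choices made for the example only.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def parse_labeled_lines(lines):
    """Parse lines of the form '0 1 0 ... 1 X' into (features, labels)."""
    features, labels = [], []
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        features.append([int(t) for t in tokens[:-1]])
        labels.append(tokens[-1])  # kept as a string: labels may be symbols
    return np.array(features), np.array(labels)


# Synthetic stand-in for the training file (assumption, not real check data).
rng = np.random.default_rng(0)
prototypes = {
    "0": [1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1],
    "1": [0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0],
    "7": [1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1],
}
lines = []
for label, proto in prototypes.items():
    for _ in range(20):
        noisy = list(proto)
        flip = rng.integers(len(noisy))  # flip one random pixel
        noisy[flip] = 1 - noisy[flip]
        lines.append(" ".join(map(str, noisy)) + " " + label)

X, y = parse_labeled_lines(lines)

train_proportion = 0.7  # illustrative value
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=train_proportion, random_state=0, stratify=y
)

model = SVC(kernel="linear")  # kernel choice is an assumption
model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))

print("Total digits data :", len(X))
print("Nb of training digits :", len(X_train))
print("Nb of testing digits :", len(X_test))
print("Accuracy :", round(100 * accuracy, 1), "%")
```

Keeping the label as a string rather than an integer leaves room for the non-numeric CMC7 symbols.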
Example of report:

Tests Report:

- Total digits data: 2065 (59 bank checks)
- Nb of training digits (70.0%): 1446
- Nb of testing digits (30.0%): 619
- Accuracy: 100.0%

## Files

### toDataFile.py

Generates labeled data from the bank checks that you put in the trainingfiles subdirectory. Filenames must follow the naming defined in parameters.py. Each digit is represented as a digitW*digitH vector filled with 0s and 1s, depending on whether the pixel is closer to black or white.

Use: `python toDataFile.py`

### CMC7training.py

Using the previously created labeled data, learns with an SVM how to classify the different digits. Trains on trainProportion percent of the labeled data and uses the remainder to test the accuracy of the model.

Use: `python CMC7training.py`

### extractCMC7.py

Extracts the CMC7 code from a JPEG scan, using the previously trained SVM model.

Use: `python extractCMC7.py filename.jpeg`

### Settings

Check parameters.py
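To illustrate how settings such as the digit width, height and count could drive the digit extraction described under Method, here is a small sketch. The names `cut_left`, `remove_bottom`, `extract_digits`, `digit_w`, `digit_h` and `nb_digits` are hypothetical, modelled on the cutleft, removebottom, digitW, digitH and nbDigits mentioned in this README; the repository's own implementation may differ. Padding short glyphs with zeros is an addition made here so that every vector has the same length.

```python
import numpy as np


def cut_left(m):
    """Drop the leading columns that contain only zeros."""
    cols = np.flatnonzero(m.any(axis=0))
    return m[:, cols[0]:] if cols.size else m[:, :0]


def remove_bottom(m):
    """Drop the trailing rows that contain only zeros."""
    rows = np.flatnonzero(m.any(axis=1))
    return m[: rows[-1] + 1, :] if rows.size else m[:0, :]


def extract_digits(mother, nb_digits, digit_w, digit_h):
    """Cut nb_digits fixed-size glyphs off the left of a 01 matrix."""
    vectors = []
    for _ in range(nb_digits):
        mother = cut_left(mother)
        digit = remove_bottom(mother[:, :digit_w])
        digit = digit[-digit_h:, :]  # keep only the bottom digit_h rows
        # Pad with zeros (assumption) so each glyph is digit_h x digit_w,
        # aligned with the bottom-left corner.
        fixed = np.zeros((digit_h, digit_w), dtype=mother.dtype)
        h, w = digit.shape
        fixed[digit_h - h:, :w] = digit
        vectors.append(fixed.flatten().tolist())
        mother = mother[:, digit_w:]  # crop the glyph off the mother matrix
    return vectors


# Toy 01 matrix holding two 3x3 glyphs with blank margins around them.
mother = np.array([
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 1, 0, 0],
    [0, 1, 0, 1, 0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
])

vectors = extract_digits(mother, nb_digits=2, digit_w=3, digit_h=3)
for v in vectors:
    print(v)
```

Each printed list has length `digit_h * digit_w` and could be written out as one line of labeled data once a label is appended.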
