# OCR-CMC7
## Rafael CARTENET
Optical character recognition (OCR) of CMC7 codes on bank checks in Python, using an SVM classifier from the scikit-learn library.
## Libraries used :
- sklearn
- numpy
- PIL
## Purpose
Extract the CMC7 code from a JPEG scan of an A4 page containing a bank check. The goal is to automate reading the CMC7 code, which is used to check the validity of a check. (CMC7 is mainly used in Europe.)
More about CMC7 : https://en.wikipedia.org/wiki/Magnetic_ink_character_recognition
## Method
### Modelling Data
I started with 60 sample bank checks, all in the same format and from the same bank. The goal was to extract the CMC7 code from a new scan of such a check.
I first located the box that contains the CMC7 code and recorded the coordinates of its top-left and bottom-right corners. On all 60 checks the CMC7 code falls inside that box, as black digits and symbols on a white background.
I then converted the cropped picture (the box) into a 0/1 matrix based on its grayscale values:
For each box I found the lightest and the darkest pixel. This cancels out contrast differences, which vary from one check to another.
I defined D as the midpoint between the darkest and the lightest pixel values and used it as the decision value: a pixel darker than D becomes 1 and a pixel lighter than D becomes 0.
The result is a matrix of 0s and 1s where 1 represents black and 0 represents white, which makes the digits much easier to handle.
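The thresholding step can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the function names `load_gray_box` and `to_matrix01` are hypothetical, and it assumes the usual grayscale convention where a lower value means a darker pixel.

```python
import numpy as np


def load_gray_box(path, box):
    """Open a scan, convert it to grayscale and crop it.

    box = (left, top, right, bottom), as expected by PIL's Image.crop.
    """
    from PIL import Image  # imported lazily: only needed when reading a file
    with Image.open(path) as img:
        return np.asarray(img.convert("L").crop(box), dtype=np.uint8)


def to_matrix01(gray):
    """Binarize a grayscale array: 1 = dark (ink), 0 = light (paper)."""
    gray = np.asarray(gray, dtype=np.float64)
    darkest, lightest = gray.min(), gray.max()
    d = (darkest + lightest) / 2.0  # decision value D, relative to this box
    return (gray < d).astype(np.uint8)


# Tiny synthetic "box" from a low-contrast scan: ink near 90, paper near 200.
sample = np.array([[200, 90, 200],
                   [200, 90, 195],
                   [198, 95, 200]])
matrix01 = to_matrix01(sample)
print(matrix01)
```

Because D is computed per box, a pale scan and a dark scan of the same check should give similar 0/1 matrices.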
### Extract the digits
The checks I used have 35 digits, but this is just a parameter. Looking at the 0/1 matrices I obtained, I found that the height and the width of the digits are always the same, since they all use the same font.
I wrote two functions to help detect the digits:
- cutleft (removes the all-zero columns on the left of a given 0/1 matrix)
- removebottom (removes the all-zero rows at the bottom of a given 0/1 matrix)
I apply cutleft to the 0/1 matrix, so its first column is the start of the first digit. Since I know the width of a digit, I cut out a new 0/1 matrix containing only the columns of that first digit from the "mother" 0/1 matrix. I then apply removebottom to this matrix, which guarantees that the digit is always aligned with the bottom-left corner. Since I also know the height of a digit, I simply delete the rows above it, which are not useful.
After extracting this first digit matrix, I crop it off the "mother" 0/1 matrix and repeat the process as many times as there are digits.
Each extracted digit can then be flattened into a list of 0s and 1s (of length digitH x digitW). This list is the input for the machine learning step, so each bank check yields as many vectors as there are digits in its CMC7 code.
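The extraction loop described above could look something like this. It is a sketch under assumptions: `cutleft` and `removebottom` reuse the names from this README, but their bodies here are a NumPy re-implementation, and `extract_digits` with its parameter names is hypothetical.

```python
import numpy as np


def cutleft(m):
    """Drop the leading columns that contain only zeros."""
    cols = np.flatnonzero(m.any(axis=0))
    return m[:, cols[0]:] if cols.size else m[:, :0]


def removebottom(m):
    """Drop the trailing rows that contain only zeros."""
    rows = np.flatnonzero(m.any(axis=1))
    return m[:rows[-1] + 1, :] if rows.size else m[:0, :]


def extract_digits(matrix01, nb_digits, digit_w, digit_h):
    """Return up to nb_digits flat 0/1 vectors of length digit_h * digit_w."""
    vectors = []
    rest = matrix01
    for _ in range(nb_digits):
        rest = cutleft(rest)
        if rest.shape[1] == 0:
            break  # no ink left in the "mother" matrix
        digit = removebottom(rest[:, :digit_w])
        digit = digit[-digit_h:, :]  # keep only digit_h rows above the baseline
        # Pad to a fixed size, aligned on the bottom-left corner.
        canvas = np.zeros((digit_h, digit_w), dtype=np.uint8)
        h, w = digit.shape
        canvas[digit_h - h:, :w] = digit
        vectors.append(canvas.ravel())
        rest = rest[:, digit_w:]  # crop the digit off the "mother" matrix
    return vectors


# Two toy 2x2 "digits" separated by blank columns.
demo = np.array([
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0, 0],
], dtype=np.uint8)
digits = extract_digits(demo, nb_digits=2, digit_w=2, digit_h=2)
```

The padding step is an addition for robustness; with a fixed font and a clean scan, every digit should already have the expected size.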
### Creation of Labeled data
Once I could extract nbDigits vectors per check, I created the labeled data, that is, each vector together with its class. I wrote a TrainingFile.txt in this format:
`0 1 0 1 ..... 0 1 0 0 1 X`
`1 1 0 0 ..... 1 1 1 0 0 Y`
`0 1 0 1 ..... 0 1 1 1 1 Z`
where X, Y and Z are the values of the digits associated with each 0/1 list.
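Reading and writing this line format takes only a few lines. The sketch below is illustrative: the helper names `append_samples` and `load_samples` are hypothetical, and it assumes each label is a single whitespace-free token stored as a string.

```python
import os
import tempfile


def append_samples(path, vectors, labels):
    """Append one line per digit: pixels separated by spaces, label last."""
    with open(path, "a", encoding="utf-8") as f:
        for vec, label in zip(vectors, labels):
            pixels = " ".join(str(int(p)) for p in vec)
            f.write(pixels + " " + str(label) + "\n")


def load_samples(path):
    """Read the file back into (list of pixel lists, list of label strings)."""
    X, y = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue  # skip blank lines
            X.append([int(p) for p in parts[:-1]])
            y.append(parts[-1])
    return X, y


# Round trip on a temporary file with two toy 4-pixel samples.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "TrainingFile.txt")
    append_samples(demo_path, [[0, 1, 0, 1], [1, 1, 0, 0]], ["3", "7"])
    X, y = load_samples(demo_path)
```

Keeping labels as strings leaves room for the non-numeric CMC7 symbols, as long as each one is mapped to a token of your choice.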
### Training the model
I used the SVM classifier from scikit-learn to classify the digits. I split the labeled data into two groups: training data and testing data.
I first train the model on the training data, then measure its accuracy on the remaining data.
Example of report :

Tests Report :
- Total digits data : 2065 (59 bank checks)
- Nb of training digits (70.0%) : 1446
- Nb of testing digits (30.0%) : 619
- Accuracy : 100.0%
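A train/test run of this kind can be sketched with scikit-learn as follows. This is not the project's training script: the data is synthetic (noisy copies of random 40-pixel prototypes), and the linear kernel, the stratified split and the name `train_proportion` are assumptions made for the example.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for the labeled data: one random prototype per class,
# then 30 noisy copies of each with about 5% of the pixels flipped.
prototypes = {label: rng.integers(0, 2, 40) for label in ("0", "1", "2")}
X, y = [], []
for label, proto in prototypes.items():
    for _ in range(30):
        flip = rng.random(proto.size) < 0.05
        X.append(np.where(flip, 1 - proto, proto))
        y.append(label)
X, y = np.array(X), np.array(y)

train_proportion = 0.7  # share of the labeled data used for training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=train_proportion, random_state=0, stratify=y
)

model = SVC(kernel="linear")  # kernel choice is an assumption
model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))

print("Total digits data :", len(X))
print("Nb of training digits :", len(X_train))
print("Nb of testing digits :", len(X_test))
print("Accuracy : {:.1%}".format(accuracy))
```

On real scans the accuracy depends on how consistent the font, the scanner and the crop box are, so a held-out test set like this one is worth keeping.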
## Files
### toDataFile.py
Generates the labeled data from the bank checks you put in the trainingfiles subdirectory.
Filenames must follow the naming pattern defined in parameters.py.
Each digit is represented as a vector of length digitW*digitH, filled with 0s and 1s depending on whether each pixel is closer to white or to black.
Use :
`python toDataFile.py`
### CMC7training.py
Learns to classify the digits with an SVM, using the previously created labeled data.
Trains on a trainProportion share of the labeled data and uses the rest to test the accuracy of the model.
Use :
`python CMC7training.py`
### extractCMC7.py
Extract the CMC7 code from a jpeg scan, using the previously trained SVM model.
Use :
`python extractCMC7.py filename.jpeg`
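Reusing a trained model at extraction time could be sketched like this. It assumes the model is persisted with `joblib`, which may differ from what the scripts actually do; the file name `model.joblib`, the toy 4-pixel training set and the labels are all hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.svm import SVC

# Toy model standing in for the one produced by the training step.
X_train = np.array([[0, 0, 0, 1], [0, 0, 1, 1], [1, 1, 0, 0], [1, 0, 0, 0]])
y_train = np.array(["1", "1", "7", "7"])
model = SVC(kernel="linear").fit(X_train, y_train)

# Save the model once, then reload it as a separate script would.
with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "model.joblib")
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# One 0/1 vector per digit extracted from the scan, in reading order.
digit_vectors = np.array([[0, 0, 0, 1], [1, 1, 0, 0], [1, 0, 0, 0]])
code = "".join(restored.predict(digit_vectors))
print(code)
```

The predicted labels are joined in the order the digits were cut from the box, which gives the CMC7 code as a string.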
### Settings
Check parameters.py