Lessons learned developing and using a machine learning model to automatically transcribe 2.3 million handwritten occupation codes


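This paper describes the lessons learned while developing and using a machine learning model to automatically transcribe 2.3 million handwritten occupation codes. In the digitization of historical data, the high accuracy of machine learning for text recognition has made it an important tool for transcribing handwritten sources. Applying machine learning in production, however, requires a complete end-to-end pipeline that can handle large datasets, and a model that reaches high accuracy with only a small number of manual transcriptions. The correctness of the model's results must also be verified. The Occode machine learning pipeline described in the paper was designed to transcribe 2.3 million handwritten occupation codes from the Norwegian 1950 population census. The authors achieved 97% accuracy for the automatically transcribed codes, with only 3% of the codes requiring manual verification. By comparing the occupation code distribution in the automatic transcriptions with that in the training data, they confirmed that the results are representative, consistent with the census as a whole.

The following key points can be drawn from the paper:

1. **Machine learning in text recognition**: Machine learning models can recognize handwritten text effectively, particularly in the digitization of historical documents. OCR (optical character recognition) is an important tool in this field.
2. **End-to-end machine learning pipeline**: Using machine learning in production requires an efficient process that can handle large-scale data, covering data preprocessing, model training, validation and deployment.
3. **Model training and accuracy**: Training on a limited amount of manually labelled data requires a model that generalizes well. The model in the paper reached 97% accuracy, a notable result for a large-scale transcription task.
4. **Data verification**: Although the model performs well, 3% of the codes still require manual verification, which shows that human review remains a necessary step short of full automation.
5. **Statistical consistency of the results**: Verifying that the distribution of occupation codes in the model output matches the training data is a key step in assessing the model's performance and representativeness.
6. **Sharing experience and open source**: The authors believe their approach and experience may help other projects that plan to use machine learning for transcription, and they have released the source code for others to consult and use.
7. **Resources and policy support**: The paper mentions the investment in digitization at the National Archives of Norway, which reflects government commitment to preserving cultural heritage and support for applying new technology.

These points are relevant not only to the digitization of historical data but also to any project that involves large-scale text recognition and machine learning. By open-sourcing their code, the researchers have furthered the development of machine learning for historical data processing.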
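The distribution check mentioned in the summary can be illustrated with a minimal sketch. It assumes the comparison is made on relative code frequencies and uses total variation distance as the summary statistic. This excerpt does not say which statistic the authors used, so that choice, the function names and the sample codes are all hypothetical.

```python
from collections import Counter


def code_shares(codes):
    """Return the relative frequency of each occupation code."""
    counts = Counter(codes)
    total = sum(counts.values())
    return {code: n / total for code, n in counts.items()}


def total_variation_distance(codes_a, codes_b):
    """Half the sum of absolute share differences: 0 = identical, 1 = disjoint."""
    shares_a = code_shares(codes_a)
    shares_b = code_shares(codes_b)
    all_codes = set(shares_a) | set(shares_b)
    return 0.5 * sum(
        abs(shares_a.get(c, 0.0) - shares_b.get(c, 0.0)) for c in all_codes
    )


# Made-up codes for illustration only; not taken from the census.
training_codes = ["111", "111", "212", "305"]
predicted_codes = ["111", "212", "212", "305"]

distance = total_variation_distance(training_codes, predicted_codes)
print(f"total variation distance: {distance:.2f}")
```

A value near 0 would suggest that the automatically transcribed codes follow roughly the same distribution as the manually transcribed training set. What counts as "near" is a project-specific judgment.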


Lessons learned developing and using a machine learning model to automatically transcribe 2.3 million handwritten occupation codes

Bjørn-Richard Pedersen, Einar Holsbø, Trygve Andersen, Nikita Shvetsov, Johan Ravn, Hilde Leikny Sommerseth, Lars Ailo Bongo

Norwegian Historical Data Centre, UiT The Arctic University of Norway

Department of Computer Science, UiT The Arctic University of Norway

Medsensio AS, Tromsø, Norway

Corresponding author: lars.ailo.bongo@uit.no

Abstract

Machine learning approaches achieve high accuracy for text recognition and are therefore increasingly used for the transcription of handwritten historical sources. However, using machine learning in production requires a streamlined end-to-end pipeline that scales to the dataset size and a model that achieves high accuracy with few manual transcriptions. The correctness of the model results must also be verified. This paper describes our lessons learned developing, tuning and using the Occode end-to-end machine learning pipeline for transcribing 2.3 million handwritten occupation codes from the Norwegian 1950 population census. We achieve an accuracy of 97% for the automatically transcribed codes, and we send 3% of the codes for manual verification. We verify that the occupation code distribution found in our results matches the distribution found in our training data, which should be representative of the census as a whole. We believe our approach and lessons learned may be useful for other transcription projects that plan to use machine learning in production. The source code is available at: https://github.com/uit-hdl/rhd-codes

Introduction

Over the last few decades, we have witnessed a boom in the digitization of historical documents, and many national archives are developing services where the public can easily access their cultural heritage. For example, in the Norwegian state budget for 2019, 140 million NOK were allocated to the National Archives of Norway. About half of the budget was assigned to the further development of the Digital Archive, which stores and distributes digitized historical documents. This will increase the number and availability of digitized historical documents. However, there is an increasing demand from the research community to have these documents in a data format suitable for research based on data analysis. This demand has driven the development of more time- and cost-efficient systems in recent decades. Of particular interest for our project is the development of automatic text recognition for population data, typically characterized as handwritten documents with a tabular structure. These sources form the basis for the construction of the Norwegian Historical Population Register (HPR; http://www.rhd.uit.no/nhdc/hpr.html). The HPR will include the records of the 9.7 million people who lived in Norway in the period from 1801 to 1964. We are building life trajectories across multiple generations by linking individual and contextual attributes derived from population censuses and church records. In the HPR, we have over 10 million manually transcribed person entities from the 19th and early 20th century population censuses, and approximately 20 million person entities from church books. While the censuses have primarily been transcribed by professional transcribers, the church books have been transcribed by a mix of volunteers and professionals. Selected columns in the church books have recently been outsourced through a manual transcription agreement between the National Archives of Norway and three commercial genealogy companies, and 59 million person entities were transferred to HPR in October 2021. However, we still have an enormous amount of data awaiting transcription and integration into HPR, so automation would undoubtedly be both time- and cost-efficient.

Handwritten text recognition (HTR) methods have gained increased attention over the past two decades, and different approaches have been proposed for both online and offline automatic and semi-automatic transcription of handwritten text, character and/or digit recognition (Bottou et al., 1994; Ghosh & Maghari, 2017; Liu, Nakashima, Sako, & Fujisawa, 2003; Plamondon & Srihari, 2000). We describe an offline solution to automatically transcribe handwritten single and multi-digit numbers from the Norwegian full count 1950 population census. This was the last census where the information was aggregated manually, and it is therefore an important bridge to the later electronic censuses. The census manuscripts have been scanned by the National Archives of Norway. The questionnaires include two stippled rows for each individual registered (Figure 1). In the first row, the enumerator registered the characteristics of each person, divided into 26 columns, while the second row was kept blank. This row was strictly reserved for the statistical preparation of the census, and Statistics Norway has greyed out the row and typed a warning message: “Skriv ikke på denne linjen” (Do not write on this line). The second row consists of numbers written with a coloured pencil (usually red), by employees at Statistics Norway or their liaisons in the different Norwegian counties. These numbers represent codes for family position, de facto/de jure residency, marital status, occupation and education.

Handwritten digit recognition is a textbook example in machine learning, with numerous tutorials available for different machine learning frameworks. The MNIST dataset of 70,000 handwritten single digits (LeCun, Bottou, Bengio, & Haffner, 1998; http://yann.lecun.com/exdb/mnist/) was an early benchmark dataset in machine learning, which was widely used in many research papers ten years ago, and the best algorithms achieve almost perfect accuracy today (e.g. the 0.23% error rate found in Cireşan, Meier, & Schmidhuber, 2012). There are numerous solutions for automatic detection of numbers in images, such as reading street numbers. However, existing solutions trained on MNIST and other modern datasets may not work well for historical data, because the digits in these are written in many other ways (Kusetogullari, Yavariabdi, Cheddad, Grahn, & Hall, 2019).
