and contextual attributes derived from population censuses and church records. In the HPR, we
have over 10 million manually transcribed person entities from the 19th and early 20th century
population censuses, and approximately 20 million person entities from church books. While the
censuses have primarily been transcribed by professional transcribers, the church books have been
transcribed by a mix of volunteers and professionals. Selected columns in the church books have
recently been outsourced through a manual transcription agreement between the National Archives
of Norway and three commercial genealogy companies, and 59 million person entities were
transferred to HPR in October 2021. However, a large amount of data still awaits transcription
and integration into HPR, so automating the transcription would save both time and cost.
Handwritten text recognition (HTR) methods have gained increased attention over the past two
decades, and different approaches have been proposed for both online and offline automatic and
semi-automatic transcription of handwritten text, characters and/or digits (Bottou et al.,
1994; Ghosh & Maghari, 2017; Liu, Nakashima, Sako, & Fujisawa, 2003; Plamondon & Srihari,
2000). We describe an offline solution for automatically transcribing handwritten single- and
multi-digit numbers from the Norwegian full-count 1950 population census. This was the last census
where the information was aggregated manually, and it is therefore an important bridge to the later
electronic censuses. The census manuscripts have been scanned by the National Archives of
Norway. The questionnaires include two stippled rows for each individual registered (Figure 1).
In the first row, the enumerator registered the characteristics of each person, divided into 26
columns, while the second row was left blank by the enumerator. This row was strictly reserved for
the statistical preparation of the census, and Statistics Norway greyed out the row and typed a warning
message: “Skriv ikke på denne linjen” (Do not write on this line). The second row contains
numbers written with a coloured pencil (usually red), by employees at Statistics Norway or their
liaisons in the different Norwegian counties. These numbers represent codes for family position,
de facto/de jure residency, marital status, occupation and education.
Handwritten digit recognition is a textbook example in machine learning, with numerous tutorials
available for different machine learning frameworks. The MNIST dataset of 70,000 handwritten
single digits (LeCun, Bottou, Bengio, & Haffner, 1998; http://yann.lecun.com/exdb/mnist/) was an
early benchmark dataset in machine learning and was widely used in research papers about ten
years ago. The best algorithms now achieve almost perfect accuracy (e.g. the 0.23% error rate
reported by Cireşan, Meier, & Schmidhuber, 2012). There are numerous solutions for automatic
detection of numbers in images, such as reading street numbers. However, existing solutions
trained on MNIST and other modern datasets may not work well for historical data, because
digits in historical documents are written in many different styles (Kusetogullari, Yavariabdi, Cheddad, Grahn, & Hall,
2019).
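To put the cited error rate in perspective, the short Python sketch below converts an error rate given in percent into a count of misclassified images. It assumes the standard MNIST split of 60,000 training and 10,000 test images, and that the rate is measured on the test split. The function names are illustrative only and are not taken from any of the cited works.

```python
# Illustrative helpers (hypothetical names, not from the cited papers).

MNIST_TEST_SIZE = 10_000  # assumption: the standard MNIST test split


def misclassified_count(error_rate_percent: float,
                        n_images: int = MNIST_TEST_SIZE) -> int:
    """Return the number of images that an error rate (in percent) corresponds to."""
    return round(error_rate_percent / 100 * n_images)


def accuracy_percent(error_rate_percent: float) -> float:
    """Return the accuracy in percent implied by an error rate in percent."""
    return 100.0 - error_rate_percent


if __name__ == "__main__":
    rate = 0.23  # error rate cited in the text, in percent
    print(misclassified_count(rate), "misclassified test images")
    print(f"{accuracy_percent(rate):.2f}% accuracy")
```

Under these assumptions, a 0.23% error rate corresponds to 23 misclassified digits out of 10,000, which illustrates how little room for improvement remains on this benchmark compared with historical material.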