Our handwritten past, automatically transcribed
The historical documents lovingly preserved in the world's archives still have a lot to tell us - and unlocking the secrets of the handwritten ones has recently become easier. EU-funded researchers have made innovative automated transcription and indexing technology freely available, boosting Europe's research capabilities.
Different scripts, difficult layouts, diverse languages and linguistic variants: until very recently, says Günter Mühlberger of the University of Innsbruck, it was barely imaginable that a historical manuscript might one day be searchable in much the same way as a contemporary document.
And yet, this seemingly impossible dream has come true following breakthroughs within the past five years or so, he explains.
Mühlberger is the coordinator of the EU-funded e-infrastructure project READ. ‘We have achieved fundamental progress in the domains of handwritten text recognition, layout analysis and keyword spotting,’ he reports.
READ provides a growing community of users with access to innovative technology and thereby with an easier way to explore an important part of European and international cultural heritage.
‘We are the only research infrastructure that makes these technologies directly available worldwide to anyone with an interest in historical documents,’ Mühlberger notes. Known as Transkribus, the service platform developed by members of the READ consortium builds on earlier work in the predecessor project Transcriptorium.
Registration is all that is required, Mühlberger adds. Some 14 000 users had seized this opportunity by August 2018.
Sign up ...
Transkribus automatically transcribes scanned handwritten documents, and it does so in many languages, including those written in scripts that run from right to left, such as Arabic or Hebrew. Training the software to recognise a particular hand greatly improves the accuracy of the transcription, Mühlberger underlines.
For the best results, an initial set of 50 to 100 pages has to be processed with user assistance, he explains. Transkribus can already achieve character error rates as low as 3.5 % or so for handwriting, Mühlberger says, and the project's research in the areas of pattern recognition, artificial intelligence and natural language processing continues.
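To give a sense of what a character error rate of around 3.5 % means, the short sketch below computes the metric as it is commonly defined: the number of single-character insertions, deletions and substitutions needed to turn the automatic transcription into the correct text, divided by the length of the correct text. This is an illustration only; the function names are hypothetical and the article does not describe how READ itself measures the figure.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum number of character edits."""
    # previous[j] holds the distance between the reference prefix
    # processed so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edits needed, as a fraction of the reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical example: 7 wrong characters in a 200-character line.
    correct = "a" * 200
    transcribed = "b" * 7 + "a" * 193
    print(f"{character_error_rate(correct, transcribed):.1%}")
```

On this definition, a rate of 3.5 % corresponds to roughly seven characters needing correction in every 200 characters of text.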
‘But, of course, training the software in this way is not realistic in an archive with a lot of different handwriting,’ he comments as he introduces another key feature. The platform also includes a keyword spotting or indexing function, which works even if the transcription is patchy.
Search operations can therefore be carried out independently of the automated transcription. ‘This technology has transformational potential for any archive in Europe,’ Mühlberger notes. ‘With the training data included in Transkribus, it is already possible to search documents from the Middle Ages until today in a convenient way.’
...and never look back
‘A newly developed interface enables users to upload their own documents, work with these files, perform text recognition and so on,’ Mühlberger says. ‘And they can export their data in several standard formats. We receive a lot of very, very positive feedback.’
Researchers use Transkribus, for instance, for work focusing on figures as diverse as the 17th-century Dutch admiral Michiel de Ruyter and the 20th-century French philosopher Michel Foucault, or for the study of specific types of documents, such as manuscripts in Gothic lettering.
It also appeals to many other users. The national archives of Finland and the Netherlands, for example, are considering ways to integrate the technology into their digitisation workflow.
‘Our claim is that we revolutionise access to historical documents, and I believe we have really managed to make a major contribution in the digitisation of historical archives,’ Mühlberger says, summing up the achievements so far. The partners are exploring options to safeguard the sustainability of the e-infrastructure after the project ends in June 2019.
© Günter Mühlberger - Screenshot of Transkribus, 2018