Our handwritten past, automatically transcribed

The historical documents lovingly preserved in the world's archives still have a lot to tell us - and unlocking the secrets of the handwritten ones has recently become easier. EU-funded researchers have made innovative automated transcription and indexing technology freely available, boosting Europe's research capabilities.


Published: 30 January 2019  
Related theme(s) and subtheme(s)
Cultural Heritage
Research policy  |  Horizon 2020
Countries involved in the project described in the article
Austria  |  Finland  |  France  |  Germany  |  Greece  |  Spain  |  Switzerland  |  United Kingdom

Image

© Funny Studio #91744874 2019, source:stock.adobe.com

Different scripts, difficult layout, diverse languages and linguistic variants… until very recently, says Günter Mühlberger of the University of Innsbruck, it was barely imaginable that a historical manuscript might one day be searchable in much the same way as a contemporary document.

And yet, this seemingly impossible dream has come true following breakthroughs within the past five years or so, he explains.

Mühlberger is the coordinator of the EU-funded e-infrastructure project READ. ‘We have achieved fundamental progress in the domains of handwritten text recognition, layout analysis and keyword spotting,’ he reports.

READ provides a growing community of users with access to innovative technology and thereby with an easier way to explore an important part of European and international cultural heritage.

‘We are the only research infrastructure that makes these technologies directly available worldwide to anyone with an interest in historical documents,’ Mühlberger notes. Known as Transkribus, the service platform developed by members of the READ consortium builds on earlier work in predecessor project Transcriptorium.

Registration is all that is required, Mühlberger adds. Some 14 000 users had seized this opportunity by August 2018.

Sign up ...

Transkribus automatically transcribes scanned handwritten documents, and it does so in many languages – including script running from right to left, such as text in Arabic or Hebrew. Training the software to recognise a particular hand greatly improves the accuracy of the transcription, Mühlberger underlines.

For the best results, an initial set of 50 to 100 pages has to be processed with user assistance, he explains. Transkribus can already achieve character error rates as low as about 3.5 % for handwriting, Mühlberger says, and the project's research in the areas of pattern recognition, artificial intelligence and natural language processing continues.
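To give a sense of what such a figure means: a character error rate is commonly calculated as the number of single-character insertions, deletions and substitutions needed to turn the automatic transcription into the correct one, divided by the length of the correct text. The short Python sketch below illustrates that common definition only. It is not taken from Transkribus, the function names are invented for this illustration, and the project may measure errors differently.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions separating two strings."""
    # previous[j] holds the distance between the reference prefix
    # processed so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # character missing from the hypothesis
                current[j - 1] + 1,      # extra character in the hypothesis
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the correct transcription.
    (Hypothetical helper, assuming the common definition of the metric.)"""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Under this definition, a rate of 3.5 % corresponds to roughly seven wrong, missing or superfluous characters in a passage of 200 characters.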

‘But, of course, training the software in this way is not realistic in an archive with a lot of different handwriting,’ he comments as he introduces another key feature. The platform also includes a keyword spotting or indexing function, which works even if the transcription is patchy.

Search operations can therefore be carried out independently of the automated transcription. ‘This technology has transformational potential for any archive in Europe,’ Mühlberger notes. ‘With the training data included in Transkribus, it is already possible to search documents from the Middle Ages until today in a convenient way.’

...and never look back

‘A newly developed interface enables users to upload their own documents, work with these files, perform text recognition and so on,’ Mühlberger says. ‘And they can export their data in several standard formats. We receive a lot of very, very positive feedback.’

Researchers use Transkribus, for instance, for work on figures as diverse as the 17th-century Dutch admiral Michiel de Ruyter and the 20th-century French philosopher Michel Foucault, or for the study of specific types of documents, such as manuscripts in Gothic lettering.

It also appeals to many other users. The national archives of Finland and the Netherlands, for example, are considering ways to integrate the technology into their digitisation workflow.

‘Our claim is that we revolutionise access to historical documents, and I believe we have really managed to make a major contribution in the digitisation of historical archives,’ Mühlberger says, summing up the achievements so far. The partners are exploring options to safeguard the sustainability of the e-infrastructure after the project ends in June 2019.

Image

© Günter Mühlberger - Screenshot of Transkribus, 2018

Project details

  • Project acronym: READ
  • Participants: Austria (Coordinator), Switzerland, Germany, Greece, Spain, Finland, France, UK
  • Project N°: 674943
  • Total costs: € 8 220 716
  • EU contribution: € 8 220 716
  • Duration: January 2016 to June 2019
