European Commission Digital

CEF TELECOM GRANT BENEFICIARY

ParaCrawl taps the World Wide Web for language resources

EU funding supports ParaCrawl, the largest collection of language resources for many European languages – significantly improving machine translation quality


Members of the ParaCrawl consortium. Photo courtesy of ParaCrawl.


Need for multilingual communication

The European Commission’s Machine Translation (MT) tool, eTranslation, is software that automatically translates text from one language to another. The tool’s primary objective was to help European public administrations and officials communicate across borders about EU policy and legislation. Consequently, eTranslation’s MT engines were carefully trained on formal, legal and administrative texts in the Union’s 24 official languages, plus Icelandic and Norwegian. But as the need for MT extends beyond formal texts, the Commission is now expanding the tool’s capabilities towards more informal, generic language.

Originally, eTranslation was trained on translations produced by EU translators over the past decades, but modern technology has enabled the automated collection of translated texts from new sources, such as multilingual websites. Today, the language resources gathered from the internet with European Commission funding make up the largest collection for many European languages, contributing significantly to eTranslation and to the machine translation community as a whole. The language resources will be used to improve eTranslation, which will help operate pan-European digital services in a multilingual environment, and the results will also be freely available to anyone interested in building better language tools in Europe.

Results and benefits

To train eTranslation to handle informal texts, the Commission needed informal language resources, in this case written parallel corpora: collections of translated texts, mainly between English and another European language. This is especially important in applications such as online dispute resolution, where informal language dominates. For example, for the translation of “the chair is broken” to be intelligible, the word “chair” should be rendered in French as “la chaise” (the piece of furniture) rather than “le président” (the chairman).

When the Commission put out a call for language resources, a consortium called ParaCrawl proposed crawling the World Wide Web for multilingual content from websites. Since the Commission decided to co-fund ParaCrawl, which is led by the University of Edinburgh, the consortium has made five releases of parallel corpora.

The value of ParaCrawl’s work was recently showcased at the Conference on Machine Translation (WMT), which took place in July 2019. After eTranslation’s MT engines were trained with ParaCrawl’s language resources, translation quality increased by between 1.1 and 3.5 BLEU points depending on the language, BLEU being a metric widely used to measure MT system quality. What makes the result even more significant is that eTranslation was at the time trained with ParaCrawl v3, while the current release is ParaCrawl v5, which is more than twice the size of v3 and much cleaner. Even though ParaCrawl’s language resources were available to other conference participants too, eTranslation’s strong MT capabilities secured the tool top rankings in all four of the translation tasks it took part in.
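
For readers unfamiliar with BLEU, the idea can be illustrated with a small Python sketch. It is a simplified, single-sentence, single-reference version written for this article, not the scoring setup used at WMT or by eTranslation: real evaluations score whole test sets with standard tooling. The function names `ngrams` and `simple_bleu` are hypothetical. The score is reported on a 0 to 100 scale, so one "BLEU point" is one unit on that scale.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU on a 0-100 scale (illustrative only).

    Combines clipped n-gram precisions for n = 1..max_n with a brevity
    penalty for candidates shorter than the reference. No smoothing is
    applied, so any n-gram order with zero overlap gives a score of 0.
    """
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or overlap == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))

    # Penalise candidates that are shorter than the reference.
    if len(cand) >= len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))

    return 100 * brevity_penalty * math.exp(sum(log_precisions) / max_n)


reference = "la chaise est cassée"
good = simple_bleu("la chaise est cassée", reference)
bad = simple_bleu("le président est cassé", reference)
print(good, bad)
```

A translation identical to the reference scores 100, while one sharing no four-word sequence with it scores 0 in this unsmoothed sketch.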

Overview of ParaCrawl v5 corpora sizes in terms of English word counts.


Taking a helicopter view on ParaCrawl, the project is contributing towards several high-level goals and objectives of the European Commission. For the Connecting Europe Facility (CEF) programme, ParaCrawl is helping the Commission further develop its tool, eTranslation, while freely and openly sharing the same resources with anyone else interested in making multilingual communication easier and better. It is also helping to achieve the Digital Single Market in Europe by bringing down the language barriers that are blocking cross-border e-commerce and easy access to products and services. The vast language resources collected, even for low-resourced languages, contribute towards Europeans being able to communicate and consume online services in their preferred language.

ParaCrawl has also attracted global interest in its language resources from the private sector, e-commerce in particular. A global corporation paid for the creation of additional corpora for non-English language pairs, now also included as bonus releases in ParaCrawl v5. Furthermore, the Japanese telecommunications company NTT recently used ParaCrawl's open-source software to create "JParaCrawl", the largest publicly available English-Japanese parallel corpus.

To make sure that ParaCrawl’s work benefits as many people as possible, ParaCrawl’s language resources and open-source tools are publicly available for the MT research community and MT system owners to experiment with and to train their engines. As the language resources are simple pairs of translated texts, they can be used to develop any MT system regardless of technology.
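
As an illustration of what "simple pairs of translated texts" can look like in practice, here is a minimal Python sketch that reads sentence pairs from tab-separated text. The one-pair-per-line, tab-separated layout, the sample sentences and the function name `read_pairs` are assumptions made for this example; check the documentation of the actual release for the formats ParaCrawl provides.

```python
# Hypothetical sample: one sentence pair per line, source and target
# separated by a tab character.
SAMPLE = (
    "The chair is broken.\tLa chaise est cassée.\n"
    "Thank you for your help.\tMerci pour votre aide.\n"
)


def read_pairs(text):
    """Return a list of (source, target) sentence pairs.

    Blank lines and lines that do not contain exactly two
    tab-separated fields are skipped.
    """
    pairs = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) != 2:
            continue
        source, target = fields[0].strip(), fields[1].strip()
        if source and target:
            pairs.append((source, target))
    return pairs


pairs = read_pairs(SAMPLE)
for source, target in pairs:
    print(f"{source}  ->  {target}")
```

Because each entry is just a source sentence and its translation, the same list can be passed to whatever training tool an MT system uses.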

ParaCrawl’s open-source data collection pipeline

The consortium uses open-source software and state-of-the-art methods in the process that starts from crawling web content. Consortium members are specialised in different tasks along the data collection pipeline:

  • Crawling
  • Text extraction
  • Language detection
  • Multilingual content identification
  • Document and sentence alignment
  • Cleaning and anonymisation
  • Evaluation

Aligning documents and sentences to find translation matches from an enormous pool of content requires significant data processing and computing power. Only 0.01% of content finds a matching translation, yet ParaCrawl still managed to find and clean close to 37 million matching English-German sentence pairs with 930 million words. Even for a low-resource language, such as Maltese, ParaCrawl found and cleaned over 177,000 sentence pairs with 4 million words.
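
To put these figures in perspective, the short Python sketch below computes the average number of words per sentence pair from the rounded numbers quoted above. The counts are the approximate values from this article ("close to 37 million", "over 177,000"), so the results are rough estimates; the names `CORPORA` and `average_words_per_pair` are made up for this example.

```python
# Approximate figures quoted in the article: (sentence pairs, words).
CORPORA = {
    "English-German": (37_000_000, 930_000_000),
    "English-Maltese": (177_000, 4_000_000),
}


def average_words_per_pair(sentence_pairs, words):
    """Average number of words per aligned sentence pair."""
    return words / sentence_pairs


averages = {
    name: average_words_per_pair(pairs, words)
    for name, (pairs, words) in CORPORA.items()
}

for name, avg in averages.items():
    print(f"{name}: about {avg:.1f} words per sentence pair")
```

On these rounded figures, both corpora average roughly 23 to 25 words per sentence pair.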

ParaCrawl has established its position as an important contributing member of the wider MT community. ParaCrawl was mentioned over 200 times in the WMT 2019 proceedings, and its data was used in the conference’s shared task on parallel corpus filtering. The outputs of the task, to which many leading universities, research centres, global corporations and departments of national defence contributed, will be fed back to improve ParaCrawl’s data collection pipeline. In turn, ParaCrawl’s ever-growing parallel corpus is provided back to the MT community.

Next steps

ParaCrawl continues on its path, determined to create the largest parallel corpora for many more languages. ParaCrawl is looking forward to expanding to low-resource languages, such as Basque, Catalan/Valencian and Galician. The consortium will also use new data sources, such as patents and website archives, and extend beyond HTML to PDFs and word processing formats.

As ParaCrawl’s resources are used to train eTranslation, the users of eTranslation can expect noticeable improvements in the translation quality of informal texts in the near future.


How can eTranslation help you?

eTranslation is one of the European Commission’s digital building blocks offered by the Connecting Europe Facility (CEF) programme. eTranslation’s services can be used in two ways. Officials are encouraged to use eTranslation’s web browser service for ad-hoc translation of documents and text snippets. Public authorities are offered support services for integrating the tool into their digital public services to create multilingual content.

We would be happy to help you get started. Visit us at the links below to learn more.