Blog

European Commission Digital

CEF TELECOM GRANT BENEFICIARY

ParaCrawl taps the World Wide Web for language resources

EU funding supports ParaCrawl, the largest collection of language resources for many European languages – significantly improving machine translation quality


Members of the ParaCrawl consortium. Photo courtesy of ParaCrawl.


Quick facts

Need for multilingual communication

The European Commission’s Machine Translation (MT) tool, eTranslation, is software that automates the translation from one language to another. The first and foremost objective for the tool was to help European public administrations and officials in cross-border communication about EU policy and legislation. Consequently, eTranslation’s MT engines were carefully trained on formal, legal and administrative texts in the Union’s 24 official languages, Icelandic and Norwegian. But as the need for MT capabilities extends beyond formal texts, the Commission is now expanding the tool’s capabilities towards more informal, generic language.

Originally, eTranslation’s capabilities were trained with translations carried out by EU translators over the past decades, but modern technology has enabled the automated collection of translated texts from new sources, such as multilingual websites. Today, the language resources gathered from the internet with European Commission funding make up the largest collection for many European languages, significantly contributing to eTranslation and the machine translation community as a whole. Not only will the language resources be used to improve eTranslation, which will help to operate pan-European digital services in a multilingual environment, but the results will also be freely available to anyone interested in building better language tools in Europe. 

Results and benefits

To train eTranslation to understand informal texts, the Commission needed informal language resources, in this case written parallel corpora: translated texts, mainly between English and another European language. This is especially important in applications such as online dispute resolution, where informal language dominates. For example, for the translation to be intelligible, the word “chair” in “the chair is broken” should be translated into French as “la chaise” (the piece of furniture) rather than “le président” (the chairman).

When the Commission put out a call for language resources, a consortium called ParaCrawl proposed crawling the World Wide Web for multilingual content from websites. Since the Commission decided to co-fund ParaCrawl, which is led by the University of Edinburgh, the consortium has made five releases of parallel corpora.

The value of ParaCrawl’s work was recently showcased at the Conference on Machine Translation (WMT), which took place in July 2019. After eTranslation’s MT engines were trained with ParaCrawl’s language resources, translation quality for different languages increased by between 1.1 and 3.5 BLEU points, BLEU being a standard metric for measuring MT system quality. What makes the result even more significant is that at the time, eTranslation was trained with ParaCrawl v3, while the current release is ParaCrawl v5. ParaCrawl v5 is more than twice the size of v3 and much cleaner. Even though ParaCrawl’s language resources were available to other conference participants too, eTranslation’s strong MT capabilities secured the tool top rankings in all four translation tasks it took part in.
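To give a feel for what a BLEU point measures, here is a minimal sketch of a corpus-level BLEU score: the geometric mean of clipped 1- to 4-gram precisions, multiplied by a brevity penalty and reported on a 0-100 scale. It assumes whitespace tokenisation, a single reference per sentence and no smoothing; the function names are illustrative, and this is not the evaluation code used at WMT.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale, one reference per hypothesis."""
    matches = [0] * max_n  # clipped n-gram matches, per n-gram order
    totals = [0] * max_n   # n-grams produced by the MT system, per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts, ref_counts = ngrams(hyp_tok, n), ngrams(ref_tok, n)
            # Counter intersection clips each n-gram count by the reference count.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # no smoothing in this sketch
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Penalise output that is shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    system_output = ["the chair is broken right now"]
    human_reference = ["the chair is broken today now"]
    print(round(corpus_bleu(system_output, human_reference), 1))
```

An improvement of 1.1 to 3.5 points is a difference between two such scores, computed for the same test set before and after adding the new training data.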

Overview of ParaCrawl v5 corpora sizes in terms of English word counts.


Taking a helicopter view on ParaCrawl, the project is contributing towards several high-level goals and objectives of the European Commission. For the Connecting Europe Facility (CEF) programme, ParaCrawl is helping the Commission further develop its tool, eTranslation, while freely and openly sharing the same resources with anyone else interested in making multilingual communication easier and better. It is also helping to achieve the Digital Single Market in Europe by bringing down the language barriers that are blocking cross-border e-commerce and easy access to products and services. The vast language resources collected, even for low-resourced languages, contribute towards Europeans being able to communicate and consume online services in their preferred language.

ParaCrawl has also experienced global interest in its language resources from the private sector, e-commerce in particular. A global corporation paid for the creation of additional corpora between non-English language pairs, now also included as bonus releases in ParaCrawl v5. Furthermore, the Japanese telecommunications company NTT recently ran ParaCrawl's open-source software to create "JParaCrawl", the largest publicly available English-Japanese parallel corpus.

To make sure that ParaCrawl’s work benefits as many as possible, ParaCrawl’s language resources and open-source tools are publicly available for the MT research community and MT system owners to experiment with and to train their engines. As the language resources are simple pairs of translated texts, they can be used to develop any MT system regardless of technology.

ParaCrawl’s open-source data collection pipeline

The consortium uses open-source software and state-of-the-art methods in the process that starts from crawling web content. Consortium members are specialised in different tasks along the data collection pipeline:

  • Crawling
  • Text extraction
  • Language detection
  • Multilingual content identification
  • Document and sentence alignment
  • Cleaning and anonymisation
  • Evaluation

Aligning documents and sentences to find translation matches from an enormous pool of content requires significant data processing and computing power. Only 0.01% of content finds a matching translation, yet ParaCrawl still managed to find and clean close to 37 million matching English-German sentence pairs with 930 million words. Even for a low-resource language, such as Maltese, ParaCrawl found and cleaned over 177,000 sentence pairs with 4 million words.
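As an illustration of the cleaning step, the sketch below filters candidate sentence pairs with a few simple heuristics: it drops pairs with an empty side, untranslated copies, pairs whose lengths differ implausibly, and exact duplicates. The function name, the threshold and the rules themselves are assumptions chosen for illustration; they are not ParaCrawl's actual cleaning software.

```python
def clean_pairs(pairs, max_ratio=3.0):
    """Keep plausible (source, target) sentence pairs and drop obvious noise.

    max_ratio is a hypothetical threshold on the word-count ratio between
    the longer and the shorter side of a pair.
    """
    seen = set()
    kept = []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            continue  # one side is empty
        if src == tgt:
            continue  # untranslated copy, e.g. a menu label
        src_len, tgt_len = len(src.split()), len(tgt.split())
        if max(src_len, tgt_len) / min(src_len, tgt_len) > max_ratio:
            continue  # lengths too different to be a translation
        if (src, tgt) in seen:
            continue  # exact duplicate
        seen.add((src, tgt))
        kept.append((src, tgt))
    return kept


if __name__ == "__main__":
    candidates = [
        ("The chair is broken.", "La chaise est cassée."),
        ("The chair is broken.", "La chaise est cassée."),
        ("Home", "Home"),
        ("", "Accueil"),
    ]
    print(clean_pairs(candidates))
```

Heuristics like these are cheap to run, which matters when only a tiny fraction of the crawled content survives to the final corpus.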

ParaCrawl has established its position as an important contributing member to the wider MT community. ParaCrawl was mentioned over 200 times in the WMT 2019 proceedings, and its data was used in the conference’s shared task on parallel corpus filtering. The outputs of the task – to which many leading universities, research centres, global corporations and departments of national defence contributed – will be fed back to improve ParaCrawl’s data collection pipeline. Conversely, ParaCrawl’s ever-growing parallel corpus is provided back to the MT community.

Next steps

ParaCrawl continues on its path, determined to create the largest parallel corpora for many more languages. ParaCrawl is looking forward to expanding to low-resource languages, such as Basque, Catalan/Valencian and Galician. The consortium will also use new data sources, such as patents and website archives, and extend beyond HTML to PDFs and word processing formats.

As ParaCrawl’s resources are used to train eTranslation, the users of eTranslation can expect noticeable improvements in the translation quality of informal texts in the near future.


How can eTranslation help you?

eTranslation is one of the European Commission’s digital building blocks offered by the Connecting Europe Facility (CEF) programme. eTranslation’s services can be consumed in two ways. Officials are encouraged to use eTranslation’s web browser service for ad-hoc translation of documents and text snippets. Public authorities are offered support services for integrating the tool into their digital public services to create multilingual content.

We would be happy to help you get started. Visit us at the links below to learn more.





How can electronic identity provide a safer internet?


The European Commission is pleased to announce the release of this video that shows how electronic identification (eID) can provide a safer internet for European citizens. 

The role of eID in an increasingly digitalised society

Today’s world is increasingly globalised and connected, as more people are living, working and travelling across borders. While there is freedom of movement and business, the administrative burden of conducting secure, online cross-border transactions with public and private sector entities is still high. Over the past years, EU Member States have been working hard to enable citizens and businesses to access online public services using their national eID. Such services may include declaring taxes, consulting health services and opening a bank account online, in full compliance with the Anti-Money Laundering Directive. All these efforts have been reinforced by the eIDAS Regulation (EU) 910/2014 on electronic identification, authentication and trust services, which aims at making national eID schemes interoperable across Europe in order to facilitate cross-border access to online public services. Currently, 11 Member States have notified their eID scheme for mutual recognition across borders and others will follow in due course.

As there are more services available online, there is an increasing need for users to provide personal data to authenticate themselves to online platforms and services. eID plays an important role in this regard, as it allows citizens and businesses to control their daily digital lives, and it is the medium that gives users access to online services, as it provides trustworthy information about the identity of the user. The trustworthy information, in turn, can help overcome the challenges derived from the increasing digitalisation in society by ensuring respect of privacy and a high level of protection of personal data. The process of doing so is explored in greater depth in the following section.

How eID can provide a safer internet

eID can help solve certain challenges resulting from the increasing digitalisation of society, such as tackling disinformation spread through online fake news circulated by bots or forged identities, and protecting children from accessing websites that have age restrictions in place (Figure 1).

Figure 1: Perception of the threat of online disinformation and children's access to age-restricted content

In fact, eID can foster online accountability through trustworthy identification and authentication means in line with the eIDAS Regulation, thus encouraging more responsible behaviour online. When a user logs into a service with an eID, two types of information can be transmitted to the service the user would like to access:

  1. Personal information, i.e. the name or date of birth, etc. to identify who the user is;
  2. A digital authentication, where the user proves that he/she is indeed the person using the eID. This may occur by means of credentials, such as a login and a password, or through the possession of a token or device, such as a smartphone.

Figure 2: Showcase of the type of information that can be transmitted via eID.

As the user keeps full control of his/her eID, he/she can choose to share a minimal set of information with the service, for example his/her nationality or country of residence. In this case, only this type of information is shared with service providers and not other personal information. In fact, by enabling and supporting the use of trusted credentials or attributes securely associated with a verified identity, users will be able to minimise and control the sharing of personal data, in line with the General Data Protection Regulation (EU) 2016/679. As a result, eID can help to overcome online misinformation and to protect minors from age-restricted content. In the former case, a user can check whether an account is linked to a real human and not to a bot or a fake profile that spreads disinformation; in the latter, a service can securely verify whether children are old enough to access specific content online.
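The idea of sharing a minimal set of information can be sketched in a few lines. In the hypothetical example below, a service asks for an age check and receives only a yes/no answer derived from the date of birth, never the date itself. The class, function and attribute names are invented for illustration and do not correspond to the eIDAS technical specifications or to any national eID scheme.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class VerifiedIdentity:
    """Hypothetical set of attributes held by the user's eID."""
    name: str
    date_of_birth: date
    nationality: str


def age_on(identity, today):
    """Age in full years on the given day."""
    years = today.year - identity.date_of_birth.year
    birthday = (identity.date_of_birth.month, identity.date_of_birth.day)
    if (today.month, today.day) < birthday:
        years -= 1  # birthday has not happened yet this year
    return years


def disclose(identity, requested, today):
    """Return only the requested attributes that the user allows to be shared.

    Name and date of birth are deliberately not shareable in this sketch.
    """
    shareable = {
        "nationality": identity.nationality,
        "is_over_18": age_on(identity, today) >= 18,
    }
    return {key: shareable[key] for key in requested if key in shareable}


if __name__ == "__main__":
    user = VerifiedIdentity("Jane Doe", date(2004, 6, 15), "FR")
    print(disclose(user, ["is_over_18"], date(2020, 1, 15)))
```

The service learns whether the age condition is met and nothing else, which is the data minimisation principle the paragraph above describes.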

The Connecting Europe Facility (CEF) eID Building Block primarily supports the Member States in the roll-out of the eIDAS Network (the technical infrastructure which connects national eID schemes). CEF eID is a set of services (including software, documentation, training and support) provided by the European Commission and endorsed by the Member States, which helps public administrations and private Service Providers to extend the use of their online services to citizens from other European countries.



Blockchain - ESSIF Stakeholder meeting

The European Blockchain Partnership is pleased to invite you to the second edition of the European Self Sovereign Identity Framework (ESSIF) user group stakeholder meeting, taking place in Brussels on Wednesday 15 January 2020 from 9:30 to 13:00.

Learn more about the progress of this project as part of the European Blockchain Services Infrastructure, the potential next steps and possible interaction with the emerging SSI market and, most importantly, exchange ideas on how to collaborate on a self-sovereign identity for citizens and organisations in Europe and the world. Registration is open until 10 January.

The European Blockchain Services Infrastructure (EBSI) is a joint initiative from the European Commission and the European Blockchain Partnership to deliver EU-wide cross-border public services using blockchain technology. The EBSI will be materialised as a network of distributed nodes across Europe (the blockchain), leveraging an increasing number of applications focused on specific use cases. In 2020, EBSI will become a CEF Building Block, providing reusable software, specifications and services to support adoption by EU institutions and European public administrations.

What does this mean for you? Check out the video!




CEF Big Data Building Block presented at European Court of Auditors event

On 27-28 November 2019, the European Court of Auditors organised the conference on “Big and open data for EU supreme audit institutions”.

The conference looked at the current status of big and open data in audit institutions in the EU Member States.

Big data is high-volume, high-velocity and high-variety information that requires new forms of processing to enable enhanced decision-making, insight discovery and process optimisation.

Around 10 national representatives presented on topics ranging from the use of text mining in the auditing process (Estonia) to interactive visualisation of government open data (Denmark). Please find attached the Draft Agenda for more info.

This event launched TINA, the network of EU supreme audit institutions (SAIs) to work on Technology and Innovation for Audit, enabling all Supreme Audit Institutions to progress together towards auditing more efficiently and effectively in the context of increasingly digital business processes.

All presentations from the event are available on the European Court of Auditors website.

Big data is a valuable resource 

Two presentations looked at the Connecting Europe Facility (CEF) Big Data Test Infrastructure Building Block.

  • Daniele Rizzi, European Commission Policy Officer - Data Policy and Innovation on “EU Open Data Policy”
  • Marc Vanderperren, European Commission Head of Sector - Data, Information and Knowledge Management, on “Paving the way for EU data-driven administration”

Big Data Test Infrastructure (BDTI) helps public administrations improve the experience of the citizen, make government more efficient and boost business and the wider economy through big data.

BDTI is a big data platform that offers virtual environments, allowing public organisations to experiment with big data sources, methods and tools. Users can launch pilot projects on big data and data analytics, through a selection of software tools. BDTI allows sharing data sources across policy domains and organisations and having access to best practices and methodologies on big data.

BDTI provides:

  • Software: Big data and analytics software catalogue
  • Stakeholder Management Services: A knowledge base, onboarding & stakeholder follow-up with the big data community
  • A dedicated service desk
  • Testing Services: big data platform and data catalogue and data exchange APIs


Да! CEF eTranslation now translates to and from Russian!


The Connecting Europe Facility (CEF) eTranslation Building Block can now process requests between English (EN) and Russian (RU). eTranslation provides high-quality machine translation for the 24 official languages of the EU, Icelandic and Norwegian.

The EN↔RU engine is the first of a series of engines that will translate in a more colloquial style than the more formal engines, which are built on the copious translation memories produced by the translators of the EU institutions over the past decades.

You can find it by choosing the “General text” domain on the interface, as shown below.


CEF eTranslation provides a machine translation system that is aware of the correct terminology and style for different contexts. It:

  • Draws upon decades’ worth of work by EU translators (over 1 billion sentences in the 24 official EU languages, plus Icelandic and Norwegian) and is thus capable of understanding specific EU policy and legal terminology.
  • Is designed to retain the format of structured documents during translation.
  • Can translate multiple documents to multiple languages at once.

It also allows for easy access to the machine translation service for both people and machines and guarantees continuous service of high quality, with due consideration for the confidentiality and security of data during the translation process.

You can use eTranslation as a stand-alone translation service (for government officials in public administrations in the EU) or integrated into any public sector website, from local administrations to pan-EU projects.

The Commission is also working towards coverage of other non-EU languages that are economically, scientifically and socially relevant (such as Arabic, Chinese, Japanese and Turkish).











How can eID provide a safer internet?

How can eID provide a safer internet? Watch this video to find out.

Electronic identification (eID) has the potential to provide a safer internet by helping tackle online disinformation and protect minors and children from accessing websites that have age restrictions in place.





Call for contributions: feedback on ML proposals for Europeana Collections and data contributions for CEF eTranslation


Access to digital cultural heritage across languages is a big priority for the Europeana community. A recent EU Presidency meeting in Helsinki highlighted again why this is crucial. Participants at this meeting also discussed the opportunities brought by new technology such as Europeana's machine translation service, which reuses the Connecting Europe Facility (CEF) eTranslation Building Block, as well as the challenges that Europeana and the wider cultural heritage sector have to tackle on the way to building multilingual systems that can benefit our users and stakeholders.

Following this meeting, the Europeana team invite everyone in the Europeana community (and beyond) to contribute in two ways:

  • First, in order to prepare for the meeting we have put together a technical discussion paper that lays down several proposals for improving the multilingual aspects of the Europeana Collections portal. The team is happy to invite interested stakeholders to send their feedback on these proposals, either by sending a message or commenting on the document directly.
  • Second, the European Commission welcomes data contributions for the CEF eTranslation service. The quality of this service based on Artificial Intelligence technology is improved when trained with suitable data. Currently, cultural heritage is under-represented in the training resources that the eTranslation-related initiatives have mustered. The EC would like to change that, and you can help by contributing your own data into their training pool!

Any dataset is welcome, though multilingual data are of course highly prized. This can be done directly via the ELRC-SHARE platform, but if you have questions you can also contact Europeana directly.




Europe Meets eArchiving



On 3 - 4 December 2019, eArchiving specialists and enthusiasts attended the workshop "Meet eArchiving" in Brussels, Belgium. Visit the workshop's webpage to see the agenda, the speakers, the presentations and (soon) the recordings.

Meet eArchiving provided an opportunity for those interested in data archiving and preservation to learn how to apply eArchiving standards and specifications beyond National Archives. This workshop looked specifically at the Connecting Europe Facility (CEF) eArchiving Building Block.

 
Over 120 people joined the two-day workshop, including 21 speakers, along with 538 participants via the live stream from places like Australia, Canada, the U.S., Singapore and Georgia.

The CEF eArchiving Building Block provides long-term information assurance by providing core specifications, reference software, training and service desk support for digital archiving, including digital preservation. The eArchiving Building Block is based on the outcomes of the E-ARK project (2014 - 2017). After the successful conclusion of the E-ARK project in 2017, the European Commission included its outcomes into the CEF programme to become the basis of the eArchiving Building Block. 

Through the CEF programme, the European Commission promotes the reuse of strategic digital infrastructure, known as Building Blocks, that enables key capabilities across borders, such as electronic archiving, user authentication or electronic procurement.

During this workshop, participants reiterated their support for the adoption of eArchiving under the CEF programme, which took place in 2018. Participants noted that the adoption of this Building Block marked a milestone in driving forward eArchiving across the EU Member States, promoting the sustainable long-term management of data within organisations and fostering a single European data space, underpinning a Digital Single Market in Europe.

Participants provided detailed feedback in breakout sessions

Participants also remarked on how the standards promoted through the eArchiving Building Block are being used in National Archives and other public institutions. This is a significant boost for all kinds of Data Producers, as it means they do not have to develop their own standards.

However, during the workshop participants informed representatives of the Commission that they felt there is a need for EU-level legislation on eArchiving. Several participants also noted that SMEs are not aware of the potential benefits of developing software products based on the eArchiving standard.

If you are interested in using eArchiving for a project of your own, we would be happy to help you get started.

The CEF programme supports a number of Building Blocks: Big Data Test Infrastructure, Context Broker, eArchiving, eDelivery, eID, eInvoicing, eSignature and eTranslation. A European blockchain infrastructure, the European Blockchain Services Infrastructure (EBSI), will soon become a fully operational Building Block, and the Once Only Principle (OOP) is a preparatory action under CEF.












Saint-Quentin uses CEF tool to address stakeholders’ concerns about the use of water in the city’s green spaces

CEF Context Broker is an indispensable tool for medium-sized cities to develop new services on a budget for high impact and end-user satisfaction.

 


Quick facts

  • Project: Smart watering system
  • Location: Saint-Quentin, France, with 56,000 inhabitants
  • Challenge: How to optimise irrigation with the right amount of water, at the right time and place
  • Solution: Data-based smart watering system for the city’s green spaces
  • CEF Building Block used: Context Broker

Saving water based on (big) data

Saint-Quentin, a medium-sized city in northern France, is taking a fresh approach to improving its public services by combining Internet of Things (IoT) technologies, open data and expertise from local SMEs. The objectives are to find new ways of achieving the city’s goals in sustainable social development and in increasing accountability to citizens regarding their concerns, such as the environment and the conservation of water.

Hence, the first co-created public service was a smart watering system to address stakeholder concerns and the city’s green spaces department’s need to modernise and optimise its operations. As in many cities, Saint-Quentin’s irrigation methods were based on intuition and procedures, rather than data on soil humidity, water penetration or temperature. The resulting smart solution, successfully tested at the Philippe Roth sports field in Saint-Quentin, was made up of an ecosystem of various technologies, such as sensors, robot lawnmowers and sprinklers. The European Commission's Context Broker building block integrated all devices and their data with its data consolidation and analysis capabilities.

Context Broker helped Stéphane Siméon, responsible for Saint-Quentin’s green spaces, to optimise daily operations, reduce water consumption and minimise manual intervention by staff, all the while improving the customer experience provided to local sports associations.

Facing the challenge

Alexandre Chaffotte, Innovation Manager of the city of Saint-Quentin, says: “The data is mostly available to improve the watering process, but the main challenge is to aggregate and analyse the data as they all come in different standards and often lack relevant context information, for example, when and where the data was recorded.” He continues, “we have to make sure that the solution is interoperable with the already existing infrastructure (sprinklers and robot lawnmowers), and that the project evaluates existing standard data models and open communication protocols (APIs). This is crucial to guarantee interoperability for the sustainable deployment of such a solution.”

Context Broker in the IoT Booster ecosystem

The solution was provided by Faubourg Numérique, a local SME that operates as a business facilitator and the provider of an innovative interoperability platform called the IoT Booster. With the IoT Booster it is easy to connect to and manage the heterogeneous IoT devices installed and used on the field, such as the existing sprinklers and lawnmowers, as well as new soil sensors and valve controllers. It makes it possible to consolidate and analyse data from sensors, and to take action with sprinklers and lawnmowers.

Faubourg Numérique developed the IoT Booster to tackle the interoperability and scalability challenges brought by the big data nature of the project. The cornerstone of the solution is the CEF Context Broker building block, which provides the data consolidation and analysis capabilities needed to offer services based on open data mashups. The devices on the field are connected to Context Broker via the FIWARE IoT Agent, which features a lightweight text-based protocol specifically designed to enable communication between Context Broker and IoT devices. Context Broker also provides a single standard API (NGSI-LD) for smart applications to access the consolidated data for decision making.
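To make the NGSI-LD interface more concrete, the sketch below builds the kind of request an application could send to a Context Broker to register a soil-moisture reading. The entity layout (id, type, Property with value and observedAt, @context) and the /ngsi-ld/v1/entities endpoint follow the NGSI-LD specification; the broker address, the "SoilSensor" type and the "soilMoisture" attribute are assumptions made for illustration and are not taken from the Saint-Quentin deployment. The request is only constructed, not sent.

```python
import json
import urllib.request

# Hypothetical broker address; a real deployment would use its own host.
BROKER_URL = "http://localhost:1026"
# NGSI-LD core context published by ETSI.
CORE_CONTEXT = "https://uri.etsi.org/ngsi-ld/v1/ngsi-ld-core-context.jsonld"


def build_entity(sensor_id, soil_moisture, observed_at):
    """Describe one sensor reading as an NGSI-LD entity (illustrative model)."""
    return {
        "id": f"urn:ngsi-ld:SoilSensor:{sensor_id}",
        "type": "SoilSensor",
        "soilMoisture": {
            "type": "Property",
            "value": soil_moisture,
            "observedAt": observed_at,  # ISO 8601 timestamp of the measurement
        },
        "@context": [CORE_CONTEXT],
    }


def build_create_request(entity):
    """Prepare (but do not send) the HTTP request that creates the entity."""
    return urllib.request.Request(
        url=f"{BROKER_URL}/ngsi-ld/v1/entities",
        data=json.dumps(entity).encode("utf-8"),
        headers={"Content-Type": "application/ld+json"},
        method="POST",
    )


if __name__ == "__main__":
    entity = build_entity("field-001", 0.21, "2019-12-01T08:00:00Z")
    request = build_create_request(entity)
    print(request.get_method(), request.full_url)
    print(json.dumps(entity, indent=2))
```

Passing the prepared request to urllib.request.urlopen would submit it to a running broker. Because every device's data ends up in this one format, applications can query a single API instead of each device's own protocol.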

High-level solution architecture with CEF Context Broker.


Faubourg Numérique also integrated the OASC Minimum Interoperability Mechanisms around the CEF Context Broker and the FIWARE IoT back-end architecture. An important part of this technical task was to select, integrate and harmonise existing data models provided by organisations, such as FIWARE DM and schema.org, in order to virtualise gathered data and its dynamic relationships.

The CEF Context Broker is fully compatible with the complementary solutions provided by FIWARE, as it is also based on FIWARE’s work. In fact, FIWARE’s Orion Context Broker is the core software component of CEF Context Broker and the reference implementation of the NGSI specifications, which are supported by the European Commission. Furthermore, Faubourg Numérique operates the local FIWARE iHub, providing solutions and support based on FIWARE technologies.


More efficiency on a smaller budget

For the IoT Booster, Context Broker provides interoperability and open access to the local public service ecosystem. This allows cities to develop and integrate new, complex services faster and in an interactive way involving a variety of stakeholders, such as municipal staff, city representatives, local businesses and citizens. The Context Broker is the perfect tool for Saint-Quentin and other medium-sized cities to develop services on a small public budget for high impact on city services and end-users’ satisfaction.

The IoT Booster smart application uses consolidated data from Context Broker to provide information for decision making. The user interface specifies which sections of the field need watering and even calculates the exact amount of water required. This ensures the right amount of water, in the right place, at the right time. To maximise the benefits of automation, the watering schedule was orchestrated with robot lawnmowers, as well as upcoming sports events and team practices.
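How such a calculation might look can be sketched as follows. The article does not describe the formula the IoT Booster uses, so this example assumes a simple soil-moisture deficit model: the water needed is the gap between the current and the target volumetric moisture, applied over the root depth and the area of the section (a 1 mm layer of water over 1 m² equals 1 litre). All names and numbers are hypothetical.

```python
def water_needed_litres(area_m2, root_depth_mm, current_moisture, target_moisture):
    """Litres needed to raise the soil moisture of one section to the target.

    Moisture values are volume fractions between 0 and 1. A section that is
    already at or above the target needs no water.
    """
    deficit = max(0.0, target_moisture - current_moisture)
    return area_m2 * root_depth_mm * deficit


def watering_plan(sections, root_depth_mm, target_moisture):
    """Map each section that needs watering to its required volume in litres."""
    plan = {}
    for name, (area_m2, current_moisture) in sections.items():
        litres = water_needed_litres(area_m2, root_depth_mm, current_moisture, target_moisture)
        if litres > 0:
            plan[name] = litres
    return plan


if __name__ == "__main__":
    # Hypothetical readings: section name -> (area in m2, measured moisture).
    readings = {"north goal": (100.0, 0.20), "centre circle": (80.0, 0.32)}
    print(watering_plan(readings, root_depth_mm=150.0, target_moisture=0.30))
```

In a full system, the resulting plan would then be scheduled around the robot lawnmowers and upcoming sports events, as described above.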

IoT Booster app user interface for the city’s green spaces service team.


The pilot at the Philippe Roth sports field successfully proved the value of Context Broker in achieving interoperability between IoT devices for the purpose of building holistic solution ecosystems and data-based services. Saint-Quentin plans to continue the project to achieve full interconnection and integration of devices for the autonomous maintenance of green spaces, with the following expected benefits:

  • Operational efficiency with the time needed to manage watering cut in half
  • Remote control of operations
  • Reducing water consumption by an estimated 30%
  • Visibility and support in decision making
  • Early detection of problems, such as leaks

Furthermore, based on the Philippe Roth pilot, Saint-Quentin now knows the requirements to create other similar solutions based on solid and sustainable foundations.


Benefits for other cities

For Saint-Quentin this is just the beginning of a new era. In the coming months, the learnings and technical deliverables from the watering system will be globally shared by the city of Saint-Quentin and Faubourg Numérique for the benefit of other medium-sized cities.

If you are interested in using the Context Broker for a project of your own, we would be happy to help you. The documentation and support services provided by CEF are described on our website and available to all. Visit us at Context Broker to learn more.