DGT-Translation Memory

 

Introduction

Since November 2007, the European Commission's Directorate-General for Translation (DGT) has made its multilingual Translation Memory for the Acquis Communautaire, DGT-TM, publicly accessible in order to further the Commission's general effort to support multilingualism, language diversity and the re-use of Commission information.

This page, which is meant for technical users, provides a description of this unique linguistic resource as well as instructions on where to download it and how to produce bilingual aligned corpora for any of the 276 language pairs or 552 language pair directions. Here is an example of one sentence translated into 22 languages.
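The pair counts quoted above follow directly from the number of languages. As a quick check, assuming the 24 languages mentioned on this page, the short sketch below counts unordered pairs and ordered pair directions (the function name `pair_counts` is only illustrative):

```python
from math import comb, perm


def pair_counts(n_languages):
    """Return (unordered language pairs, ordered pair directions)."""
    # Unordered pairs: n * (n - 1) / 2; directions: n * (n - 1).
    return comb(n_languages, 2), perm(n_languages, 2)


pairs, directions = pair_counts(24)
print(pairs, directions)  # 276 552
```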

 

DGT's Translation Memory

This extraction of aligned sentences can be used to produce a parallel multilingual corpus of the European Union’s legislative documents (Acquis Communautaire) in 24 EU languages. The aligned translation units have been provided by the Directorate-General for Translation of the European Commission by extraction from one of its large shared translation memories in EURAMIS (European advanced multilingual information system). This memory contains most, although not all, of the documents which make up the Acquis Communautaire, as well as some other documents which are not part of the Acquis.

Description of the Data - Pre-processing

Before the documents were aligned, the source material was pre-processed to reduce the number of entries of low value to translators (very short sentences, very long sentences, obvious mismatches, etc.). As a result, the content of the documents may differ from the originals. The documents were aligned in accordance with the segmentation rules used in the Directorate-General for Translation of the European Commission. The extraction keeps only the EUR-Lex document number (NumDoc), from which other information (e.g. year and document type) can be derived. For further information on the NumDoc structure, see the information provided by EUR-Lex.
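As an illustration of how year and document type might be derived from a document number, here is a minimal sketch. It assumes that the number consists of one sector digit, a four-digit year, a one- or two-letter document type and a four-digit sequence number, optionally followed by a suffix. That layout is an assumption made for this example and should be checked against the EUR-Lex documentation. The function name `parse_numdoc` and the sample number are hypothetical.

```python
import re

# ASSUMPTION: layout is <sector digit><4-digit year><1-2 letter type>
# <4-digit number><optional suffix>. Verify against EUR-Lex before relying on it.
NUMDOC_PATTERN = re.compile(
    r"^(?P<sector>\d)(?P<year>\d{4})(?P<doc_type>[A-Z]{1,2})"
    r"(?P<number>\d{4})(?P<suffix>.*)$"
)


def parse_numdoc(numdoc):
    """Split a document number into its assumed parts, or return None."""
    match = NUMDOC_PATTERN.match(numdoc.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["year"] = int(fields["year"])
    return fields


# Hypothetical document number, used only to show the mechanics.
print(parse_numdoc("32006R1234"))
```

Numbers that do not fit the assumed layout return None rather than a guess, so they can be inspected by hand.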

Statistics for the DGT Translation Memory

The DGT Translation Memory is currently available in 24 languages. For statistics on the total number of translation units, words and characters available for each language, you can download the file DGT-TM_Statistics.pdf.

For the number of aligned translation units for each language pair and further statistics regarding the release DGT-TM-2011, see the DGT-TM reference publication. For the later releases, statistics files are included in the first zip file of each release.

Conditions for Use

I. Intellectual property and conditions of use of databases

The DGT-TM database is the exclusive property of the European Commission. The Commission cedes its non-exclusive rights free of charge and world-wide for the entire duration of the protection of those rights to the re-user, for all kinds of use which comply with the conditions laid down in the Commission Decision of 12 December 2011 on the re-use of Commission documents, published in Official Journal of the European Union L330 of 14 December 2011, pages 39 to 42.

Any re-use of the database or of the structured elements contained in it is required to be identified by the re-user, who is under an obligation to state the source of the documents used: the website address, the date of the latest update and the fact that the European Commission retains ownership of the data.

II. Conditions for use of software

The DGT-TM database is distributed with the software necessary for its exploitation/extraction. Use of such software must be carried out in accordance with the conditions laid down in the EUPL licence.

III. Responsibility

The database and the accompanying software are made available without any guarantee, explicit or tacit. The Commission cannot be held responsible for any loss, injury or damage the re-user may suffer due to the re-use. The Commission does not, however, guarantee the absence of any irregularities which may be present in the databases, within the structured data they contain or the software itself. The Commission does not guarantee the on-going distribution of said databases and software.

The Commission cannot be held responsible for any loss, injury or damage caused to third parties as a result of the re-use. The re-user shall bear sole responsibility for the re-use of the data collection, the structured elements it contains and the software. Re-use must not mislead third parties in respect of the contents of the database and the structured elements it contains, the source of the contents or the date of the last update thereto.

This disclaimer is not intended to limit the liability of the Commission in violation of any requirements laid down in applicable national law or to exclude its liability in cases where this is not permitted by the applicable law.

IV. Definitions

Definitions of terms used by the Commission Decision of 12 December 2011 on the re-use of Commission documents, published in Official Journal of the European Union L330 of 14 December 2011, pages 39 to 42, are supplemented by the following definitions:

Re-user: Any natural or legal person who re-uses the documents, in accordance with the conditions laid down in the Commission Decision of 12 December 2011 on the re-use of Commission documents, published in Official Journal of the European Union L330 of 14 December 2011, pages 39 to 42.

Databases: A collection of independent works, data or other materials arranged in a systematic or methodical way and individually accessible by electronic means or in any other way.

 

Difference between the DGT Translation Memory and the other resources available here

Some of the multilingual parallel resources available via the JRC's Language Technology resources page are clearly distinct, while others are similar or overlap. In particular, DGT-TM (first released in 2007) and the JRC-Acquis (released in 2006) overlap to a large extent, as both are mostly based on the Acquis Communautaire of the European Union. Commonalities and differences between the various text resources, as well as background information such as the motivation for their release and information on usage conditions, are summarised in the journal article 'An overview of the European Union’s highly multilingual parallel corpora', published in 2014.

 

Download the DGT Translation Memory

The distribution consists of a collection of zip files (see below), each no larger than 100 MB. Each zip file contains TMX files, identified by the EUR-Lex number of the underlying Acquis Communautaire documents, and a file list in plain text (.txt) specifying the languages in which the documents are available.
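To get a quick overview of a downloaded archive, the TMX members of a zip file can be listed with Python's standard `zipfile` module. This is only a sketch: the helper names and the member names in the demo archive are made up for illustration and do not reflect the actual file names in the distribution.

```python
import io
import zipfile


def list_tmx_members(zip_source):
    """Return the sorted names of all .tmx members in a zip archive."""
    with zipfile.ZipFile(zip_source) as archive:
        return sorted(
            name for name in archive.namelist()
            if name.lower().endswith(".tmx")
        )


def build_demo_archive():
    """Build a tiny in-memory zip with made-up member names for the demo."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("32006R1234.tmx", "<tmx/>")       # hypothetical name
        archive.writestr("22006A0412.tmx", "<tmx/>")       # hypothetical name
        archive.writestr("file_list.txt", "placeholder")   # hypothetical name
    buffer.seek(0)
    return buffer


print(list_tmx_members(build_demo_archive()))
```

`list_tmx_members` also accepts a path to a zip file on disk, since `zipfile.ZipFile` takes either a path or a file-like object.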

 

 

 

How to produce bilingual extractions

The multilingual extraction has English as the source language. Users can extract any language pair as follows, using the extraction tool TMXtract:
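Independently of TMXtract, the general idea of a bilingual extraction can be sketched in a few lines of Python. The sketch below is not TMXtract and does not reproduce its behaviour. It assumes the files follow the usual TMX layout (`tu` elements containing `tuv` elements with an `xml:lang` attribute and a `seg` child) and that languages can be matched on the primary part of the language code. The sample data, including the codes `EN-GB`, `FR-FR` and `DE-DE`, is made up for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def extract_pairs(tmx_bytes, src_lang, tgt_lang):
    """Return (source, target) segment pairs for two language codes."""
    src_lang, tgt_lang = src_lang.upper(), tgt_lang.upper()
    pairs = []
    for tu in ET.fromstring(tmx_bytes).iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            code = tuv.get(XML_LANG) or tuv.get("lang") or ""
            seg = tuv.find("seg")
            if seg is not None:
                # ASSUMPTION: match on the primary subtag, e.g. "EN" of "EN-GB".
                primary = code.split("-")[0].upper()
                segments[primary] = "".join(seg.itertext()).strip()
        # Keep the unit only if both languages are present and non-empty.
        if segments.get(src_lang) and segments.get(tgt_lang):
            pairs.append((segments[src_lang], segments[tgt_lang]))
    return pairs


# Made-up sample in TMX layout, used only to exercise the function.
SAMPLE = b"""<tmx version="1.4">
  <header/>
  <body>
    <tu>
      <tuv xml:lang="EN-GB"><seg>Article 1</seg></tuv>
      <tuv xml:lang="FR-FR"><seg>Article premier</seg></tuv>
      <tuv xml:lang="DE-DE"><seg>Artikel 1</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="EN-GB"><seg>Only English here</seg></tuv>
    </tu>
  </body>
</tmx>"""

print(extract_pairs(SAMPLE, "fr", "de"))
```

Because every translation unit carries all available languages, a pair that does not involve English (here French and German) can be extracted in the same way as one that does. Units lacking either language are skipped.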

 

More details / Reference publication

For a more detailed description of the DGT-TM, including more statistics on the resource, see the following publication. When making reference to DGT-TM in scientific publications, please refer to:

For a contrastive overview of DGT-TM and the other multilingual text resources offered for download on this site, you can read the following journal article:

DGT-TM has been registered under the International Standard Language Resource Number (ISLRN) 710-653-952-884-4.

 

Acknowledgement and Contact

For more information, you can contact the following persons:

 

Directorate-General for Translation (DGT)
        Patrick Schlüter (Email address: Patrick.Schluter@ec.europa.eu)
        Unit DGT.R.3 Informatics
        Jean-Monnet Building A2/137
        L-2920 Luxembourg
        More information on DGT

 

Joint Research Centre (JRC)
        Ralf Steinberger (Email address format: Firstname.Lastname@jrc.ec.europa.eu)
        IPSC - GlobeSec - OPTIMA
        Via E. Fermi 2749, T.P. 267
        I-21027 Ispra (VA)