DGT-Translation Memory

Introduction
DGT's Translation Memory
Description of the Data - Pre-processing
Statistics for the DGT Translation Memory
Conditions for Use
Difference between the DGT Translation Memory and the other resources available here
Download the DGT Translation Memory
How to produce bilingual extractions
More details / Reference publication
Acknowledgement and Contact
ISLRN: 710-653-952-884-4.

Introduction

Since November 2007 the European Commission's Directorate-General for Translation has made its multilingual Translation Memory for the Acquis Communautaire, DGT-TM, publicly accessible in order to foster the European Commission’s general effort to support multilingualism, language diversity and the re-use of Commission information.

This page, which is meant for technical users, provides a description of this unique linguistic resource as well as instructions on where to download it and how to produce bilingual aligned corpora for any of the 276 language pairs or 552 language pair directions. Here is an example of one sentence translated into 22 languages.


The Acquis Communautaire is the entire body of European legislation, comprising all the treaties, regulations and directives adopted by the European Union (EU). Since each new country joining the EU is required to accept the whole Acquis Communautaire, this body of legislation has been translated into 24 official languages. As a result, the Acquis now exists as parallel texts in the following 24 languages: Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, German, Greek, Finnish, French, Irish, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish and Swedish. For Irish, there is very little data since the Acquis is not translated on a regular basis. There is also less Croatian data because Croatia only joined the EU in 2013.
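The figures of 276 language pairs and 552 language pair directions follow directly from the 24 languages listed above. The short Python sketch below only reproduces that arithmetic; the variable names are illustrative and not part of any DGT tool.

```python
from itertools import combinations, permutations

# The 24 languages in which the Acquis is available, as listed above.
LANGUAGES = [
    "Bulgarian", "Croatian", "Czech", "Danish", "Dutch", "English",
    "Estonian", "German", "Greek", "Finnish", "French", "Irish",
    "Hungarian", "Italian", "Latvian", "Lithuanian", "Maltese", "Polish",
    "Portuguese", "Romanian", "Slovak", "Slovenian", "Spanish", "Swedish",
]

# Unordered pairs, e.g. (Maltese, Estonian) counted once.
pairs = list(combinations(LANGUAGES, 2))

# Ordered (source, target) directions, e.g. Maltese->Estonian and Estonian->Maltese.
directions = list(permutations(LANGUAGES, 2))

print(len(LANGUAGES), len(pairs), len(directions))
```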

Parallel texts are texts and their manually produced translations. They are also referred to as bi-texts. A translation memory is a collection of small text segments and their translations (referred to as translation units, TU). These TUs can be sentences or parts of sentences. Translation memories are used to support translators by ensuring that pieces of text that have already been translated do not need to be translated again.

Both translation memories and parallel texts are important linguistic resources that can be used for a variety of purposes, including:

training automatic systems for statistical machine translation (SMT);
producing monolingual or multilingual lexical and semantic resources such as dictionaries and ontologies;
training and testing multilingual information extraction software;
checking translation consistency automatically;
testing and benchmarking alignment software (for sentences, words, etc.).

The value of a parallel corpus grows with its size and with the number of languages for which translations exist. While parallel corpora for some languages are abundant, there are few or no parallel corpora for most language pairs. To our knowledge, the Acquis Communautaire is the biggest parallel corpus in existence, taking into consideration both its size and the number of languages covered. The most outstanding advantage of the Acquis Communautaire - apart from it being freely available - is the number of rare language pairs (e.g. Maltese-Estonian, Slovenian-Finnish, etc.).

The first version of DGT-TM was released in 2007 and included documents published up to the year 2006. In April 2012, DGT-TM-2011 was released, which contains data from 2007 until 2010. Since then, data have been released annually (e.g. the 2011 data were released in 2012 under the name DGT-TM-2012). While the alignments between TUs and their translations were verified manually for DGT-TM-2007, the TUs since DGT-TM-2011 were aligned automatically. The data format is the same for all releases.

DGT's Translation Memory

This extraction of aligned sentences can be used to produce a parallel multilingual corpus of the European Union’s legislative documents (Acquis Communautaire) in 24 EU languages. The aligned translation units have been provided by the Directorate-General for Translation of the European Commission by extraction from one of its large shared translation memories in EURAMIS (European advanced multilingual information system). This memory contains most, although not all, of the documents which make up the Acquis Communautaire, as well as some other documents which are not part of the Acquis.


In order to reduce the size, the extraction uses English as the source language. The sequence in the extracted files is not necessarily the same as in the underlying documents, and redundancies of text segments like "Article 1" are inevitable. The documents are in the widely used Translation Memory eXchange (TMX) format. In order to be backwards compatible, the header mentions TMX format 1.1, but the files are also compliant with TMX 1.4b. The texts are encoded in UTF-16 Little Endian. The source language of the documents and sentences is not known, but many of the documents were originally written in English and then translated into the other languages.
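For readers who want to inspect the TMX files without the extraction tool, the following is a minimal Python sketch of reading translation units from UTF-16 encoded TMX data. It makes several assumptions that are not confirmed by this page: that the data starts with a byte-order mark, that each translation unit is a `tu` element containing `tuv` elements with a `lang` (or `xml:lang`) attribute and a `seg` child, as in generic TMX, and that language codes look like `EN-GB`. The sample content is invented for illustration.

```python
import codecs
import xml.etree.ElementTree as ET

# Attribute key under which ElementTree exposes xml:lang (used by TMX 1.4b).
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_translation_units(raw):
    """Yield one {language code: segment text} dict per <tu> element."""
    # Assumption: the bytes carry a UTF-16 byte-order mark.
    root = ET.fromstring(raw.decode("utf-16"))
    for tu in root.iter("tu"):
        unit = {}
        for tuv in tu.iter("tuv"):
            lang = tuv.get("lang") or tuv.get(XML_LANG)
            seg = tuv.find("seg")
            if lang and seg is not None:
                unit[lang] = "".join(seg.itertext())
        yield unit


# Hypothetical two-language sample, encoded as UTF-16 Little Endian with a BOM.
SAMPLE = (
    '<tmx version="1.1"><header/><body>'
    '<tu><tuv lang="EN-GB"><seg>Article 1</seg></tuv>'
    '<tuv lang="FR-FR"><seg>Article premier</seg></tuv></tu>'
    '</body></tmx>'
)
raw_sample = codecs.BOM_UTF16_LE + SAMPLE.encode("utf-16-le")

units = list(read_translation_units(raw_sample))
print(units)
```

Because English is the source language of the extraction, a bilingual subset for a non-English pair can in principle be obtained by keeping the units in which both target languages are present.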

Description of the Data - Pre-processing

Before the documents were aligned, the source material was pre-processed to reduce the number of entries of low value for the translators (short sentences, long sentences, obvious mismatches, etc.). This means that the contents of the documents might have changed. The documents were aligned in accordance with the segmentation rules used in the Directorate-General for Translation of the European Commission. The extraction keeps only the EUR-Lex document number (NumDoc), from which other information (e.g. year and document type) can be derived. For further information on the NumDoc structure, see the information provided by EUR-Lex.
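As an illustration of how year and document type might be derived from a document number, here is a hedged Python sketch. It assumes a CELEX-style layout (one sector character, a four-digit year, one or two type letters, then the remainder); this layout, the pattern and the example number `32006R1234` are assumptions made for the example, so the EUR-Lex documentation remains the authority on the actual structure.

```python
import re

# Assumed layout: sector (1 char), year (4 digits), type (1-2 letters), rest.
NUMDOC_PATTERN = re.compile(
    r"^(?P<sector>[0-9A-Z])(?P<year>\d{4})(?P<doc_type>[A-Z]{1,2})(?P<rest>.+)$"
)


def parse_numdoc(numdoc):
    """Split a document number into its assumed parts, or return None."""
    match = NUMDOC_PATTERN.match(numdoc.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["year"] = int(fields["year"])
    return fields


# Hypothetical document number, used only to exercise the pattern.
parsed = parse_numdoc("32006R1234")
print(parsed)
```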

The corpus is also available as a parsebank, i.e. it has been automatically annotated for part-of-speech, morphosyntax, lemma, and dependency annotations with UD-PIPE. The DGT-UD parsebank can be downloaded from the CLARIN.SI repository under http://hdl.handle.net/11356/1197, where you also find links to this corpus installed under two concordancers.

Statistics for the DGT Translation Memory

The DGT Translation Memory is currently available in 24 languages. For statistics on the total number of translation units, words and characters available for each language, you can download the file DGT-TM_Statistics.pdf.

For the number of aligned translation units for each language pair and further statistics regarding the release DGT-TM-2011, see the DGT-TM reference publication. For the later releases, statistics files are included in the first zip file of each release.

Conditions for Use

I. Intellectual property and conditions of use of databases

The DGT-TM database is the exclusive property of the European Commission. The Commission cedes its non-exclusive rights free of charge and world-wide for the entire duration of the protection of those rights to the re-user, for all kinds of use which comply with the conditions laid down in the Commission Decision of 12 December 2011 on the re-use of Commission documents, published in Official Journal of the European Union L330 of 14 December 2011, pages 39 to 42.

Any re-use of the database or of the structured elements contained in it is required to be identified by the re-user, who is under an obligation to state the source of the documents used: the website address, the date of the latest update and the fact that the European Commission retains ownership of the data.

II. Conditions for use of software

The DGT-TM database is distributed with the software necessary for its exploitation/extraction. Use of such software must be carried out in accordance with the conditions laid down in the EUPL licence.

III. Responsibility

The database and the accompanying software are made available, without any guarantee, explicit or tacit. The Commission cannot be held responsible for any loss, injury or damage the re-user may suffer due to the re-use. The Commission does not however guarantee the absence of any irregularities which may be present in the databases, within the structured data they contain or the software itself. The Commission does not guarantee the on-going distribution of said databases and software.

The Commission cannot be held responsible for any loss, injury or damage caused to third parties as a result of the re-use. The re-user shall bear sole responsibility for the re-use of the data collection, the structured elements it contains and the software. Re-use must not mislead third parties in respect of the contents of the database and the structured elements it contains, the source of the contents or the date of the last update thereto.

This disclaimer is not intended to limit the liability of the Commission in violation of any requirements laid down in applicable national law or to exclude its liability in cases where this is not permitted by the applicable law.

IV. Definitions

Definitions of terms used by the Commission Decision of 12 December 2011 on the re-use of Commission documents, published in Official Journal of the European Union L330 of 14 December 2011, pages 39 to 42, are supplemented by the following definitions:

Re-user: Any natural or legal person who re-uses the documents, in accordance with the conditions laid down in the Commission Decision of 12 December 2011 on the re-use of Commission documents, published in Official Journal of the European Union L330 of 14 December 2011, pages 39 to 42.

Databases: A collection of independent works, data or other materials arranged in a systematic or methodical way and individually accessible by electronic means or in any other way.

Difference between the DGT Translation Memory and the other resources available here

Some of the multilingual parallel resources available via the JRC's Language Technology resources page are clearly distinct, but others are similar or overlap. In particular, DGT-TM (first released in 2007) and the JRC-Acquis (released in 2006) overlap to a large extent, as both are mostly based on the Acquis Communautaire of the European Union. Commonalities and differences between the various text resources, as well as background information such as the motivation for their release and information on usage conditions, are summarised in the journal article 'An overview of the European Union’s highly multilingual parallel corpora', published in 2014.

Download the DGT Translation Memory

The distribution consists of a collection of zip files (see below), each not larger than 100 MB. Each zip file contains tmx-files identified by the EUR-Lex number of the underlying Acquis Communautaire documents and a file list in txt specifying the languages in which the documents are available.


There is no need to unzip the files as the extraction program will access the data in the zip files directly. The texts for the different languages are spread over the various zip files so that you will need to download all files if you want the full parallel corpus. Downloading only a subset of the zip files is possible, but it will result in producing only a subset of the parallel corpus.
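The extraction program reads the zip files directly. For users who want to inspect an archive themselves, the same idea can be sketched with Python's standard `zipfile` module. The member names below (`example_doc.tmx`, `file_list.txt`) are invented for the demonstration and do not reflect the real naming inside the DGT-TM volumes.

```python
import io
import zipfile


def list_tmx_members(zip_source):
    """Return the names of all .tmx members without unpacking the archive."""
    with zipfile.ZipFile(zip_source) as archive:
        return sorted(
            name for name in archive.namelist() if name.lower().endswith(".tmx")
        )


def read_member(zip_source, member_name):
    """Return the raw bytes of one member, read straight from the archive."""
    with zipfile.ZipFile(zip_source) as archive:
        return archive.read(member_name)


# Build a tiny in-memory archive so that the sketch runs without any download.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("example_doc.tmx", "<tmx/>".encode("utf-16"))
    archive.writestr("file_list.txt", "example_doc.tmx")

members = list_tmx_members(buffer)
content = read_member(buffer, members[0]).decode("utf-16")
print(members, content)
```

With a downloaded volume, `zip_source` would simply be the path to that zip file.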

You also need to download the extraction program and copy it into a suitable directory on your computer. The program is distributed as a Java jar file. Under the Windows operating system, it can run with a graphical user interface. On any operating system supporting the Java runtime version 1.5 or newer, it runs as a machine-independent command-line program.

You can download the files by clicking on the links below.

DGT-TM Version 1 (Released in 2007)	Size
Volume_1.zip	98 MB
Volume_2.zip	98 MB
Volume_3.zip	98 MB
Volume_4.zip	98 MB
Volume_5.zip	98 MB
Volume_6.zip	98 MB
Volume_7.zip	98 MB
Volume_8.zip	98 MB
Volume_9.zip	98 MB
Volume_10.zip	98 MB
Volume_11.zip	98 MB
Volume_12.zip	72 MB
Total size	1.08 GB

DGT-TM-release 2011	Size
Vol_2004_1.zip	98 MB
Vol_2004_2.zip	46 MB
Vol_2005_1.zip	98 MB
Vol_2005_2.zip	98 MB
Vol_2005_3.zip	91 MB
Vol_2006_1.zip	97 MB
Vol_2006_2.zip	97 MB
Vol_2006_3.zip	98 MB
Vol_2006_4.zip	98 MB
Vol_2006_5.zip	12 MB
Vol_2007_1.zip	98 MB
Vol_2007_2.zip	97 MB
Vol_2007_3.zip	74 MB
Vol_2008_1.zip	97 MB
Vol_2008_2.zip	98 MB
Vol_2008_3.zip	96 MB
Vol_2008_4.zip	73 MB
Vol_2009_1.zip	98 MB
Vol_2009_2.zip	98 MB
Vol_2009_3.zip	98 MB
Vol_2009_4.zip	12 MB
Vol_2010_1.zip	98 MB
Vol_2010_2.zip	92 MB
Vol_2010_3.zip	96 MB
Vol_2010_4.zip	12 MB
Total size	1.96 GB

DGT-TM-release 2012	Size
Vol_2011_1.zip	98 MB
Vol_2011_2.zip	98 MB
Vol_2011_3.zip	98 MB
Vol_2011_4.zip	61 MB
Total size	354 MB

DGT-TM-release 2013	Size
Vol_2012_1.zip	96 MB
Vol_2012_2.zip	95 MB
Vol_2012_3.zip	96 MB
Vol_2012_4.zip	96 MB
Vol_2012_5.zip	96 MB
Vol_2012_6.zip	89 MB
Total size	568 MB

DGT-TM-release 2014	Size
Vol_2013_1.zip	102 MB
Vol_2013_2.zip	102 MB
Vol_2013_3.zip	102 MB
Vol_2013_4.zip	102 MB
Vol_2013_5.zip	102 MB
Vol_2013_6.zip	102 MB
Vol_2013_7.zip	60 MB
Total size	657 MB

DGT-TM-release 2015	Size
Vol_2014_1.zip	100 MB
Vol_2014_2.zip	100 MB
Vol_2014_3.zip	83 MB
Total size	283 MB

DGT-TM-release 2016	Size
Vol_2015_1.zip	102 MB
Vol_2015_2.zip	102 MB
Vol_2015_3.zip	102 MB
Vol_2015_4.zip	102 MB
Vol_2015_5.zip	102 MB
Vol_2015_6.zip	99 MB
Vol_2015_7.zip	32 MB
Total size	642 MB

DGT-TM-release 2017	Size
Vol_2016_1.zip	102 MB
Vol_2016_2.zip	102 MB
Vol_2016_3.zip	102 MB
Vol_2016_4.zip	99 MB
Vol_2016_5.zip	102 MB
Vol_2016_6.zip	97 MB
Vol_2016_7.zip	99 MB
Vol_2016_8.zip	102 MB
Vol_2016_9.zip	64 MB
Total size	848 MB

DGT-TM-release 2018	Size
Vol_2017_1.zip	254 MB
Vol_2017_2.zip	173 MB
Total size	427 MB

DGT-TM-release 2019	Size
Vol_2018_1.zip	254 MB
Vol_2018_2.zip	173 MB
Vol_2018_3.zip	173 MB
Total size	600 MB

DGT-TM-release 2020	Size
Vol_2019_1.zip	247 MB
Vol_2019_2.zip	148 MB
Total size	395 MB

DGT-TM-release 2021	Size
Vol_2020_1.zip	127 MB
Vol_2020_2.zip	126 MB
Vol_2020_3.zip	112 MB
Vol_2020_4.zip	111 MB
Vol_2020_5.zip	107 MB
Total size	583 MB

How to produce bilingual extractions

The multilingual extraction has English as the source language. Users can extract any language pair as follows, using the extraction tool TMXtract:


For the Windows Operating System:

Download the TMXtract.jar file;
Open TMXtract by double-clicking on the TMXtract.jar file (if this fails, see http://stackoverflow.com/questions/394616/running-jar-file-in-windows#394628 for ways to fix the problem);
Select Input files (Volume_1.zip, etc.; multiple selection is possible);
Specify Output file (the result is always 1 file);
Choose Source and Target language;
Click on Start.

For other Operating Systems or in Windows command line:

Download the zip files and the extraction tool TMXtract (jar file) onto your computer. The files should be in the same directory;
Start a command shell;
Invoke the program by the command java -jar TMXtract.jar <Source> <Target> <Output file> [ <Input files> ...];
The progress of the extraction will be displayed on the console.
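When many language pairs are needed, the command-line invocation described above can be scripted. The Python sketch below only assembles the argument list in the documented order (source, target, output file, input files). The helper name `build_tmxtract_command`, the language codes `EN` and `FR` and the output file name are assumptions for illustration; check the tool itself for the codes it actually accepts.

```python
import shlex


def build_tmxtract_command(source, target, output_file, input_files,
                           jar_path="TMXtract.jar"):
    """Assemble: java -jar TMXtract.jar <Source> <Target> <Output file> [<Input files> ...]"""
    return ["java", "-jar", jar_path, source, target, output_file, *input_files]


# Hypothetical language codes and file names.
command = build_tmxtract_command(
    "EN", "FR", "en-fr.tmx", ["Volume_1.zip", "Volume_2.zip"]
)
print(shlex.join(command))

# To actually run the extraction (requires Java and the downloaded files):
#   import subprocess
#   subprocess.run(command, check=True)
```

Passing the arguments as a list rather than a single string avoids quoting problems with file names that contain spaces.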

More details / Reference publication

For a more detailed description of the DGT-TM, including more statistics on the resource, see the following publication. When making reference to DGT-TM in scientific publications, please refer to:

Steinberger Ralf, Andreas Eisele, Szymon Klocek, Spyridon Pilos & Patrick Schlüter (2012). DGT-TM: A freely Available Translation Memory in 22 Languages. Proceedings of the 8th international conference on Language Resources and Evaluation (LREC'2012), Istanbul, 21-27 May 2012.

For a contrastive overview of DGT-TM and the other multilingual text resources offered for download on this site, you can read the following journal article:

Steinberger Ralf, Mohamed Ebrahim, Alexandros Poulis, Manuel Carrasco-Benitez, Patrick Schlüter, Marek Przybyszewski & Signe Gilbro (2014). An overview of the European Union's highly multilingual parallel corpora. Language Resources and Evaluation Journal (LRE). December 2014, Volume 48, Issue 4, pp 679-707. DOI: 10.1007/s10579-014-9277-0.

DGT-TM has been registered with the International Standard Natural Language Resource number (ISLRN) 710-653-952-884-4.

Acknowledgement and Contact

For more information, you can contact the following persons:

Directorate-General for Translation (DGT)
Patrick Schlüter (Email address: Patrick[dot]Schluter[at]ec[dot]europa[dot]eu)
Unit DGT.R.3 Informatics
Jean-Monnet Building A2/137
L-2920 Luxembourg
More information on DGT

Joint Research Centre (JRC)
Ralf Steinberger (Email address: JRC-EMM-SUPPORT[at]ec[dot]europa[dot]eu)
I.3 Competence Centre on Text Mining and Analysis
Via E. Fermi 2749, T.P. 440
I-21027 Ispra (VA)


The Directorate-General for Translation (DGT) is one of the biggest translation services in the world. It is also the largest single department in the European Commission with a total number of around 2500 staff members and a total production of some 2 million pages a year. Various computer tools are available to translators, who use them according to their translation needs and personal preferences. Irrespective of their preferred working methods, all translators need the possibility to reuse previously translated texts (translation memories, electronic archives, etc.). To perform its tasks, DG Translation has a wide variety of language resources at the disposal of its staff: terminology in many different forms (multilingual libraries, terminology databases, electronic dictionaries, etc.); translation memories enabling genuine data sharing; texts as such, to be retrieved from internal archiving systems and other sources; and machine translation, which, at the European Commission, is used both as a browsing tool to view the gist of a text and as a genuine translation aid.

The Joint Research Centre (JRC) is also a Directorate-General of the European Commission. The JRC has for many years worked on highly multilingual text analysis applications. The JRC has contributed to the dissemination of the DGT Translation Memory and it has itself produced and disseminated a number of further highly multilingual linguistic resources: the JRC-Acquis, JRC-Names, the JRC Eurovoc Indexer JEX, and a series of further linguistic resources.

The JRC is the creator of the Europe Media Monitor (EMM) family of news aggregation and analysis applications. EMM aggregates news from thousands of news portals world-wide in about 50 languages (status 2012). EMM's news analysis tools always show the latest news from around the world as its pages are updated every five minutes. EMM not only displays the news articles, but also groups related articles, classifies them into hundreds of news categories and displays automatically extracted meta-information together with the news items. As a result, EMM has many users from around the world, with up to 1.2 million hits per day. Much information is available via RSS feeds, allowing EMM output to be combined with third-party tools. The JRC is scientifically very active, as can be seen from the large number of international scientific publications in the field of multilingual text mining and media monitoring. JRC's publicly accessible media monitoring applications are:

NewsBrief: Breaking news detection and display of the very latest thematically organised news from around the world; grouping of related news; RSS feeds and automatic email alerting; 50 languages.
MedISys: EMM's Medical Information System selects the health-related EMM news in 50 languages and additionally gathers documents from about 250 medical web sites. MedISys displays the medical news according to diseases, symptoms, organisations and themes and has statistics-based early warning functions for each category. A second, restricted site offers more functionality to EU public health organisations.
NewsExplorer: Summary of the news in 21 languages for each 24-hour period; grouping of related news into clusters; linking of daily clusters over time and across languages; visualisation of time lines and of geographical news coverage; information extraction to detect and disambiguate persons, organisations and locations; quotation recognition; individual, daily-updated pages for over one million names; detection of quotations by and about people; automatic generation of social networks.