This extraction of aligned sentences can be used to produce a parallel multilingual corpus of the European Union’s legislative documents (Acquis Communautaire) in 24 EU languages. The aligned translation units have been provided by the Directorate-General for Translation of the European Commission by extraction from one of its large shared translation memories in EURAMIS (European advanced multilingual information system). This memory contains most, although not all, of the documents which make up the Acquis Communautaire, as well as some other documents which are not part of the Acquis.
Before the documents were aligned, the source material was pre-processed to reduce the number of entries of low value for the translators (short sentences, long sentences, obvious mismatches, etc.). This means that the contents of the documents might have changed. The documents were aligned in accordance with the segmentation rules used in the Directorate-General for Translation of the European Commission. The extraction keeps only the EUR-Lex document number (NumDoc), from which other information (e.g. year and document type) can be derived. For further information on the NumDoc structure, see the information provided by EUR-Lex.
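As a rough illustration of how the year and document type might be derived from a NumDoc, the Python sketch below assumes a layout of one sector character, a four-digit year, a one- or two-letter document type and a trailing number. This layout, the function name parse_numdoc and the sample value are assumptions made for this example only; the EUR-Lex documentation remains the authoritative description of the structure.

```python
import re

# Assumed layout (not confirmed by this page): sector character,
# four-digit year, one or two document-type letters, remaining number.
NUMDOC_RE = re.compile(
    r"^(?P<sector>[0-9A-Z])(?P<year>\d{4})(?P<doctype>[A-Z]{1,2})(?P<number>.+)$"
)


def parse_numdoc(numdoc):
    """Split a NumDoc string into its assumed parts, or return None."""
    match = NUMDOC_RE.match(numdoc.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["year"] = int(fields["year"])
    return fields


# Hypothetical identifier, used only to show the shape of the result.
example = parse_numdoc("32006R1234")
print(example)
```

Values that do not fit the assumed pattern return None rather than a guess, so unexpected identifiers can be inspected by hand.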
The corpus is also available as a parsebank, i.e. it has been automatically annotated for part-of-speech, morphosyntax, lemma, and dependency annotations with UD-PIPE. The DGT-UD parsebank can be downloaded from the CLARIN.SI repository under http://hdl.handle.net/11356/1197, where you can also find links to this corpus installed under two concordancers.
The DGT Translation Memory is currently available in 24 languages. For statistics on the total number of translation units, words and characters available for each language, you can download the file DGT-TM_Statistics.pdf.
For the number of aligned translation units for each language pair and further statistics regarding the release DGT-TM-2011, see the DGT-TM reference publication. For the later releases, statistics files are included in the first zip file of each release.
I. Intellectual property and conditions of use of databases
The DGT-TM database is the exclusive property of the European Commission. The Commission cedes its non-exclusive rights free of charge and world-wide for the entire duration of the protection of those rights to the re-user, for all kinds of use which comply with the conditions laid down in the Commission Decision of 12 December 2011 on the re-use of Commission documents, published in Official Journal of the European Union L330 of 14 December 2011, pages 39 to 42.
Any re-use of the database or of the structured elements contained in it is required to be identified by the re-user, who is under an obligation to state the source of the documents used: the website address, the date of the latest update and the fact that the European Commission retains ownership of the data.
II. Conditions for use of software
The DGT-TM database is distributed with the software necessary for its exploitation/extraction. Use of such software must be carried out in accordance with the conditions laid down in the EUPL licence.
The database and the accompanying software are made available, without any guarantee, explicit or tacit. The Commission cannot be held responsible for any loss, injury or damage the re-user may suffer due to the re-use. The Commission does not however guarantee the absence of any irregularities which may be present in the databases, within the structured data they contain or the software itself. The Commission does not guarantee the on-going distribution of said databases and software.
The Commission cannot be held responsible for any loss, injury or damage caused to third parties as a result of the re-use. The re-user shall bear sole responsibility for the re-use of the data collection, the structured elements it contains and the software. Re-use must not mislead third parties in respect of the contents of the database and the structured elements it contains, the source of the contents or the date of the last update thereto.
This disclaimer is not intended to limit the liability of the Commission in violation of any requirements laid down in applicable national law or to exclude its liability in cases where this is not permitted by the applicable law.
Definitions of terms used by the Commission Decision of 12 December 2011 on the re-use of Commission documents, published in Official Journal of the European Union L330 of 14 December 2011, pages 39 to 42, are supplemented by the following definitions:
Re-user: Any natural or legal person who re-uses the documents, in accordance with the conditions laid down in the Commission Decision of 12 December 2011 on the re-use of Commission documents, published in Official Journal of the European Union L330 of 14 December 2011, pages 39 to 42.
Databases: A collection of independent works, data or other materials arranged in a systematic or methodical way and individually accessible by electronic means or in any other way.
Some of the multilingual parallel resources available via the JRC's Language Technology resources page are clearly distinct, but others are similar or overlap. In particular, DGT-TM (first released in 2007) and the JRC-Acquis (released in 2006) overlap to a large extent, as both are mostly based on the Acquis Communautaire of the European Union. Commonalities and differences between the various text resources, as well as background information such as the motivation for their release and information on usage conditions, are summarised in the journal article 'An overview of the European Union’s highly multilingual parallel corpora', published in 2014.
The distribution consists of a collection of zip files (see below), each no larger than 100 MB. Each zip file contains TMX files identified by the EUR-Lex number of the underlying Acquis Communautaire documents, together with a plain-text file list specifying the languages in which the documents are available.
The multilingual extraction has English as the source language. Users can extract any language pair as follows, using the extraction tool TMXtract:
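The TMXtract procedure itself is not reproduced in this text. Independently of that tool, the following Python sketch illustrates the general idea of pulling one language pair out of a multilingual TMX document. It assumes standard TMX elements (tu, tuv, seg) and language codes carried in either an xml:lang or a lang attribute; the function name extract_pair and the sample data are invented for this illustration and do not describe how TMXtract works.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def extract_pair(tmx_text, src="EN", tgt="FR"):
    """Return (source, target) segment pairs from a TMX document string."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.iter("tuv"):
            # Assumption: the language sits in xml:lang or in a plain lang attribute.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").upper()
            seg = tuv.find("seg")
            if seg is not None:
                # Keep only the primary language subtag, e.g. EN from EN-GB.
                segments[lang.split("-")[0]] = "".join(seg.itertext()).strip()
        # Keep a unit only if both requested languages are present.
        if src in segments and tgt in segments:
            pairs.append((segments[src], segments[tgt]))
    return pairs


# Invented sample data in TMX shape, for illustration only.
SAMPLE = """<tmx version="1.4"><header/><body>
<tu>
<tuv xml:lang="EN-GB"><seg>Article 1</seg></tuv>
<tuv xml:lang="FR-FR"><seg>Article premier</seg></tuv>
<tuv xml:lang="DE-DE"><seg>Artikel 1</seg></tuv>
</tu>
<tu>
<tuv xml:lang="EN-GB"><seg>Done at Brussels</seg></tuv>
<tuv xml:lang="DE-DE"><seg>Geschehen zu Brüssel</seg></tuv>
</tu>
</body></tmx>"""

pairs = extract_pair(SAMPLE, "EN", "FR")
print(pairs)
```

Translation units that lack one of the two requested languages are skipped, which matches the fact that not every document is available in every language.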
For a more detailed description of the DGT-TM, including more statistics on the resource, see the following publication. When making reference to DGT-TM in scientific publications, please refer to:
For a contrastive overview of DGT-TM and the other multilingual text resources offered for download on this site, you can read the following journal article:
DGT-TM has been registered with the International Standard Natural Language Resource number (ISLRN) 710-653-952-884-4.
For more information, you can contact the following persons:
Directorate-General for Translation (DGT)
Patrick Schlüter (Email address: Patrick.Schluter@ec.europa.eu)
Unit DGT.R.3 Informatics
Jean-Monnet Building A2/137
More information on DGT
Joint Research Centre (JRC)
Ralf Steinberger (Email address: JRC-EMM-SUPPORT@ec.europa.eu)
I.3 Competence Centre on Text Mining and Analysis
Via E. Fermi 2749, T.P. 440
I-21027 Ispra (VA)
The Directorate-General for Translation (DGT) is one of the biggest translation services in the world. It is also the largest single department in the European Commission, with around 2500 staff members and a total production of some 2 million pages a year. Various computer tools are available to translators, who use them according to their translation needs and personal preferences. Irrespective of their preferred working methods, all translators need to be able to reuse previously translated texts (translation memories, electronic archives, etc.). To perform its tasks, DG Translation has a wide variety of language resources at the disposal of its staff: terminology in many different forms (multilingual libraries, terminology databases, electronic dictionaries, etc.); translation memories enabling genuine data sharing; texts as such, to be retrieved from internal archiving systems and other sources; and machine translation, which, at the European Commission, is used both as a browsing tool to view the gist of a text and as a genuine translation aid.
The Joint Research Centre (JRC) is also a Directorate-General of the European Commission. The JRC has for many years worked on highly multilingual text analysis applications. It has contributed to the dissemination of the DGT Translation Memory and has itself produced and disseminated a number of further highly multilingual linguistic resources: the JRC-Acquis, JRC-Names, the JRC Eurovoc Indexer JEX, and a series of further linguistic resources.
The JRC is the creator of the Europe Media Monitor (EMM) family of news aggregation and analysis applications. EMM aggregates news from thousands of news portals world-wide in about 50 languages (status 2012). EMM's news analysis tools always show the latest news from around the world, as its pages are updated every five minutes. Because EMM not only displays the news articles but also groups related articles, classifies them into hundreds of news categories and displays automatically extracted meta-information together with the news items, it has many users from around the world, resulting in up to 1.2 million hits per day. Much information is available via RSS feeds, allowing EMM output to be combined with third-party tools. The JRC is scientifically very active, as can be seen from the large number of international scientific publications in the field of multilingual text mining and media monitoring. JRC's publicly accessible media monitoring applications are:
NewsBrief: Display of the very latest thematically organised news from around the world; grouping of related news; breaking news detection; RSS feeds and automatic email alerting; 50 languages.
MedISys: EMM's Medical Information System selects the health-related EMM news in 50 languages and additionally gathers documents from about 250 medical web sites. MedISys displays the medical news according to diseases, symptoms, organisations and themes and has statistics-based early warning functions for each category. A second, restricted site offers more functionality to EU public health organisations.
NewsExplorer: Summary of the news in 21 languages for each 24-hour period; grouping of related news into clusters; linking of daily clusters over time and across languages; visualisation of time lines and of geographical news coverage; information extraction to detect and disambiguate persons, organisations and locations; quotation recognition; individual, daily-updated pages for over one million names; detection of quotations by and about people; automatic generation of social networks.