DGT-Acquis

 

What is the DGT-Acquis?

The DGT-Acquis is a family of multilingual parallel corpora extracted from the Official Journal of the European Union (OJ) in Formex 4 (XML) format. It consists of documents from mid-2004 to the end of 2011 in up to 23 languages.

The following OJ series are included:

Official Journal series

Year   L (Legislation)     C (Information and notices)
2004   L 2004              C 2004, CA 2004, CE 2004
2005   L 2005, LM 2005     C 2005, CA 2005, CE 2005
2006   L 2006, LM 2006     C 2006, CA 2006, CE 2006
2007   L 2007, LM 2007     C 2007, CA 2007, CE 2007
2008   L 2008, LM 2008     C 2008, CA 2008, CE 2008
2009   L 2009, LM 2009     C 2009, CA 2009, CE 2009
2010   L 2010, LM 2010     C 2010, CA 2010, CE 2010
2011   L 2011, LM 2011     C 2011, CA 2011, CE 2011

 

Description of the Data - Data Format

The original data of the OJ has been processed in several steps. In each step, the result of the previous step was refined to a finer granularity: (1) original data, (2) file level in Formex4 format, (3) file level in plain text and (4) paragraph level. The result of each step is a corpus packaged as a self-contained Multilingual Dataset Format (muset) file. Even though the musets are independent, they are linked to each other so that, for example, one can find the source document of any given text segment. Data users can choose the data with the most appropriate processing level for their own needs.

The table in the next section (Statistics on the corpus) describes the data and provides some statistics.

The original data (da1-ox) includes both the XML and the TIFF files. This makes it possible to use the data for other types of applications (e.g. work on optical character recognition). The original data also allows users to re-process the whole data set using their own tools and methods.

The file level formats (da1-fx in Formex 4 format and da1-ft in plain text format) are relevant for users who need access to the full texts, e.g. to analyse the discourse structure, to consider the context of each sentence, etc.

The paragraph level format (da1-pc) is relevant for people who do not need access to the full text, but who are mostly interested in smaller segments and their translations, e.g. to produce dictionaries or to work on (machine) translation.

Unfortunately, at this time, we cannot provide any further statistics on this data, nor more information on how the data was produced.

 

Statistics on the corpus

ID      Title                                  Granularity  Format       Structure  Zipped  Statistics          Comments
da1-ox  Original data                          original     formex4      tree       81 GB   3,901,048 files     original filenames; with TIFF files
da1-fx  File level in Formex4                  file         formex4      tree       9 GB    3,537,876 files     standardised filenames; without TIFF files
da1-ft  File level in plain text format        file         text         tree       5 GB    3,537,872 files     XML marking removed
da1-pc  Paragraph level in column-file format  paragraph    column-file  table      3 GB    4,900,254 segments  one table

  • ID
    muset identifier, for example da1-ox. It is composed of:
  • Title
    The title of the Multilingual Dataset Format (muset). Using these links one can download individual language files.
  • Granularity
    Details on this can be found in sections 10.4.5 and 10.5.4 of the document 'Open Architecture for multilingual parallel texts' (Carrasco Benitez, 2008).
            original
            file
            paragraph
            sentence
            sub-sentence
  • Format
            formex4: Formex 4 (XML) format.
            text: plain text.
            column-file: each column in the table is in one file. The file a1.txt[.zip] contains, for each segment, the name of the file it came from.
  • Structure
            tree: tree of directories and files; the data is in the original context.
            table: one table with all the data; the data is out of context. Details are in section 10.6 of the document 'Open Architecture for multilingual parallel texts' (Carrasco Benitez, 2008).
  • Zipped
    Size of the muset zipped into one file; available on request from the DGT contact person mentioned at the end of this page.
  • Statistics
    Main statistics, such as the number of files or segments.
  • Comments
    General comments.
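
Since each column of the table is stored in its own file, a segment can be inspected across columns by reading the same line number from each file. The sketch below assumes that line N of every column file refers to the same segment, which is also what the paste example further down this page relies on. The function name show_segment is illustrative and not part of the distribution.

```shell
# Sketch: print line N of each column file, prefixed with the file name.
# Usage: show_segment N FILE...
show_segment() {
    n=$1
    shift
    for f in "$@"; do
        # Print only line n, then quit so the rest of the file is not read.
        printf '%s\t%s\n' "$f" "$(sed -n "${n}{p;q;}" "$f")"
    done
}
```

For example, `show_segment 1000 data.a1.txt data.en.txt data.fr.txt` would show the source filename of segment 1000 together with its English and French text, assuming the unzipped files follow the naming pattern used in the bilingual extraction example.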

 

What is the difference between DGT-Acquis, JRC-Acquis and DGT-TM?

There is no simple answer to that question.

Both JRC-Acquis and DGT-Acquis are paragraph-aligned parallel corpora, i.e. corpora consisting of full-text documents with added meta-information on which paragraphs are aligned with which paragraphs in the other languages. JRC-Acquis contains data from the 1950s up to 2006, while DGT-Acquis contains data from 2004 onwards. The two therefore do not overlap for data up to 2003 or from 2007 onwards, but there will be some overlap for the years 2004 to 2006. If you need to avoid overlapping document sets from the two sources, try using the Eur-Lex document identifiers. The processing steps (data preparation and alignment) used to produce the two data sets were entirely different. The format is not the same, and the quality of the two resources is also expected to differ.
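
One way to apply the identifier-based filtering is sketched below. It assumes you have already written the Eur-Lex identifiers of the documents in each corpus to two plain-text files, one identifier per line. How to obtain those lists is not covered here, and the function name is hypothetical.

```shell
# Sketch: print the identifiers listed in FIRST_FILE that do not occur in
# SECOND_FILE (one identifier per line, compared as exact strings).
# Usage: ids_only_in_first FIRST_FILE SECOND_FILE
ids_only_in_first() {
    # awk reads SECOND_FILE first and remembers its lines, then prints each
    # line of FIRST_FILE that was not seen.
    awk 'FILENAME == ARGV[1] { seen[$0]; next } !($0 in seen)' "$2" "$1"
}
```

For example, `ids_only_in_first dgt-ids.txt jrc-ids.txt` (hypothetical file names) would list the DGT-Acquis documents that have no counterpart in JRC-Acquis, provided both lists use the same identifier scheme.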

DGT-Acquis and the translation memory DGT-TM are of a different nature. While the DGT-Acquis parallel corpus contains full documents with additional segmentation information, DGT-TM is a translation memory, i.e. a collection of Translation Units (sentences and the like). In parallel corpora, one can thus see each sentence in its context, while in translation memories, each sentence is in isolation, i.e. out of context. As for their overlap, DGT-TM is based exclusively on the L-Series of the Official Journal, while DGT-Acquis also contains the LM, C, CA and CE collections (see the table of documents included in DGT-Acquis, on this page under "What is the DGT-Acquis?"). Again, the processing steps (data preparation and alignment) to produce both data sets were entirely different. The format is not the same, and the quality of both resources is expected to be different, as well.

 

Conditions for Use

I. Intellectual property and conditions of use of data

The DGT-Acquis data is the exclusive property of the European Commission. The Commission cedes its non-exclusive rights free of charge and world-wide for the entire duration of the protection of those rights to the re-user, for all kinds of use which comply with the conditions laid down in the Commission Decision of 12 December 2011 on the re-use of Commission documents, published in Official Journal of the European Union L330 of 14 December 2011, pages 39 to 42.

Any re-use of the data or of the structured elements contained in it is required to be identified by the re-user, who is under an obligation to state the source of the documents used: the website address, the date of the latest update and the fact that the European Commission retains ownership of the data.

II. Conditions for use of software

The DGT-Acquis data is distributed with the software necessary for its exploitation/extraction. Use of such software must be carried out in accordance with the conditions laid down in the EUPL licence.

III. Responsibility

The data and the accompanying software are made available, without any guarantee, explicit or tacit. The Commission cannot be held responsible for any loss, injury or damage the re-user may suffer due to the re-use. The Commission does not however guarantee the absence of any irregularities which may be present in the data, within the structured data they contain or the software itself. The Commission does not guarantee the on-going distribution of said data and software.

The Commission cannot be held responsible for any loss, injury or damage caused to third parties as a result of the re-use. The re-user shall bear sole responsibility for the re-use of the data collection, the structured elements it contains and the software. Re-use must not mislead third parties in respect of the contents of the data and the structured elements it contains, the source of the contents or the date of the last update thereto. This disclaimer is not intended to limit the liability of the Commission in violation of any requirements laid down in applicable national law or to exclude its liability in cases where this is not permitted by the applicable law.

IV. Definitions

Definitions of terms used by the Commission Decision of 12 December 2011 on the re-use of Commission documents, published in Official Journal of the European Union L330 of 14 December 2011, pages 39 to 42, are supplemented by the following definitions:

Re-user: Any natural or legal person who re-uses the documents, in accordance with the conditions laid down in the Commission Decision of 12 December 2011 on the re-use of Commission documents, published in Official Journal of the European Union L330 of 14 December 2011, pages 39 to 42.

Databases: A collection of independent works, data or other materials arranged in a systematic or methodical way and individually accessible by electronic means or in any other way.

 

Download the DGT-Acquis

There are three download options (see the section 'Description of the Data - Data Format' above for details):

  • Full corpus (da1-ox and da1-fx): Please contact the DGT contact person mentioned at the end of the page if you want to receive this data. These packages are very big (several GBs; see the table in Section Statistics on the corpus above).
  • One language per collection per zip file for "file level in plain text format" (da1-ft): You can browse and select the DGT-Acquis files you are interested in.
  • One language per zip file for "paragraph level in column-file format" (da1-pc): You can download these files by clicking on the links in the table below.

 

How to produce bilingual extractions

Download the required zipped files listed above. Each file contains one language. The file data.a1.txt.zip contains the filenames indicating the source of each segment.

After unzipping, the following Unix example produces a bilingual English-French file, with the two languages separated by the character '|' and with segments that are empty in either language removed:

paste -d'|' data.en.txt data.fr.txt | sed '/^|/d ; /|$/d' > bilang.txt
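
The same approach can carry the provenance column along. The following is a minimal sketch, assuming the unzipped files are named data.a1.txt, data.en.txt and data.fr.txt and that line N of every file describes the same segment. The function name bilingual_with_source and the choice of a tab separator (which, unlike '|', is less likely to occur inside a segment) are illustrative, not part of the DGT-Acquis distribution.

```shell
# Sketch: join the provenance, first-language and second-language columns
# line by line, keeping only segments that are non-empty in both languages.
# Usage: bilingual_with_source SOURCE_FILE LANG1_FILE LANG2_FILE
bilingual_with_source() {
    # paste joins the files with tabs by default; awk splits on single tabs,
    # so empty fields are preserved and can be tested.
    paste "$1" "$2" "$3" | awk -F'\t' '$2 != "" && $3 != ""'
}
```

It could be called as `bilingual_with_source data.a1.txt data.en.txt data.fr.txt > bilang-with-source.tsv`, giving one line per segment with the source filename in the first column.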

 

Acknowledgement and contact

For more information, you can contact the following persons:

Directorate-General for Translation (DGT)
        M.T. Carrasco Benitez (Email address: manuel.carrasco-benitez@ec.europa.eu)
        Unit DGT.R.3 Informatics
        Jean-Monnet Building A2/137
        L-2920 Luxembourg
        More information on DGT.

Joint Research Centre (JRC)
        Ralf Steinberger (Email address format: Firstname.Lastname@jrc.ec.europa.eu)
        IPSC - GlobeSec - OPTIMA
        Via E. Fermi 2749, T.P. 267
        I-21027 Ispra (VA)

The Directorate-General for Translation (DGT) is one of the biggest translation services in the world. It is also the largest single department in the European Commission, with around 2500 staff members and a total production of some 2 million pages a year. Various computer tools are available to translators, who use them according to their translation needs and personal preferences. Irrespective of their preferred working methods, all translators need the possibility to reuse previously translated texts (translation memories, electronic archives, etc.). To perform its tasks, DG Translation has a wide variety of language resources at the disposal of its staff: terminology in many different forms (multilingual libraries, terminology databases, electronic dictionaries, etc.); translation memories enabling genuine data sharing; texts as such, to be retrieved from internal archiving systems and other sources; and machine translation, which, at the European Commission, is used both as a browsing tool to view the gist of a text and as a genuine translation aid.

The Joint Research Centre (JRC) is also a Directorate-General of the European Commission. The JRC has for many years worked on highly multilingual text analysis applications. The JRC has contributed to the dissemination of the DGT Translation Memory and it has itself produced and disseminated a number of further highly multilingual linguistic resources: the JRC-Acquis, JRC-Names, the JRC Eurovoc Indexer JEX, and a series of further linguistic resources.

The JRC is the creator of the Europe Media Monitor (EMM) family of news aggregation and analysis applications. EMM aggregates news from thousands of news portals world-wide in about 50 languages (status 2012). EMM's news analysis tools always show the latest news from around the world, as its pages are updated every five minutes. Because EMM not only displays the news articles but also groups related articles, classifies them into hundreds of news categories and displays automatically extracted meta-information together with the news items, it has many users from around the world, resulting in up to 1.2 million hits per day. Much information is available via RSS feeds, allowing EMM output to be combined with third-party tools. The JRC is scientifically very active, as can be seen from the large number of international scientific publications in the field of multilingual text mining and media monitoring. JRC's four publicly accessible media monitoring applications are:

  • NewsBrief: Display of the very latest thematically organised news from around the world; grouping of related news; breaking news detection; RSS feeds and automatic email alerting; 50 languages.
  • MedISys: EMM's Medical Information System selects the health-related EMM news in 50 languages and additionally gathers documents from about 250 medical web sites. MedISys displays the medical news according to diseases, symptoms, organisations and themes and has statistics-based early warning functions for each category. A second, restricted site offers more functionality to EU public health organisations.
  • NewsExplorer: Summary of the news in 20 languages for each 24-hour period; grouping of related news into clusters; linking of daily clusters over time and across languages; visualisation of time lines and of geographical news coverage; information extraction to detect and disambiguate persons, organisations and locations; quotation recognition; individual, daily-updated pages for over one million names; detection of quotations by and about people; automatic generation of social networks.

      

 
