EU Science Hub

DGT-Acquis

Description of the Data - Data Format

The original data of the Official Journal (OJ) has been processed in several steps. Each step refined the result of the previous step to a finer granularity: (1) original data, (2) file level in Formex4 format, (3) file level in plain text and (4) paragraph level. The result of each step is a corpus packaged as a self-contained Multilingual Dataset Format (muset) file. Even though the musets are independent, they are linked to each other so that, for example, one can find the source document of any given text segment. Data users can choose the processing level that best suits their own needs.

 

Statistics on the corpus

ID     | Title                                 | Granularity | Format      | Structure | Zipped | Statistics         | Comments
-------|---------------------------------------|-------------|-------------|-----------|--------|--------------------|-------------------------------------------
da1-ox | Original data                         | original    | formex4     | tree      | 81 GB  | 3,901,048 files    | original filenames; with TIFF files
da1-fx | File level in Formex4                 | file        | formex4     | tree      | 9 GB   | 3,537,876 files    | standardised filenames; without TIFF files
da1-ft | File level in plain text format       | file        | text        | tree      | 5 GB   | 3,537,872 files    | XML marking removed
da1-pc | Paragraph level in column-file format | paragraph   | column-file | table     | 3 GB   | 4,900,254 segments | one table

  • ID
    muset identity. Example: da1-ox. It is composed of:
  • Title
    The title of the Multilingual Dataset Format (muset). Using these links one can download individual language files.
  • Granularity
    Details on this can be found in sections 10.4.5 and 10.5.4 of the document 'Open Architecture for multilingual parallel texts' (Carrasco Benitez, 2008).
            original
            file
            paragraph
            sentence
            sub-sentence
  • Format
            formex4: Formex 4 (XML) format.
            text: plain text.
            column-file: each column in the table is in one file. The file a1.txt contains the filenames indicating the provenance of each segment.
  • Structure
            tree: tree of directories and files; the data is in the original context.
            table: one table with all the data; the data is out of context. Details are in section 10.6 of the document 'Open Architecture for multilingual parallel texts' (Carrasco Benitez, 2008).
  • Zipped
    Size of the muset zipped into one file; available on request from the DGT contact person mentioned at the end of this page.
  • Statistics
    Main statistics, such as the number of files or segments.
  • Comments
    General comments.
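
The column-file layout can be illustrated with a small sketch. It assumes one segment per line and that line N of every column file refers to the same segment, which is what the paste-based extraction example further down this page implies. The filenames follow that example (data.en.txt, data.fr.txt, data.a1.txt); the toy content and the output name row2.tsv are invented for illustration.

```shell
#!/bin/sh
# Toy column files (invented content): two segments, no French for the second.
printf 'a.xml\nb.xml\n' > data.a1.txt
printf 'Hello\nWorld\n' > data.en.txt
printf 'Bonjour\n\n' > data.fr.txt

# Every column file should have the same number of lines; otherwise
# line N would no longer refer to the same segment in each file.
expected=$(wc -l < data.a1.txt | tr -d ' ')
for f in data.en.txt data.fr.txt; do
  n=$(wc -l < "$f" | tr -d ' ')
  if [ "$n" -ne "$expected" ]; then
    echo "line count mismatch in $f: $n instead of $expected" >&2
  fi
done

# Rebuild row 2 of the table (provenance, English, French), tab-separated.
paste data.a1.txt data.en.txt data.fr.txt | sed -n '2p' > row2.tsv
cat row2.tsv
```

Running a check of this kind before combining column files helps catch a truncated download, since a shorter file would silently misalign every later row.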

 

What is the difference between DGT-Acquis, JRC-Acquis and DGT-TM?

There is no simple answer to that question. For a detailed answer, you can read the following article:

Both JRC-Acquis and DGT-Acquis are paragraph-aligned parallel corpora, i.e. corpora consisting of full text documents with added meta-information on which paragraphs are aligned with which others in the other languages. JRC-Acquis contains data from the 1950s up to 2006, while DGT-Acquis contains data starting in 2004. The two corpora therefore cannot overlap for data up to 2003 or from 2007 onwards, but there will be some overlap for the years 2004 to 2006. If you need to avoid overlapping document sets from the two sources, try using the Eur-Lex document identifiers. The processing steps (data preparation and alignment) to produce both data sets were entirely different. The format is not the same, and the quality of both resources is expected to be different, as well.
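
One possible way to apply the identifier-based approach is to compare two lists of document identifiers. The sketch below assumes you have already extracted one identifier per line from each corpus into two files; how to obtain the identifiers is not covered here. The filenames (jrc_ids.txt, dgt_ids.txt and the output files) and the placeholder identifiers are hypothetical, and real Eur-Lex identifiers will look different.

```shell
#!/bin/sh
# Hypothetical identifier lists, one per line (placeholder values only).
printf 'doc-A\ndoc-B\ndoc-C\n' > jrc_ids.txt
printf 'doc-B\ndoc-D\ndoc-A\n' > dgt_ids.txt

# comm requires sorted input; sort -u also removes duplicates.
sort -u jrc_ids.txt > jrc_sorted.txt
sort -u dgt_ids.txt > dgt_sorted.txt

# Identifiers present in both lists (the overlap).
comm -12 jrc_sorted.txt dgt_sorted.txt > overlap_ids.txt

# Identifiers found only in the DGT list.
comm -13 jrc_sorted.txt dgt_sorted.txt > dgt_only_ids.txt

cat dgt_only_ids.txt
```

Documents listed in overlap_ids.txt can then be taken from one corpus only, so that no document is counted twice.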

DGT-Acquis and the translation memory DGT-TM are of a different nature. While the DGT-Acquis parallel corpus contains full documents with additional segmentation information, DGT-TM is a translation memory, i.e. a collection of Translation Units (sentences and the like). In parallel corpora, one can thus see each sentence in its context, while in translation memories, each sentence is in isolation, i.e. out of context. As for their overlap, DGT-TM is based exclusively on the L-Series of the Official Journal, while DGT-Acquis also contains the LM, C, CA and CE collections (see the table of documents included in DGT-Acquis, on this page under "What is the DGT-Acquis?"). Again, the processing steps (data preparation and alignment) to produce both data sets were entirely different. The format is not the same, and the quality of both resources is expected to be different, as well.

 

Conditions for Use

I. Intellectual property and conditions of use of data

The DGT-Acquis data is the exclusive property of the European Commission. The Commission cedes its non-exclusive rights free of charge and world-wide for the entire duration of the protection of those rights to the re-user, for all kinds of use which comply with the conditions laid down in the Commission Decision of 12 December 2011 on the re-use of Commission documents, published in Official Journal of the European Union L330 of 14 December 2011, pages 39 to 42.

Any re-use of the data or of the structured elements contained in it is required to be identified by the re-user, who is under an obligation to state the source of the documents used: the website address, the date of the latest update and the fact that the European Commission retains ownership of the data.

II. Conditions for use of software

The DGT-Acquis data is distributed with the software necessary for its exploitation/extraction. Use of such software must be carried out in accordance with the conditions laid down in the EUPL licence.

III. Responsibility

The data and the accompanying software are made available, without any guarantee, explicit or tacit. The Commission cannot be held responsible for any loss, injury or damage the re-user may suffer due to the re-use. The Commission does not however guarantee the absence of any irregularities which may be present in the data, within the structured data they contain or the software itself. The Commission does not guarantee the on-going distribution of said data and software.

The Commission cannot be held responsible for any loss, injury or damage caused to third parties as a result of the re-use. The re-user shall bear sole responsibility for the re-use of the data collection, the structured elements it contains and the software. Re-use must not mislead third parties in respect of the contents of the data and the structured elements it contains, the source of the contents or the date of the last update thereto. This disclaimer is not intended to limit the liability of the Commission in violation of any requirements laid down in applicable national law or to exclude its liability in cases where this is not permitted by the applicable law.

IV. Definitions

Definitions of terms used by the Commission Decision of 12 December 2011 on the re-use of Commission documents, published in Official Journal of the European Union L330 of 14 December 2011, pages 39 to 42, are supplemented by the following definitions:

Re-user: Any natural or legal person who re-uses the documents, in accordance with the conditions laid down in the Commission Decision of 12 December 2011 on the re-use of Commission documents, published in Official Journal of the European Union L330 of 14 December 2011, pages 39 to 42.

Databases: A collection of independent works, data or other materials arranged in a systematic or methodical way and individually accessible by electronic means or in any other way.

 

Download the DGT-Acquis

There are three downloading options (see Section Description of the Data - Data Format above for details):

  • Full corpus (da1-ox and da1-fx): Please contact the DGT contact person mentioned at the end of the page if you want to receive this data. These packages are very big (several GBs; see the table in Section Statistics on the corpus above).
  • One language per collection per zip file for "file level in plain text format" (da1-ft): You can browse and select the DGT-Acquis files you are interested in.
  • One language per zip file for "paragraph level in column-file format" (da1-pc): You can download these files by clicking on the links in the table below.

 

How to produce bilingual extractions

Download the required zipped files listed above. Each file contains one language. The file data.a1.txt.zip contains the filenames indicating the source of each segment.

Here is a Unix example that produces a bilingual English-French file, with the two languages separated by the character '|' and with every line dropped where the segment is empty in either language:

paste -d'|' data.en.txt data.fr.txt | sed '/^|/d ; /|$/d' > bilang.txt
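
The same approach can be extended to keep the provenance of each segment. The following is a minimal sketch, assuming that data.a1.txt (unzipped from data.a1.txt.zip) holds one filename per line, aligned line by line with data.en.txt and data.fr.txt, and that no segment contains a tab character. The toy input lines and the output name bilang_src.tsv are invented for illustration.

```shell
#!/bin/sh
# Toy stand-ins for the real column files (invented content, three segments).
printf 'doc1.xml\ndoc1.xml\ndoc2.xml\n' > data.a1.txt
printf 'Article 1\n\nFinal provisions\n' > data.en.txt
printf 'Article premier\nDispositions finales\n\n' > data.fr.txt

# Join the three columns with tabs (the default paste delimiter), then keep
# only rows where both the English (field 2) and French (field 3) segments
# are non-empty.
paste data.a1.txt data.en.txt data.fr.txt \
  | awk -F'\t' '$2 != "" && $3 != ""' > bilang_src.tsv

cat bilang_src.tsv
```

Using a tab as separator and testing fields with awk avoids discarding segments that merely begin or end with the '|' character, which the sed patterns above would also remove.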

 

Acknowledgement and contact

For more information, you can contact the following persons:

Directorate-General for Translation (DGT)
        M.T. Carrasco Benitez (Email address: manuel.carrasco-benitez@ec.europa.eu)
        Unit DGT.R.3 Informatics
        Jean-Monnet Building A2/137
        L-2920 Luxembourg
        More information on DGT.

Joint Research Centre (JRC)
        Ralf Steinberger (Email address format: Firstname.Lastname@jrc.ec.europa.eu)
        IPSC - GlobeSec - OPTIMA
        Via E. Fermi 2749, T.P. 267
        I-21027 Ispra (VA)

When making reference to the DGT-Acquis in scientific publications, please quote the following paper: