EU Science Hub

DCEP: Digital Corpus of the European Parliament

 

Introduction

The Digital Corpus of the European Parliament (DCEP) contains the majority of the documents published on the European Parliament's official website. It comprises a variety of document types, from press releases to session and legislative documents related to the European Parliament's activities and bodies. The current version of the corpus contains documents produced between 2001 and 2012.

Format and Structure of the Data

DCEP is available as full-text documents and as sentence-aligned data. It includes alignment information for the full documents as well as for sentences, produced separately for each language pair. DCEP is accompanied by tools for producing sentence-aligned corpora separately for each of the 276 language pairs. The sentence-aligned data is in plain text format; XML/TMX output is not supported.
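
The language-pair counts quoted on this page follow from simple combinatorics: n languages give n(n-1)/2 unordered pairs. The short Python sketch below is illustrative only; the function name and the sample language codes are assumptions and are not part of the DCEP tools. It shows that 276 pairs correspond to 24 languages, while 23 languages give 253 pairs, the figure quoted in the Statistics section.

```python
from itertools import combinations
from math import comb


def language_pair_count(n_languages: int) -> int:
    """Return the number of unordered language pairs among n_languages."""
    return comb(n_languages, 2)


# Hypothetical sample of language codes, used only to show what a "pair" is.
sample_languages = ["EN", "FR", "DE", "ES"]
sample_pairs = list(combinations(sample_languages, 2))

print(language_pair_count(23))  # 253
print(language_pair_count(24))  # 276
print(sample_pairs)
```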

Document types

The following document types are included in the current version of the DCEP corpus:

 

| Document type | Brief description |
| --- | --- |
| AGENDA | Agenda of the plenary session meetings |
| COMPARL | Draft Agenda of the part-session |
| IM-PRESS and PRESS | General texts and articles on parliamentary news seen from a national angle, specific to one or several Member States, presentation of events in the EP |
| IMP-CONTRIB | Various press documents including technical announcements, events (hearings, workshops) produced by the Parliamentary Committees |
| MOTION | Motions for resolutions put to the vote in plenary |
| PV | Minutes of plenary sittings |
| REPORT | Reports of the parliamentary committees |
| RULES-EP | The Rules of Procedure of the EP laying down the rules for the internal operation and organisation of the EP |
| TA (Adopted Texts) | The motions for resolutions and reports tabled by Members and by the parliamentary committees are put to the vote in plenary, with or without a debate. After the vote, the final texts as adopted are published and forwarded to the authorities concerned |
| WQ (Written Question) | Written questions are texts from Members of the EP which request an answer in writing |
| WQA (Written Question Answer) | The written answer to the parliamentary written questions |
| OQ (Oral Question) | Oral questions are asked in plenary sitting and included in the day's debates |
| QT (Questions for Question Time) | Questions for question time are asked during the period set aside for questions during plenary sittings |

 

Statistics

Detailed statistics on DCEP are provided in separate tables, which summarise the corpus size in terms of:

  • Number of documents;
  • Number of words;
  • Number of unique words.

Words were counted with the Unix wc utility after removing the mark-up. Unique words, by contrast, were counted on tokenised text, taking into account only words consisting of alphabetic characters. The first table presents figures per language and document type, while the second contains statistics per language pair.
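
The two counting methods can be approximated with the Python sketch below. It is a rough illustration under stated assumptions, not the procedure actually used for DCEP: the page does not say how the mark-up was removed or which tokeniser was applied, so the tag-stripping regular expression, the simple tokeniser and the case-sensitive comparison are all assumptions, as are the function names.

```python
import re

# Assumption: mark-up consists of XML/HTML-style tags such as <p> or </p>.
TAG_RE = re.compile(r"<[^>]+>")
# Assumption: a very simple tokeniser that separates punctuation from words.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def strip_markup(text: str) -> str:
    """Replace every tag with a space so adjacent words do not merge."""
    return TAG_RE.sub(" ", text)


def count_words(text: str) -> int:
    """Count whitespace-delimited tokens, comparable to `wc -w`."""
    return len(strip_markup(text).split())


def unique_alphabetic_words(text: str) -> set:
    """Tokenise the text and keep only distinct, purely alphabetic tokens."""
    tokens = TOKEN_RE.findall(strip_markup(text))
    # Case is preserved here; whether DCEP folded case is not stated.
    return {token for token in tokens if token.isalpha()}


sample = "<p>The Parliament adopted the report in 2012.</p>"
print(count_words(sample))
print(sorted(unique_alphabetic_words(sample)))
```

In this sketch "2012." counts as a word for the total, as it would with wc, but is excluded from the unique-word set because it is not purely alphabetic.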

DCEP is the largest single release of documents published by an institution of the European Union. It contains various document types in 23 languages (253 language pairs). Here are some statistics:

  • Total number of documents: 1.5 million;
  • Total number of words: 1.37 billion;
  • Total number of English segments: 7.7 million;
  • The best-represented language in terms of number of words is English (103,458,996);
  • French and Spanish are each missing less than 10% relative to English.

More statistics are available in the publication DCEP - Digital Corpus of the European Parliament (LREC'2014).

 

Usage Conditions

I. Intellectual property and conditions of use of data

The DCEP data is the exclusive property of the European Parliament. The Parliament cedes its non-exclusive rights free of charge and world-wide for research purposes for the entire duration of the protection of those rights to the re-user.

Any re-use of the data or of the structured elements contained in it is required to be identified by the re-user, who is under an obligation to state the source of the documents used: the website address, the date of the latest update and the fact that the European Parliament retains ownership of the data.

II. Conditions for use of software

The DCEP data is distributed with the software necessary for its exploitation/extraction. Use of such software must be carried out in accordance with the conditions laid down in the EUPL license.

III. Responsibility

The data and the accompanying software are made available, without any guarantee, explicit or tacit. The Parliament cannot be held responsible for any loss, injury or damage the re-user may suffer due to the re-use. The Parliament does not however guarantee the absence of any irregularities which may be present in the data, within the structured data they contain or the software itself. The Parliament does not guarantee the on-going distribution of said data and software.

The Parliament cannot be held responsible for any loss, injury or damage caused to third parties as a result of the re-use. The re-user shall bear sole responsibility for the re-use of the data collection, the structured elements it contains and the software. Re-use must not mislead third parties in respect of the contents of the data and the structured elements it contains, the source of the contents or the date of the last update thereto. This disclaimer is not intended to limit the liability of the Parliament in violation of any requirements laid down in applicable national law or to exclude its liability in cases where this is not permitted by the applicable law.

 

Download the DCEP corpus

Please choose your download options on the DCEP download page.

 

How to produce bilingual corpora

To extract a bilingual corpus, which has been aligned at the sentence level using the HunAlign sentence aligner, please follow the instructions on the readme page.
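
For readers unfamiliar with HunAlign, the sketch below illustrates how a sentence alignment can be turned into sentence pairs. It assumes the alignment is stored in HunAlign's "ladder" format, in which each line holds a source sentence index, a target sentence index and a score, separated by tabs. Whether the DCEP distribution uses exactly this format is an assumption; the DCEP tools described on the readme page remain the authoritative way to extract the corpus. The function names and the sample sentences are hypothetical.

```python
def parse_ladder(ladder_text):
    """Parse ladder lines 'src_index<TAB>tgt_index<TAB>score' into tuples."""
    rungs = []
    for line in ladder_text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        src_index, tgt_index, score = line.split("\t")
        rungs.append((int(src_index), int(tgt_index), float(score)))
    return rungs


def ladder_to_pairs(rungs, src_sentences, tgt_sentences):
    """Join the sentences lying between consecutive rungs into aligned pairs.

    The score reported for each pair is the one on the opening rung.
    """
    pairs = []
    for (s0, t0, score), (s1, t1, _) in zip(rungs, rungs[1:]):
        source = " ".join(src_sentences[s0:s1])
        target = " ".join(tgt_sentences[t0:t1])
        pairs.append((source, target, score))
    return pairs


# Hypothetical example: the third source sentence aligns to two target sentences.
src = ["Good morning.", "The sitting is open.", "Thank you very much."]
tgt = ["Bonjour.", "La séance est ouverte.", "Merci.", "Merci beaucoup."]
ladder = "0\t0\t0.9\n1\t1\t0.8\n2\t2\t0.5\n3\t4\t0\n"

for source, target, score in ladder_to_pairs(parse_ladder(ladder), src, tgt):
    print(f"{score}\t{source}\t{target}")
```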

 

Acknowledgement and contact

DCEP has been created and published by the Machine Translation team of the European Parliament's Directorate-General for Translation (DGTRAD), represented by Najeh Hajlaoui (Machine Translation Expert at the European Parliament). DGTRAD was supported by Jaakko Väyrynen and Ralf Steinberger from the European Commission's Joint Research Centre. The sentence alignment was produced by Dániel Varga, researcher at Budapest University of Technology and Economics, with a customised version of the HunAlign software.

For more information, you can send an e-mail to machinetranslation@ep.europa.eu.

 

References - Relevant publications

For a more detailed description of DCEP and when making reference to DCEP in scientific publications, please refer to:

To compare DCEP with the other linguistic resources distributed by EU institutions, see: