DCEP is available as full-text documents and as sentence-aligned data. It includes alignment information both for full documents and for sentences, produced separately for each language pair. DCEP is accompanied by tools that produce sentence-aligned corpora separately for each of the 276 language pairs. The sentence-aligned data is in plain text format; XML/TMX output is not supported.
The following document types are included in the current version of the DCEP corpus:
|Document type||Brief description|
|AGENDA||Agenda of the plenary session meetings|
|COMPARL||Draft Agenda of the part-session|
|IM-PRESS and PRESS||General texts and articles on parliamentary news seen from a national angle, specific to one or several Member States, presentation of events in the EP|
|IMP-CONTRIB||Various press documents including technical announcements, events (hearings, workshops) produced by the Parliamentary Committees|
|MOTION||Motions for resolutions put to the vote in plenary|
|PV||Minutes of plenary sittings|
|REPORT||Reports of the parliamentary committees|
|RULES-EP||The Rules of Procedure of the EP, laying down the rules for the internal operation and organisation of the EP|
|TA (Adopted Texts)||The motions for resolutions and reports tabled by Members and by the parliamentary committees are put to the vote in plenary, with or without a debate. After the vote, the final texts as adopted are published and forwarded to the authorities concerned|
|WQ (Written Question)||Written questions are texts from Members of the EP which request an answer in writing|
|WQA (Written Question Answer)||The written answer to the parliamentary written questions|
|OQ (Oral Question)||Oral questions are asked in plenary sitting and included in the day's debates|
|QT (Questions for Question Time)||Questions for question time are asked during the period set aside for questions during plenary sittings|
Detailed statistics on DCEP are provided in separate tables. The tables summarise the corpus size in terms of:
- Number of documents;
- Number of words;
- Number of unique words.
Words have been counted with the wc Unix utility after removing the mark-up. Unique words, by contrast, have been counted on tokenised text, taking into consideration only words composed of alphabetical characters. The first table presents figures per language and document type, while the second contains statistics per language pair.
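The two counting procedures can be illustrated with a minimal Python sketch. It is not the tooling used to build the published tables: the tag-stripping pattern, the tokenisation rule and the case-sensitive comparison are all assumptions made for illustration, so the results will not necessarily match the official figures.

```python
import re

# Assumption: mark-up consists of XML/HTML-style tags; the real clean-up step is not specified.
TAG_RE = re.compile(r"<[^>]+>")
# Assumption: a simple tokeniser that separates word-like runs from punctuation marks.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def strip_markup(text: str) -> str:
    """Replace each tag with a space so that neighbouring words do not merge."""
    return TAG_RE.sub(" ", text)


def count_words(text: str) -> int:
    """Count whitespace-delimited tokens, in the manner of `wc -w`."""
    return len(strip_markup(text).split())


def count_unique_words(text: str) -> int:
    """Count distinct tokens made up of alphabetical characters only (case-sensitive)."""
    tokens = TOKEN_RE.findall(strip_markup(text))
    return len({tok for tok in tokens if tok.isalpha()})


sample = "<p>The Parliament adopted 3 reports; the Parliament voted.</p>"
print(count_words(sample))         # 8
print(count_unique_words(sample))  # 6
```

Note that the word count includes the token `3` and treats `reports;` as one word, whereas the unique-word count ignores digits and punctuation. This is why the two measures are not directly comparable.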
DCEP is the largest single release of documents published by an institution of the European Union. It contains various document types in 23 languages (253 language pairs). Here are some statistics:
- Total number of documents: 1.5 million;
- Total number of words: 1.37 billion;
- Total number of English segments: 7.7 million;
- The best-represented language in terms of number of words is English (103,458,996);
- French and Spanish each fall short of the English figure by less than 10%.
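The figure of 253 language pairs quoted above follows from counting unordered pairs among 23 languages, i.e. n(n - 1)/2. The short Python sketch below checks this arithmetic. The function name is chosen for illustration only.

```python
from itertools import combinations


def language_pairs(n_languages: int) -> int:
    """Number of unordered pairs among n languages: n * (n - 1) / 2."""
    return n_languages * (n_languages - 1) // 2


# Cross-check the closed form against explicit enumeration for 23 languages.
assert language_pairs(23) == len(list(combinations(range(23), 2)))

print(language_pairs(23))  # 253
print(language_pairs(24))  # 276
```

For comparison, the 276 pairs mentioned in connection with the alignment tools would correspond to 24 languages rather than 23.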
More statistics are available in the publication DCEP - Digital Corpus of the European Parliament (LREC'2014).
I. Intellectual property and conditions of use of data
The DCEP data is the exclusive property of the European Parliament. The Parliament cedes its non-exclusive rights free of charge and world-wide for research purposes for the entire duration of the protection of those rights to the re-user.
Any re-use of the data or of the structured elements contained in it is required to be identified by the re-user, who is under an obligation to state the source of the documents used: the website address, the date of the latest update and the fact that the European Parliament retains ownership of the data.
II. Conditions for use of software
The DCEP data is distributed with the software necessary for its exploitation/extraction. Use of such software must be carried out in accordance with the conditions laid down in the EUPL license.
The data and the accompanying software are made available, without any guarantee, explicit or tacit. The Parliament cannot be held responsible for any loss, injury or damage the re-user may suffer due to the re-use. The Parliament does not however guarantee the absence of any irregularities which may be present in the data, within the structured data they contain or the software itself. The Parliament does not guarantee the on-going distribution of said data and software.
The Parliament cannot be held responsible for any loss, injury or damage caused to third parties as a result of the re-use. The re-user shall bear sole responsibility for the re-use of the data collection, the structured elements it contains and the software. Re-use must not mislead third parties in respect of the contents of the data and the structured elements it contains, the source of the contents or the date of the last update thereto. This disclaimer is not intended to limit the liability of the Parliament in violation of any requirements laid down in applicable national law or to exclude its liability in cases where this is not permitted by the applicable law.
Please choose the download options on the DCEP download page.
To extract the bilingual corpus, which has been aligned at the sentence level using the HunAlign sentence aligner, please follow the instructions on the readme page.
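Once a bilingual corpus has been extracted, the plain-text output can be loaded with a few lines of Python. The sketch below assumes one aligned sentence pair per line, with source and target separated by a single tab. This layout is an assumption made for illustration, so please check the readme page for the format the extraction tools actually produce.

```python
import io
from typing import Iterable, Iterator, Tuple


def read_sentence_pairs(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) pairs from tab-separated lines.

    Assumption: one aligned pair per line, source and target separated
    by a single tab. Lines that do not fit this layout are skipped.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            continue  # blank or malformed line under the assumed layout
        source, target = (field.strip() for field in fields)
        if source and target:
            yield source, target


# Hypothetical in-memory sample standing in for an extracted EN-FR file.
sample = io.StringIO("The sitting opened.\tLa séance est ouverte.\n\nmalformed line\n")
pairs = list(read_sentence_pairs(sample))
print(pairs)
```

For a real file, an open file handle (for example `open(path, encoding="utf-8")`, where `path` is a placeholder) can be passed in place of the in-memory sample.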
DCEP has been created and published by the Machine Translation team of the European Parliament's Directorate-General for Translation (DGTRAD), represented by Najeh Hajlaoui (Machine Translation Expert at the European Parliament). DGTRAD was supported by Jaakko Väyrynen and Ralf Steinberger from the European Commission's Joint Research Centre. The sentence alignment was produced by Dániel Varga, researcher at Budapest University of Technology and Economics, with a customised version of the HunAlign software.
For more information, you can send an e-mail to firstname.lastname@example.org.
For a more detailed description of DCEP and when making reference to DCEP in scientific publications, please refer to:
To compare DCEP with the other linguistic resources distributed by EU institutions, see: