
Publication

Dutch Parallel Corpus: A Balanced Copyright-Cleared Parallel Corpus

Journal contribution - Journal article

This paper presents the Dutch Parallel Corpus, a high-quality parallel corpus for Dutch, French and English consisting of more than ten million words. The corpus contains five different text types and is balanced with respect to text type and translation direction. All texts included in the corpus have been copyright-cleared. We discuss the importance of parallel corpora in various research domains and contrast the Dutch Parallel Corpus with existing parallel corpora. The Dutch Parallel Corpus distinguishes itself from other parallel corpora by its balanced composition and by its availability to the wider research community, thanks to its copyright clearance. All texts in the corpus are sentence-aligned and further enriched with basic linguistic annotations (lemmas and word class information). Approximately 25,000 words of the Dutch-English part have been manually aligned at the sub-sentential level. Rich metadata makes the corpus easier to navigate and enables users to select the texts that satisfy their needs. The entire corpus is released as full texts in XML format and is also available via a web interface, which supports basic and complex search queries and presents the results as parallel concordances. The corpus will be distributed by the Flemish-Dutch Human Language Technology Agency (TST-Centrale).
Journal: Meta: Journal des Traducteurs
ISSN: 0026-0452
Issue: 2
Volume: 56
Pages: 374 - 390
Year of publication: 2011
BOF-keylabel: yes
IOF-keylabel: yes
CSS-citation score: 2
Authors from: Higher Education