ACE 2004 Development Corpus

This corpus contains the English training data prepared for the 2004 Time Expression Recognition and Normalization (TERN) evaluation. The evaluation was held in August 2004, followed by the corresponding workshop in September 2004. Evaluation participants received this data for training purposes; the corpus is now publicly available and distributed by the LDC. The corpus consists of 862 documents containing a total of about 306k words and nearly 9k TIMEX2 expressions. The documents are divided into three subsets:

  1. ACE2002: this data was originally prepared for the ACE 2002 Relation Detection and Characterization (RDC) evaluation; it was then re-annotated with TIMEX2 tags by two annotators, and the annotations were reconciled.
  2. ACE2003: this contains the training data used in the ACE 2003 evaluation. For the release contained in this corpus, the files were doubly annotated with TIMEX2 tags and reconciled.
  3. ACE2004: this contains the data prepared for the ACE 2004 evaluation. All of the files were doubly annotated and reconciled.

The corpus is available at LDC under the catalogue number LDC2005T07.

The tables below show the domains, the number of documents, and the number of words and TIMEX2 expressions in each corpus subset. The word counts are those provided by the corpus developers (an informal analysis indicates that our own word counts differ slightly).

ACE2002 Subset

Domain          Domain Code  #Docs  #Words  #TIMEX2  Comments
Broadcast News  BN              85   17922      628
Newspaper       NP              17   14682      337
Newswire        NW              78   34134      926
Total                          180   66738     1891

ACE2003 Subset

Domain          Domain Code  #Docs  #Words  #TIMEX2  Comments
Broadcast News  BN             147   34681     1050
Newswire        NW             102   58592     1547
Total                          249   93273     2597

ACE2004 Subset

Domain                         Domain Code  #Docs  #Words  #TIMEX2  Comments
Arabic Treebank (translated)   AT              58   13466      526  No document creation date available
Broadcast News                 BN             222   61621     1848
Chinese Treebank (translated)  CT              37   12522      365  No document creation date available
Newswire                       NW             116   58543     1711
Total                                         433  146152     4450
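
The per-domain figures in the three tables can be cross-checked against the subset totals and the overall figures quoted in the introduction (862 documents, 306k words, nearly 9k TIMEX2 expressions). The following is a minimal sketch of such a check; the names SUBSETS, subset_totals and corpus_totals are illustrative only and are not part of the corpus distribution, and the numbers are simply copied from the tables on this page.

```python
# Per-domain figures copied from the tables on this page:
# domain code -> (documents, words, TIMEX2 expressions)
SUBSETS = {
    "ACE2002": {
        "BN": (85, 17922, 628),
        "NP": (17, 14682, 337),
        "NW": (78, 34134, 926),
    },
    "ACE2003": {
        "BN": (147, 34681, 1050),
        "NW": (102, 58592, 1547),
    },
    "ACE2004": {
        "AT": (58, 13466, 526),
        "BN": (222, 61621, 1848),
        "CT": (37, 12522, 365),
        "NW": (116, 58543, 1711),
    },
}


def subset_totals(domains):
    """Sum documents, words and TIMEX2 counts over all domains of one subset."""
    return tuple(sum(column) for column in zip(*domains.values()))


def corpus_totals(subsets):
    """Sum the subset totals over all subsets of the corpus."""
    per_subset = [subset_totals(domains) for domains in subsets.values()]
    return tuple(sum(column) for column in zip(*per_subset))


if __name__ == "__main__":
    for name, domains in SUBSETS.items():
        docs, words, timex = subset_totals(domains)
        # TIMEX2 density is a derived figure, not one reported by the corpus developers.
        density = 1000 * timex / words
        print(f"{name}: {docs} docs, {words} words, {timex} TIMEX2 "
              f"({density:.1f} per 1000 words)")
    print("Corpus:", corpus_totals(SUBSETS))
```

With the figures above, the corpus totals come to 862 documents, 306163 words and 8938 TIMEX2 expressions, which agrees with the rounded numbers given in the introduction.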