ACE 2005 Development Corpus

This corpus is the training data for the 2005 Automatic Content Extraction (ACE) exercise. Each training file was annotated independently by two annotators. Discrepancies between the two versions of each file were then adjudicated by a senior annotator or team leader, producing a gold-standard file. After adjudication, TIMEX2 values were normalized (for English only). The corpus is available from the Linguistic Data Consortium (LDC) under catalogue number LDC2006T06.

The distribution of files across domains in the corpus is as follows:

Domain                            Domain Code   #Docs   #Words   #TIMEX2   Comments
Broadcast Conversation            BC               60    40415       626
Broadcast News                    BN              226    55967      1455
Conversational Telephone Speech   CTS              39    39845       409
Newswire                          NW              106    48399      1235
Usenet Newsgroups                 UN               49    37366       741
Weblog                            WL              119    37897      1003
Total                                             599   259889      5469
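
The counts in the table above can be cross-checked with a few lines of Python. The following is a minimal sketch, not part of the corpus distribution: the names ROWS, totals and timex2_per_1000_words are hypothetical, and the figures are copied by hand from the table rather than read from the corpus files. It sums each column to compare against the Total row and derives the number of TIMEX2 expressions per 1000 words for each domain.

```python
# Per-domain counts copied by hand from the table above:
# (domain code, #Docs, #Words, #TIMEX2). Names here are illustrative only.
ROWS = [
    ("BC", 60, 40415, 626),
    ("BN", 226, 55967, 1455),
    ("CTS", 39, 39845, 409),
    ("NW", 106, 48399, 1235),
    ("UN", 49, 37366, 741),
    ("WL", 119, 37897, 1003),
]


def totals(rows):
    """Return (documents, words, TIMEX2 expressions) summed over all domains."""
    return tuple(sum(row[i] for row in rows) for i in (1, 2, 3))


def timex2_per_1000_words(rows):
    """Map each domain code to its TIMEX2 count per 1000 words (1 decimal)."""
    return {
        code: round(1000 * timex2 / words, 1)
        for code, _docs, words, timex2 in rows
    }


if __name__ == "__main__":
    print("Totals (docs, words, TIMEX2):", totals(ROWS))
    for code, density in timex2_per_1000_words(ROWS).items():
        print(f"{code}: {density} TIMEX2 per 1000 words")
```

The column sums agree with the Total row (599 documents, 259889 words, 5469 TIMEX2 expressions). The density figure is only a rough way to compare domains, since the word counts depend on how the source data was tokenized.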
