ACE 2005 Development Corpus
This corpus constitutes the training data for the 2005 Automatic Content Extraction (ACE) exercise. Each training file was annotated by two annotators working independently. Discrepancies between the two versions of each file were then adjudicated by a senior annotator or team leader, resulting in a gold standard file. After adjudication, TIMEX2 values were normalized (for English only). The corpus is available from the LDC under catalogue number LDC2006T06.
The distribution of files across domains in the corpus is as follows:
ACE 2005 Development corpus
Domain | Domain Code | #Docs | #Words | #TIMEX2 | Comments |
---|---|---|---|---|---|
Broadcast Conversation | BC | 60 | 40415 | 626 | |
Broadcast News | BN | 226 | 55967 | 1455 | |
Conversational Telephone Speech | CTS | 39 | 39845 | 409 | |
Newswire | NW | 106 | 48399 | 1235 | |
Usenet Newsgroups | UN | 49 | 37366 | 741 | |
Weblog | WL | 119 | 37897 | 1003 | |
Total | | 599 | 259889 | 5469 | |
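
As a quick sanity check on the table above, the following sketch recomputes the Total row and derives the number of TIMEX2 expressions per 1,000 words for each domain. The counts are copied from the table; the names `ACE2005_DEV`, `column_totals`, and `timex2_per_1000_words` are hypothetical helpers introduced for illustration and are not part of the corpus distribution.

```python
# Per-domain counts copied from the table above: (docs, words, TIMEX2 expressions).
# The dictionary and function names are illustrative, not part of the corpus.
ACE2005_DEV = {
    "BC": (60, 40415, 626),
    "BN": (226, 55967, 1455),
    "CTS": (39, 39845, 409),
    "NW": (106, 48399, 1235),
    "UN": (49, 37366, 741),
    "WL": (119, 37897, 1003),
}


def column_totals(table):
    """Sum each column (docs, words, TIMEX2) across all domains."""
    return tuple(sum(column) for column in zip(*table.values()))


def timex2_per_1000_words(table):
    """Return TIMEX2 expressions per 1,000 words, keyed by domain code."""
    return {
        code: 1000 * timex / words
        for code, (_docs, words, timex) in table.items()
    }


if __name__ == "__main__":
    print("Totals (docs, words, TIMEX2):", column_totals(ACE2005_DEV))
    for code, density in timex2_per_1000_words(ACE2005_DEV).items():
        print(f"{code}: {density:.1f} TIMEX2 per 1,000 words")
```

The density figures are a derived statistic, not something reported with the corpus; they only give a rough sense of how temporally dense each domain is relative to the others.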
page revision: 2, last edited: 11 Jan 2008 21:22