
ACL Data Collection Initiative


The ACL Data Collection Initiative (ACL/DCI) is a project established in 1989 by the Association for Computational Linguistics (ACL) to create and distribute large text and speech corpora for research. It aimed to meet the growing need for substantial text databases in areas such as natural language processing, speech recognition, and computational linguistics.

Objectives

The ACL/DCI had several key objectives:

  • To acquire a large and diverse text corpus from various sources
  • To transform the collected texts into a common format based on the Standard Generalized Markup Language (SGML)
  • To make the corpus available for scientific research at low cost with minimal restrictions
  • To provide a common database that would allow researchers to replicate or extend published results
  • To reduce duplication of effort among researchers in obtaining and preparing text data

These objectives were designed to address the growing demand for very large amounts of text arising from applications in recognition and analysis of text and speech.[1]

History

By the late 1980s, researchers in computational linguistics and speech recognition faced a significant problem: the lack of large-scale, accessible text corpora for developing statistical models and testing algorithms. Existing generally available text databases were too small to meet the needs of developing applications in text and speech recognition. The initiative was formed to meet this need by collecting, standardizing, and distributing large quantities of text data with minimal restrictions for scientific research. As stated by Liberman (1990), "research workers have been severely hampered by the lack of appropriate materials, and specially by the lack of a large enough body of text on which published results can be replicated or extended by others."[1]

The ACL/DCI committee was established in February 1989. The committee included members from academic and industrial research laboratories in the United States and Europe.[2]

The initiative was chaired by Mark Liberman from the University of Pennsylvania (formerly of AT&T Bell Laboratories). Other committee members included representatives from organizations such as Bellcore, IBM T.J. Watson Research Center, Cambridge University, Virginia Polytechnic Institute & State University, Northeastern University, University of Pennsylvania, SRI International, MCC, Xerox PARC, ISSCO, and University of Pisa.[2]

The project operated initially without dedicated funding, relying on volunteer efforts from committee members and their affiliated institutions. Key supporters included AT&T Bell Labs, Bellcore, IBM, Xerox, and the University of Pennsylvania, which allowed the use of their computing facilities for ACL/DCI-related work.[1]

Data

As of 1990, the ACL/DCI had collected hundreds of millions of words of diverse text.[1]

The initiative started with North American English text but expanded to include Canadian French and planned to include Japanese, Chinese, and other Asian languages.[1]

Format

The ACL/DCI corpus was coded in a standard form based on SGML (Standard Generalized Markup Language, ISO 8879),[1] consistent with the recommendations of the Text Encoding Initiative (TEI), of which the DCI was an affiliated project. The TEI was a joint project of the ACL, the Association for Computers and the Humanities, and the Association for Literary and Linguistic Computing, aiming to provide a common interchange format for literary and linguistic data.

The initiative planned to add annotations reflecting consensually approved linguistic features like part of speech and various aspects of syntactic and semantic structure over time.[1]
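
To illustrate what working with SGML-style markup involves, the following is a minimal Python sketch. The tag names (`doc`, `docid`, `p`, `s`) and the sample text are hypothetical and chosen only for illustration; they are not taken from the actual ACL/DCI distribution, and a real SGML document would normally be processed against its DTD rather than with regular expressions.

```python
import re
from typing import List, Optional

# Hypothetical SGML-like sample; the tag set is an assumption, not the DCI's.
SAMPLE = """<doc>
<docid> example-0001 </docid>
<p>
<s> The committee was established in 1989 . </s>
<s> It distributed text on tape . </s>
</p>
</doc>"""


def extract_field(sgml: str, tag: str) -> Optional[str]:
    """Return the stripped content of the first <tag>...</tag> pair, or None."""
    name = re.escape(tag)
    match = re.search(rf"<{name}>(.*?)</{name}>", sgml, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def extract_sentences(sgml: str) -> List[str]:
    """Return the content of every <s>...</s> element, whitespace-normalized."""
    found = re.findall(r"<s>(.*?)</s>", sgml, flags=re.DOTALL)
    return [" ".join(chunk.split()) for chunk in found]


if __name__ == "__main__":
    print(extract_field(SAMPLE, "docid"))
    for sentence in extract_sentences(SAMPLE):
        print(sentence)
```

The sketch only handles non-nested, explicitly closed elements; SGML features such as omitted end tags or entity references would require a proper parser.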

Examples

Wall Street Journal Corpus

One example of the use of ACL/DCI material is the Wall Street Journal (WSJ) corpus, which served as the basis for the DARPA Spoken Language System (SLS)[4] community's Continuous Speech Recognition (CSR) Corpus.[5] The WSJ corpus became a standard benchmark for evaluating speech recognition systems and has been used in numerous research papers.

The WSJ CSR Corpus provided DARPA with its first general-purpose English corpus with a large vocabulary, natural language, and high perplexity, containing both speech (400 hours) and text (47 million words).[5]

The text was preprocessed to remove ambiguity about the word sequence a reader might produce when reading it aloud, so that the unread text used to train language models was representative of the spoken test material. The preprocessing included spelling out numbers as words, expanding abbreviations, resolving apostrophes and quotation marks, and marking punctuation.[5]
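
The kind of normalization described above can be sketched in a few lines of Python. This is an illustrative approximation only, not the procedure used for the WSJ CSR Corpus: the abbreviation table, the punctuation tokens (`,COMMA`, `.PERIOD`), and the restriction to integers from 0 to 999 are all assumptions made for the example.

```python
import re

# Hypothetical abbreviation table; the real corpus rules are not reproduced here.
ABBREVIATIONS = {"Mr.": "Mister", "Corp.": "Corporation", "Inc.": "Incorporated"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999."""
    if not 0 <= n <= 999:
        raise ValueError("only 0..999 is supported in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")


def normalize(text: str) -> str:
    """Expand abbreviations, spell out short integers, and mark punctuation."""
    # Expand abbreviations first so their periods are not marked as punctuation.
    for abbreviation, expansion in ABBREVIATIONS.items():
        text = text.replace(abbreviation, expansion)
    # Spell out standalone integers of one to three digits.
    text = re.sub(r"\b\d{1,3}\b",
                  lambda m: number_to_words(int(m.group())), text)
    # Turn punctuation into explicit tokens (token names are assumptions).
    text = re.sub(r",", " ,COMMA", text)
    text = re.sub(r"\.", " .PERIOD", text)
    return " ".join(text.split())


if __name__ == "__main__":
    print(normalize("Mr. Smith paid 42 dollars, then left."))
```

A production normalizer would also need to handle cases this sketch ignores, such as years, decimals, currency amounts, and numbers with thousands separators.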

Distribution

Materials from the ACL/DCI collection were distributed to research groups on a non-commercial basis. By 1990, about 25 research groups and individual researchers had received tapes containing various portions of the collected material.[1]

To obtain the data, researchers had to sign an agreement not to redistribute the data or make direct commercial use of it. However, commercial application of "analytical materials" derived from the text, such as statistical tables or grammar rules, was explicitly permitted.[1]

The initiative first distributed data via 12-inch reels of 9-track tape, then via CD-ROMs.[1]

References

  1. Liberman, Mark Y. (1990). "The ACL data collection initiative". Proceedings of the 5th Jerusalem Conference on Information Technology. IEEE. pp. 781–786.
  2. Liberman, Mark (1989). "Text on Tap: the ACL/DCI". Speech and Natural Language: Proceedings of a Workshop Held at Cape Cod, Massachusetts, October 15-18, 1989. pp. 173–178.
  3. MacWhinney, Brian; Snow, Catherine (June 1990). "The Child Language Data Exchange System: an update". Journal of Child Language. 17 (2): 457–472. doi:10.1017/S0305000900013866. ISSN 0305-0009. PMC 9807025. PMID 2380278.
  4. Sears, J. Allen (1988-11-01). "The DARPA spoken language systems program: Past, present, and future". The Journal of the Acoustical Society of America. 84 (S1): S188. doi:10.1121/1.2026042. ISSN 0001-4966.
  5. Paul, Douglas B.; Baker, Janet (1992). "The Design for the Wall Street Journal-based CSR Corpus". Speech and Natural Language: Proceedings of a Workshop Held at Harriman, New York, February 23-26, 1992.