Automated Similarity Judgment Program

The Automated Similarity Judgment Program (ASJP) is a collaborative project that applies computational approaches to comparative linguistics using a database of word lists. The database is open access, consists of 40-item basic-vocabulary lists for well over half of the world's languages,^[1] and is continuously being expanded. In addition to isolates and languages of demonstrated genealogical groups, the database includes pidgins, creoles, mixed languages, and constructed languages. Words in the database are transcribed into a simplified standard orthography (ASJPcode).^[2] The database has been used to estimate the dates at which language families diverged into daughter languages by a method related to, but distinct from, glottochronology,^[3] to determine the homeland (Urheimat) of a proto-language,^[4] to investigate sound symbolism,^[5] to evaluate different phylogenetic methods,^[6] and for several other purposes.

History

Original goals

ASJP was originally developed as a means of objectively evaluating the similarity of words with the same meaning from different languages, with the ultimate goal of classifying languages computationally on the basis of the lexical similarities observed. In the first ASJP paper,^[7] two semantically identical words from the compared languages were judged similar if they showed at least two identical sound segments. Similarity between the two languages was calculated as the percentage of all compared words that were judged similar. This method was applied to 100-item word lists for 250 languages from language families including Austro-Asiatic, Indo-European, Mayan, and Muskogean.
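The judgment described above can be illustrated with a minimal Python sketch. It rests on several assumptions: the text does not say how segments are matched (by position, by order, or neither), so the sketch simply counts symbols shared between the two words regardless of position, treats each character as one sound segment, and allows one word per meaning. The function names and the toy word lists are hypothetical and are not taken from the ASJP software or database; the published criterion may differ in detail.

```python
from collections import Counter


def shared_segments(word_a: str, word_b: str) -> int:
    """Count symbols the two words have in common, ignoring position.

    Assumption: each character stands for one sound segment.
    """
    return sum((Counter(word_a) & Counter(word_b)).values())


def judged_similar(word_a: str, word_b: str, threshold: int = 2) -> bool:
    """Judge two words similar if they share at least `threshold` segments."""
    return shared_segments(word_a, word_b) >= threshold


def similarity_percentage(list_a: dict, list_b: dict) -> float:
    """Percentage of shared meanings whose words are judged similar.

    Each word list maps a meaning to a single transcribed word.
    """
    meanings = set(list_a) & set(list_b)
    if not meanings:
        raise ValueError("the word lists share no meanings")
    hits = sum(judged_similar(list_a[m], list_b[m]) for m in meanings)
    return 100.0 * hits / len(meanings)


# Toy, made-up word lists (not ASJP data).
toy_a = {"hand": "hand", "water": "vasa"}
toy_b = {"hand": "hant", "water": "mizu"}
print(similarity_percentage(toy_a, toy_b))
```

In this toy case only the words for "hand" share two or more symbols, so one of the two compared meanings counts as similar.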

The ASJP Consortium

The ASJP Consortium, founded around 2008, came to involve around 25 professional linguists and other interested parties who work as volunteer transcribers or aid the project in other ways. Their individual contributions are acknowledged on the page providing sources for the ASJP word lists.^[8] The main driving force behind the founding of the consortium was Cecil H. Brown. Søren Wichmann is the daily curator of the project. A third central member of the consortium is Eric W. Holman, who has created most of the software used in the project.

Shorter word lists

While the word lists used were originally based on the 100-item Swadesh list, it was statistically determined that a subset of 40 of the 100 items produced classificatory results just as good as, if not slightly better than, those from the whole list.^[9] Word lists gathered subsequently therefore contain only 40 items (or fewer, when attestations for some items are lacking).

Levenshtein distance

In papers published since 2008, ASJP has employed a similarity judgment program based on Levenshtein distance (LD). This approach was found to produce better classificatory results, measured against expert opinion, than the method used initially. LD is defined as the minimum number of successive changes necessary to convert one word into another, where each change is the insertion, deletion, or substitution of a symbol. Within the Levenshtein approach, differences in word length can be corrected for by dividing LD by the number of symbols in the longer of the two compared words. This produces the normalized LD (LDN). A further measure, LDN divided (LDND), is calculated for a pair of languages by dividing the average LDN for all word pairs involving the same meaning by the average LDN for all word pairs involving different meanings. This second normalization is intended to correct for chance similarity.^[10]
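The three measures defined above can be sketched in Python as follows. This is an illustrative sketch, not the ASJP project's own software. It assumes that each character of a transcribed word counts as one symbol and that each meaning has a single word per language. The function names and the toy word lists are hypothetical.

```python
def levenshtein(word_a: str, word_b: str) -> int:
    """LD: minimum number of insertions, deletions, or substitutions."""
    previous = list(range(len(word_b) + 1))
    for i, sym_a in enumerate(word_a, start=1):
        current = [i]
        for j, sym_b in enumerate(word_b, start=1):
            cost = 0 if sym_a == sym_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def ldn(word_a: str, word_b: str) -> float:
    """LDN: LD divided by the length of the longer word."""
    longest = max(len(word_a), len(word_b))
    return levenshtein(word_a, word_b) / longest if longest else 0.0


def ldnd(list_a: dict, list_b: dict) -> float:
    """LDND: mean LDN over same-meaning pairs divided by
    mean LDN over different-meaning pairs.

    Each word list maps a meaning to a single transcribed word.
    """
    meanings = sorted(set(list_a) & set(list_b))
    same = [ldn(list_a[m], list_b[m]) for m in meanings]
    different = [ldn(list_a[m], list_b[n])
                 for m in meanings for n in meanings if m != n]
    if not same or not different:
        raise ValueError("need at least two shared meanings")
    mean_different = sum(different) / len(different)
    if mean_different == 0:
        raise ValueError("different-meaning pairs are all identical")
    return (sum(same) / len(same)) / mean_different


# Toy, made-up word lists (not ASJP data).
toy_a = {"one": "ab", "two": "cd"}
toy_b = {"one": "ab", "two": "ce"}
print(levenshtein("kitten", "sitting"))
print(ldnd(toy_a, toy_b))
```

Under this reading, an LDND close to 0 means the same-meaning words of the two lists are much more alike than randomly paired words, while a value near 1 means they are no more alike than words with different meanings.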

References

External links

ASJP official home page

[1] Wichmann, Søren, André Müller, Viveka Velupillai, Annkathrin Wett, Cecil H. Brown, Zarina Molochieva, Julia Bishoffberger, Eric W. Holman, Sebastian Sauppe, Pamela Brown, Dik Bakker, Johann-Mattis List, Dmitry Egorov, Oleg Belyaev, Matthias Urban, Harald Hammarström, Agustina Carrizo, Robert Mailhammer, Helen Geyer, David Beck, Evgenia Korovina, Pattie Epps, Pilar Valenzuela, and Anthony Grant. 2012. The ASJP Database (version 15). http://email.eva.mpg.de/~wichmann/languages.htm

[2] Brown, Cecil H., Eric W. Holman, Søren Wichmann, and Viveka Velupillai. 2008. Automated classification of the world's languages: A description of the method and preliminary results. STUF – Language Typology and Universals 61.4: 285-308.

[3] Holman, Eric W., Cecil H. Brown, Søren Wichmann, André Müller, Viveka Velupillai, Harald Hammarström, Sebastian Sauppe, Hagen Jung, Dik Bakker, Pamela Brown, Oleg Belyaev, Matthias Urban, Robert Mailhammer, Johann-Mattis List, and Dmitry Egorov. 2011. Automated dating of the world’s language families based on lexical similarity. Current Anthropology 52.6: 841-875.

[4] Wichmann, Søren, André Müller, and Viveka Velupillai. 2010. Homelands of the world’s language families: A quantitative approach. Diachronica 27.2: 247-276.

[5] Wichmann, Søren, Eric W. Holman, and Cecil H. Brown. 2010. Sound symbolism in basic vocabulary. Entropy 12.4: 844-858.

[6] Pompei, Simone, Vittorio Loreto, and Francesca Tria. 2011. On the accuracy of language trees. PLoS ONE 6: e20109.

[7] Brown, Cecil H., Eric W. Holman, Søren Wichmann, and Viveka Velupillai. 2008. Automated classification of the world's languages: A description of the method and preliminary results. STUF – Language Typology and Universals 61.4: 285-308.

[8] https://lingweb.eva.mpg.de/asjp/index.php/ASJP

[9] Holman, Eric W., Søren Wichmann, Cecil H. Brown, Viveka Velupillai, André Müller, and Dik Bakker. 2008. Explorations in automated language classification. Folia Linguistica 42.2: 331-354.

[10] Wichmann, Søren, Eric W. Holman, Dik Bakker, and Cecil H. Brown. 2010. Evaluating linguistic distance measures. Physica A 389: 3632-3639 (doi:10.1016/j.physa.2010.05.011).
