User:TurboSuperA+/Essays/Google Ngram

This is an essay.

It contains the advice or opinions of one or more Wikipedia contributors. This page is not an encyclopedia article, nor is it one of Wikipedia's policies or guidelines, as it has not been thoroughly vetted by the community. Some essays represent widespread norms; others only represent minority viewpoints.

Shortcut

WP:NGRAM

This page in a nutshell: Google Ngram is a way to search how often words appear in books digitised by Google Books. It has many problems and should be avoided in discussions regarding capitalisation of terms.

Problems editors have identified with Google Ngram results:

Anyone can add their self-published book's ISBN to Google Books, and self-published books skew the results. [1]
It has separate British English and American English corpora [2], yet it has no reliable way of telling the two varieties apart. [3]
It is unreliable for books published after 2019: "The expansion of the corpus after 2019 is when junk books were dumped into the data set." [4]
Cut-off points can be arbitrarily decided [5], making discussions more complicated.
It is impossible to see the context of how a word is used. [6] "It doesn't filter out headlines, proper nouns, captions, indexes, etc." [7]
Depending on the number of hits, it may omit results. [8]
Results dated after a Wikipedia article on the topic was created should be discounted. [9]
"Google Books sources can be highly imbalanced in certain instances. In one case, I saw a sudden spike in usage of a term in the 1950s and upon investigation found that that spike was entirely due to its use in UN documents. Google had a large number of these documents scanned, and they swamped usage of other print materials in the same era." [10]
"it is an arbitrary corpus interpreted by unreliable OCR that may or may not reflect actual usage trends and is virtually guaranteed to create ghost trends if enough comparisons are generated" [11]
"it allows one to select English fiction as the corpus, but it does not allow one to select English non-fiction as the corpus." [12]
Editors disagree on how the results should be interpreted. [13] [14] [15]

Problems with Google Ngram identified by academics:

A 2015 study published in PLOS found that "the Google Books corpus suffers from a number of limitations which make it an obscure mask of cultural popularity. A primary issue is that the corpus is in effect a library, containing one of each book. A single, prolific author is thereby able to noticeably insert new phrases into the Google Books lexicon, whether the author is widely read or not. With this understood, the Google Books corpus remains an important data set to be considered more lexicon-like than text-like. Here, we show that a distinct problematic feature arises from the inclusion of scientific texts, which have become an increasingly substantive portion of the corpus throughout the 1900s. The result is a surge of phrases typical to academic articles but less common in general, such as references to time in the form of citations. ... Overall, our findings call into question the vast majority of existing claims drawn from the Google Books corpus, and point to the need to fully characterize the dynamics of the corpus before using these data sets to draw broad conclusions about cultural and linguistic evolution." [16]
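The "one of each book" point can be illustrated with a toy calculation. The sketch below is not Ngram data and does not reproduce the study's method: the book counts, word counts, and readership figures are all invented, and the function names are hypothetical. It only shows how a phrase's relative frequency can differ when every book counts once (library-like) compared with when books are weighted by how widely they are read.

```python
# Toy illustration only: every number here is invented, not taken from Google Ngram.
books = (
    # Nine obscure books by one prolific author who favours a coined phrase.
    [{"phrase_hits": 40, "words": 10_000, "readers": 100} for _ in range(9)]
    # One widely read book that never uses the phrase.
    + [{"phrase_hits": 0, "words": 10_000, "readers": 1_000_000}]
)


def library_frequency(books):
    """Relative frequency when every book counts exactly once."""
    hits = sum(b["phrase_hits"] for b in books)
    words = sum(b["words"] for b in books)
    return hits / words


def readership_frequency(books):
    """Relative frequency when each book is weighted by its (assumed) readership."""
    hits = sum(b["phrase_hits"] * b["readers"] for b in books)
    words = sum(b["words"] * b["readers"] for b in books)
    return hits / words


lib = library_frequency(books)
read = readership_frequency(books)
print(f"one copy per book:      {lib:.2e}")
print(f"weighted by readership: {read:.2e}")
```

With these made-up figures, the phrase looks roughly a thousand times more common in the one-copy-per-book count than in the readership-weighted count, which is the kind of distortion the study attributes to a single prolific author.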

See also