Fourmilog: None Dare Call It Reason: January 22, 2009 Archives


Thursday, January 22, 2009

Reading List: Damp Squid

Butterfield, Jeremy. Damp Squid. Oxford: Oxford University Press, 2008. ISBN 978-0-19-923906-1.

Dictionaries attempt to capture how language (or at least the words of which it is composed) is used, or in some cases should be used according to the compiler of the dictionary, and in rare examples, such as the monumental Oxford English Dictionary (OED), to trace the origin and history of the use of words over time. But dictionaries are no better than the source material upon which they are based, and even the OED, with its millions of quotations contributed by thousands of volunteer readers, can only sample a small fraction of the written language. Further, there is much more to language than the definitions of words: syntax, grammar, regional dialects and usage, changes due to the style of writing (formal, informal, scholarly, etc.), associations of words with one another, differences between spoken and written language, and evolution of all of these matters and more over time. Before the advent of computers and, more recently, the access to large volumes of machine-readable text afforded by the Internet, research into these aspects of linguistics was difficult and extraordinarily tedious, and its accuracy was suspect due to the small sample sizes necessarily used in studies.

Computer linguistics sets out to study how a language is actually used by collecting a large quantity of text (called a corpus), tagged with identifying information useful for the intended studies, and permitting measurement of the statistics of the content of the text. The first computerised corpus was created in 1961, containing the then-staggering number of one million words. (Note that since a corpus contains extracts of text, the word count refers to the total number of words, not the number of unique words—as we'll see shortly, a small number of words accounts for a large fraction of the text.) The preeminent research corpus today is the Oxford English Corpus which, in 2006, surpassed two billion words and is presently growing at the rate of 350 million words a year—ain't the Web grand, or what?

This book, which is a pure delight, compelling page turner, and must-have for all fanatic “wordies”, is a light-hearted look at the state of the English language today: not what it should be, but what it is. Traditionalists and fussy prescriptivists (among whom I count myself) will be dismayed at the battles already lost: “miniscule” and “straight-laced” already outnumber “minuscule” and “strait-laced”, and many other barbarisms and clueless coinages are coming on strong. Less depressing and more fascinating are the empirical research on word frequency (Zipf's Law is much in evidence here, although it is never cited by name)—the ten most frequent words make up 25% of the corpus, and the top one hundred account for fully half of the text—word origins, mutation of words and terms, association of words with one another, idiomatic phrases, and the way context dictates the choice of words which most English speakers would find almost impossible to distinguish by definition alone. This amateur astronomer finds it heartening to discover that the most common noun modified by the adjective “naked” is “eye” (1398 times in the corpus; “body” is second at 1144 occurrences). If you've ever been baffled by the origin of the idiom “It's raining cats and dogs” in English, just imagine how puzzled the Welsh must be by “Bwrw hen wragedd a ffyn” (“It's raining old women and sticks”).
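For readers who would like to try this kind of counting on their own text, here is a minimal sketch in Python. It is not the method used by the book or by the Oxford English Corpus; the function names (tokenize, top_n_coverage), the crude tokenising rule, and the toy sample sentence are all assumptions made up for illustration. It shows the difference between the total word count and the number of unique words, and computes what fraction of a text the N most frequent words account for.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens.

    Hypothetical, deliberately crude rule: a token is any run of
    letters and apostrophes. A real corpus would use something richer.
    """
    return re.findall(r"[a-z']+", text.lower())


def top_n_coverage(tokens, n):
    """Return the fraction of all tokens made up by the n most frequent words."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(n))
    return covered / len(tokens)


# Toy sample, invented for this example (not taken from any corpus).
sample = "the cat sat on the mat and the dog sat on the log"
tokens = tokenize(sample)

print("total words: ", len(tokens))       # every occurrence counts
print("unique words:", len(set(tokens)))  # each distinct word counts once
print("share of text covered by the top 3 words:", top_n_coverage(tokens, 3))
```

Run on a large body of text in place of the toy sentence, top_n_coverage(tokens, 10) and top_n_coverage(tokens, 100) give figures you can compare with the 25% and one-half cited above, though the results will depend heavily on the text chosen and on how words are split.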

The title? It's an example of an “eggcorn” (pp. 58–59): a common word or phrase which mutates into a similar-sounding one as speakers who can't puzzle out its original, now obscure, meaning try to make sense of it. Now that the safetyland culture has made most people unfamiliar with explosives, “damp squib” becomes “damp squid” (although, if you're a squid, it's not being damp that's a problem). Other eggcorns marching their way through the language are “baited breath”, “preying mantis”, and “slight of hand”.

Posted at 00:05

John Walker's Fourmilab Change Log
