In my presentation at DH2012 I made a couple of comments on Giles Goodland’s paper ‘OED Online’s Single-Quotations Entries: an Analysis’, mostly about the sampling method that Goodland employs, and which everyone else has employed so far when trying to say something about the OED that isn’t facilitated by whatever the current online functionality happens to be. In particular I pointed out that selecting an alphabetical range throws in a ton of variables, such as the time frame when that range was compiled (potentially at decades’ distance from other parts of the Dictionary) and the linguistic properties of the words within it. As Goodland himself points out, a range like his (RAB-RESCUSSEE) will include a lot of ‘re-’ words, where you might expect to find more than the average amount of poetic neologism (for metrical purposes, for instance, it can be useful to have a spare syllable, like ‘re’, to tag onto a word). I added that the dependence of poetic language on various kinds of linguistic repetition might also make words with very common syllables (or onsets, or codas, etc.) relatively more common in poetry than in other kinds of discourse.
So following on from yesterday’s experiment with whole words, today I’ve had a look at the incidence of poetic quotations within bigram, trigram, and 4-gram alphabetical ranges in OED2. Note that what I’m measuring here is not how often a range of words occurs in poetry, but how much a range of words is illustrated by poetic quotations in the OED.
[UPDATE 4.6.13 – I’ve since re-run this script with a newer version of the genre-tagged OED2. The numbers below are therefore out of date and should not be considered accurate, though in general the conclusions remain relevant]
First, here’s a chart of bigram ranges from ab to wr in the dictionary, with the total number of headwords in the range, and then the percentage of tagged quotations in that range that show up as either poetry or verse drama. I’ve filtered out ranges with fewer than 500 headwords in them.
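For anyone who wants to try something similar, the measure described above can be sketched roughly as follows. This is not the script I actually ran, just a minimal illustration under some loud assumptions: the input format (a headword paired with a list of genre tags, one per tagged quotation), the tag labels `"poetry"` and `"verse drama"`, and the function name `poetic_share_by_prefix` are all hypothetical, and stripping non-letters before taking the prefix is a simplification of my own.

```python
from collections import defaultdict

# Assumed genre labels; the real tag names in a genre-tagged OED2 may differ.
POETIC_GENRES = {"poetry", "verse drama"}


def poetic_share_by_prefix(entries, n=2, min_headwords=500):
    """Group entries by their first n letters and report, per range,
    (prefix, headword count, percentage of tagged quotations that are poetic).

    entries: iterable of (headword, genres), where genres is a list holding
    one genre tag per tagged quotation (a hypothetical input format).
    """
    headwords = defaultdict(int)
    tagged = defaultdict(int)
    poetic = defaultdict(int)

    for headword, genres in entries:
        # Simplification: ignore hyphens, spaces and other non-letters.
        letters = "".join(ch for ch in headword.lower() if ch.isalpha())
        if len(letters) < n:
            continue  # too short to belong to any n-letter range
        prefix = letters[:n]
        headwords[prefix] += 1
        tagged[prefix] += len(genres)
        poetic[prefix] += sum(1 for g in genres if g in POETIC_GENRES)

    rows = []
    for prefix, count in headwords.items():
        # Drop small ranges, and ranges with no tagged quotations at all.
        if count < min_headwords or tagged[prefix] == 0:
            continue
        rows.append((prefix, count, 100.0 * poetic[prefix] / tagged[prefix]))

    # Highest percentage of poetic quotations first.
    return sorted(rows, key=lambda row: row[2], reverse=True)
```

Calling it with `n=3` or `n=4` and a lower `min_headwords` gives the trigram and 4-gram versions of the same table.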
The chart appears to contradict my speculation that letter ranges with more words in them would be illustrated by relatively more poetic quotations. In fact the correlation coefficient between headword count and percentage of poetic quotations in this dataset is an underwhelming r=0.19. I may still be right, or more right, in the restricted case of ‘single-quotation entries’, but I have yet to write a script to test this rigorously [UPDATE: Now I have. I wasn’t].
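For reference, a correlation coefficient of this kind can be computed along the following lines. This is a minimal sketch of the standard Pearson formula in plain Python, not my original code; the function name `pearson_r` and the idea of passing the two columns as parallel lists are assumptions made for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences,
    e.g. headword counts and poetic-quotation percentages per range."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sum of cross-products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)

    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")

    return cov / math.sqrt(var_x * var_y)
```

Squaring r gives a rough sense of scale: 0.19 squared is about 0.04, so a linear relationship with headword count would account for only around 4% of the variance in the percentages.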
The graph does show quite a lot of variance based on range, which I think merits some investigation. So let’s have a look at the top and bottom ends of the range:
I can think of at least one good reason why the bottom range looks the way it does: all of those ranges house a large number of scientific and technical terms (poly-, sys-, epi-, phen-, etc.). The only puzzling member of that list is ‘ir’, which, like ‘un-’, is a productive negation prefix, in this case for words starting with ‘r’ (‘un’ scores an above average but not mind-blowing 18%).
The top of the list is interesting for a couple of reasons. ‘W’ is in each bigram, which may be a feature of the same effect that saw nine out of the top ten poetic words coming from Old English.
I was intrigued to see ‘wr’ appear so high in the list, having recently seen Marc Alexander’s poster presentation at DH2012: ‘Sound Symbolism in English: Evidence from the Historical Thesaurus’, since it occurred to me then that words with a stronger than average correspondence between sound and sense might occasion usage in poems that might get the attention of the lexicographer. The other word-initial bigrams he looked at were ‘st’, ‘gr’, ‘fl’, and ‘sl’, which score above average but not phenomenally high on my list (ranks 27, 35, 52, 53; of 136). I’ll have to test out the basic idea on a poetry corpus. But I do wonder about some of my high scoring bigrams and whether and how they might score in terms of ‘sound symbolism’. In particular ‘sw’, ‘da’, ‘gl’, ‘br’, ‘bl’ and ‘sh’ seem to me possibly worth investigating.
Okay, on to trigrams and 4-grams. I’ll show the tops and bottoms of both lists. Some of this sheds extra light on the above discussion. Some may raise some extra questions.
Trigram ranges with >100 headwords:
And, finally, 4-gram ranges with >50 headwords in them: