As part of some computer housekeeping, I’ve made up a little inventory of the Python programs I’ve developed to have a peek inside the massive text file that is the Oxford English Dictionary, 2nd edition. Here’s a list:
1. Field Comparison Tool – Look for Intertextuality in poems
*I discussed this program in the second half of my presentation at Digital Humanities 2012 in Hamburg this summer. Slides etc. are here.
With a poem (or corpus of poems) as an input file, the program looks up every word in the OED and compares neighboring words in the poem to one of the OED’s fields (etymology, definition, or quotations). For matching words, the program then produces a set of metrics (density, Z-score, MI-score and T-score) to gauge the significance of the match. The collocation scores are calculated from a corpus of all of the OED’s quotations (17 million words), but any comparative corpus (e.g. COCA) can be used.
Sample output (using a Galway Kinnell poem):
Near= 10, Field = ET, Scope = A, MI option = Y, MI source = T
POS|WORD(HITS)[NEARNESS]|HITWORD|HTW Z COCA|PAIR MI|PAIR T
41|poems (1) [0.13] |poetic (-8.0)|0.03906111|4.623686|1.725025
45|publish (1) [0.14] |publication (7.0)|0.13329574|4.0547520|0.99398
47|publish (1) [0.2] |publication (5.0)|0.13329574|4.0547520|0.99398
52|publication (1) [0.14] |publish (-7.0)|0.03733089|4.05475|0.99398
53|auction (1) [0.13] |augere (8.0)|-0.0174118764557|-10.0|0
54|mind (1) [1.0] |man (1.0)|4.994058|1.552368|25.076352
55|man (1) [1.0] |mind (-1.0)|1.307456|1.552368|25.076352
57|professor (1) [0.2] |publication (-5.0)|0.1332957|2.96319|1.3960
59|auction (1) [0.5] |augere (2.0)|-0.01741187|-10.0|0
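For readers unfamiliar with the collocation metrics named above, here is a minimal Python sketch of the usual textbook formulas for MI-score, T-score and Z-score. It is not the code that produced the sample output: the function name and its arguments are illustrative, it ignores any window-size adjustment, and the tool's own definitions may differ.

```python
import math

def collocation_scores(pair_count, freq_a, freq_b, corpus_size):
    """Textbook MI, T and Z scores for one word pair (illustrative only)."""
    if pair_count == 0:
        return None  # no co-occurrence, so the log in MI is undefined
    # Co-occurrences expected if the two words were independent.
    expected = freq_a * freq_b / corpus_size
    mi = math.log2(pair_count / expected)
    t_score = (pair_count - expected) / math.sqrt(pair_count)
    z_score = (pair_count - expected) / math.sqrt(expected)
    return mi, t_score, z_score
```

Here `pair_count` is how often the two words occur near each other in the comparison corpus, `freq_a` and `freq_b` are their individual frequencies, and `corpus_size` is the total number of words.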
2. Mark Source for OED Quotes
For a given text, highlights all the passages in that text that are used as an illustrative quotation in OED2, and provides the headword(s) as mouse-over alt-text. So far has been run on Hamlet and Paradise Lost. Works with divergent editions.
HORATIO In what particular thought to work I know not; But in the gross and scope of my opinion, This bodes some strange eruption to our state.
[*Quote as it appears in OED2: “This boades some strange erruption to our State.”]
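One way to match a passage against an OED quotation despite spelling differences between editions is fuzzy string comparison. The sketch below uses `difflib` from the Python standard library. It is an assumption about how such matching could be done, not a description of the actual program, and the helper names are made up for illustration.

```python
import re
from difflib import SequenceMatcher

def normalize(text):
    """Lowercase and drop punctuation so spelling is the main difference left."""
    text = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return " ".join(text.split())

def similarity(passage, oed_quote):
    """Return a 0..1 similarity ratio between a text passage and an OED quote."""
    return SequenceMatcher(None, normalize(passage), normalize(oed_quote)).ratio()

edition = "This bodes some strange eruption to our state."
oed2 = "This boades some strange erruption to our State."
print(similarity(edition, oed2))
```

A passage would count as a match when the ratio clears some threshold; the right cut-off would have to be tuned by hand against known examples.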
3. Etymology Word Count
*I wrote this for a blog post about Languagehat’s discussion of the new etymology of “To be”.
Returns a rudimentary word count (counts spaces) for the etymology section of every headword in OED2.
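A space-counting word count can be sketched in a few lines of Python. The function name is hypothetical, and adding one to the number of spaces is my assumption about how the count is turned into a number of words.

```python
def rough_word_count(etymology_text):
    """Approximate the number of words by counting single spaces."""
    text = etymology_text.strip()
    if not text:
        return 0
    # n spaces separate n + 1 words; runs of spaces would inflate the count.
    return text.count(" ") + 1
```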
4. First quotation and Hapax Finder
For a given author (or text, or set of texts), searches OED2 for all the headwords that have that author or text listed as the first source in the quotation section, for all sense levels. Returns headword, sense level and number, definition, author, text and date. Also flags first uses in the entire entry, and flags when the quotation is a hapax (i.e. OED records only one supporting quotation), either for a sense section or for the entire entry.
HEADWORD|SENSELEVEL|SENSE|FIRSTINENTRY|DEF|AUTHOR|WORK|DATE|QUOT|TOTAL QUOTS IN SECTION|HAPAX
abbreviate, v.2|6|3.d.|N|Of words spoken or written, or symbols of any kind: To contract, so that a part stands for the whole. The common mod. use.|Shaks.|L.L.L.|1588|He clepeth a Calf, Caufe: Halfe, Haufe, neighbour vocatur nebour; neigh abreuiated ne: this is abhominable. |3|N
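The core check can be sketched in Python as follows. This is only an illustration: it assumes quotations are wrapped in `<Q>` elements with `<D>`, `<A>` and `<W>` fields, and that they appear in chronological order within a sense section, so the first `<Q>` is the earliest. The function names are hypothetical.

```python
import re

QUOTE_RE = re.compile(r"<Q>(.*?)</Q>", re.DOTALL)

def field(quote, tag):
    """Return the text of the first <tag>...</tag> in a quotation, or None."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", quote, re.DOTALL)
    return match.group(1).strip() if match else None

def first_use(section, author):
    """Report the section's first quotation if it belongs to `author`."""
    quotes = QUOTE_RE.findall(section)
    if not quotes or field(quotes[0], "A") != author:
        return None
    return {
        "author": author,
        "work": field(quotes[0], "W"),
        "date": field(quotes[0], "D"),
        "total_quots": len(quotes),
        "hapax": len(quotes) == 1,  # only one supporting quotation
    }

section = (
    "<Q><D>1697</D> <A>Dryden</A> <W>Virg. Georg.</W> "
    "<T>Pot-herbs..bruis’d with Vervain, were his frugal Fare. </T></Q>"
    "<Q><D>1762</D> <A>Goldsm.</A> <W>Cit. W.</W> "
    "<T>A frugal meal, which consisted of roots and tea. </T></Q>"
    "<Q><D>1783</D> <A>Crabbe</A> <W>Village</W> "
    "<T>The glad parish pays the frugal fee. </T></Q>"
)
print(first_use(section, "Dryden"))
```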
5. Rewrite the Dictionary
*I discussed the process for identifying and marking up the OED2’s quotations based on their genre in my presentation in Hamburg this summer. Slides are here.
Writes a new OED2 file, incorporating a <G> genre field for every quotation, with one of the following values: POETRY, VD (verse drama), OTHER, UNKN. About 55% of the quotations have been identified as belonging to one of the first three categories; the rest are marked UNKN.
<Q><D>1697</D> <A>Dryden</A> <W>Virg. Georg.</W> <sc>iv. </sc>194 <G>POETRY</G><T>Pot-herbs..bruis’d with Vervain, were his frugal Fare. </T></Q><Q><D>1762</D> <A>Goldsm.</A> <W>Cit. W.</W> xlvi. (1837) 267 <G>OTHER</G><T>A frugal meal, which consisted of roots and tea. </T></Q><Q><D>1783</D> <A>Crabbe</A> <W>Village</W> <sc>i. </sc>324 <G>POETRY</G><T>The glad parish pays the frugal fee. </T></Q>
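The tagging step itself can be sketched in Python as below. The hard part, deciding which genre a quotation belongs to, is not shown: the sketch simply assumes a lookup table keyed by author and work, which is a hypothetical stand-in for whatever identification process is actually used.

```python
import re

def add_genre(quote, genre_lookup):
    """Insert a <G> field before <T>, using an (author, work) -> genre lookup."""
    author = re.search(r"<A>(.*?)</A>", quote)
    work = re.search(r"<W>(.*?)</W>", quote)
    key = (author.group(1) if author else None,
           work.group(1) if work else None)
    genre = genre_lookup.get(key, "UNKN")  # unknown when nothing matches
    # Touch only the first <T> so the quotation text itself is left alone.
    return quote.replace("<T>", f"<G>{genre}</G><T>", 1)

lookup = {("Crabbe", "Village"): "POETRY"}  # hypothetical lookup entry
quote = ("<Q><D>1783</D> <A>Crabbe</A> <W>Village</W> <sc>i. </sc>324 "
         "<T>The glad parish pays the frugal fee. </T></Q>")
print(add_genre(quote, lookup))
```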
6. Date-Genre analysis
Using genre-marked OED2, counts number of quotations marked POETRY, VD, OTHER and UNKN for every year.
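A per-year tally like this can be sketched in Python with a `Counter` for each year. The sketch assumes the `<Q>`, `<D>` and `<G>` markup shown in the sample above, and takes the first three- or four-digit number in the date field as the year, which is my simplification.

```python
import re
from collections import Counter, defaultdict

QUOTE_RE = re.compile(r"<Q>(.*?)</Q>", re.DOTALL)

def genre_counts_by_year(marked_text):
    """Map each year to a Counter of genre labels found in that year's quotes."""
    counts = defaultdict(Counter)
    for quote in QUOTE_RE.findall(marked_text):
        date = re.search(r"<D>(.*?)</D>", quote)
        genre = re.search(r"<G>(.*?)</G>", quote)
        year = re.search(r"\d{3,4}", date.group(1)) if date else None
        if year and genre:
            counts[int(year.group())][genre.group(1)] += 1
    return counts
```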
7. Rank “Poetic Words” According to OED Quotes
Using genre-marked OED2, the program calculates the percentage of poetry and verse drama among the quotations for every entry, and also returns a COCA frequency for the headword (the highest COCA value, not lemmatized for POS).
8. Rank “Poetic Ranges” of the Alphabet
Using genre-marked OED2, program counts % of poetry and verse-drama quotations in alphabetical ranges, either bigram (e.g. all words starting ab, ac, ad, etc.) or trigram (abb, abd, abe, etc.).
RANGE|%POETRY+VD|NUM HW|NUM POETIC
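Grouping by alphabetical range can be sketched in Python as follows. The input format, a list of (headword, poetic quotation count, total quotation count) tuples, is an assumption made for illustration, as is reading the last output column as the number of poetry and verse-drama quotations in the range.

```python
from collections import defaultdict

def poetic_ranges(entries, prefix_len=2):
    """entries: iterable of (headword, poetic_quotes, total_quotes) tuples."""
    totals = defaultdict(lambda: [0, 0, 0])  # headwords, poetic, all quotes
    for headword, poetic, total in entries:
        bucket = totals[headword[:prefix_len].lower()]
        bucket[0] += 1
        bucket[1] += poetic
        bucket[2] += total
    rows = []
    for prefix, (num_hw, poetic, total) in sorted(totals.items()):
        pct = 100.0 * poetic / total if total else 0.0
        rows.append((prefix, pct, num_hw, poetic))
    return rows
```

Setting `prefix_len=3` gives the trigram version.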
9. Count up Genres of Nonce words
*This was partially an attempt to reproduce Giles Goodland’s work, which I discussed in my presentation in Hamburg.
Using genre-marked version of OED2, returns all words with only one supporting quotation, the date, author and title of the quotation, and the genre of the quotation if known.
huckery 1377 Langl. P. Pl. POETRY
huckler 1617 Assheton Jrnl. UNKN
hucksterage 1641 Milton Reform. OTHER
10. Count quotation field tag tokens
Actually a series of programs that look at the metadata about the quotations in OED2 and do things like count up tokens, etc. So for instance, one returns a count of all unique author names, one returns a count of all unique author names or work titles if no author is recorded (e.g. The Times), one returns a count of all author and title combinations, etc.
<A>L. V. W. Clark</A>|2
<A>L. V. de Sitter</A>|4
<W>Times Lit. Suppl.</W>|3532
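The second variant described above (author name, or work title when no author is recorded) can be sketched in Python with a `Counter`. It assumes each quotation sits in a `<Q>` element containing `<A>` and `<W>` fields; the function name is hypothetical.

```python
import re
from collections import Counter

QUOTE_RE = re.compile(r"<Q>(.*?)</Q>", re.DOTALL)

def count_sources(text):
    """Count author tags, falling back to the work tag when no author is given."""
    counts = Counter()
    for quote in QUOTE_RE.findall(text):
        source = (re.search(r"<A>.*?</A>", quote)
                  or re.search(r"<W>.*?</W>", quote))
        if source:
            counts[source.group()] += 1
    return counts

def report(counts):
    """Format the counts as TAG|COUNT lines, sorted by tag."""
    return [f"{tag}|{n}" for tag, n in sorted(counts.items())]
```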