I've been playing around with Greek text this weekend. This is going to seem a bit geeky (a bit?) but I need to give some background.
I was talking with Bob Pritchett a few weeks back. For some reason, the subject of automatic language recognition came up. Apparently one of the methods used involves compiling all consecutive three-character combinations as they appear in a given text (so, “I drove.” would have the strings 'I d', ' dr', 'dro', 'rov', 'ove', 've.') and then comparing how often each occurs against the known frequencies of three-character combinations in texts already known to be in the language in question. Apparently the success rate is fairly high for an automated procedure.
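To make the three-character idea concrete, here is a minimal sketch in modern JavaScript (Node.js or a browser is assumed). The function name `charTrigrams` is my own invention for illustration, and this only covers the counting step, not the comparison against reference frequencies for a language.

```javascript
// Count every run of three consecutive characters in a string.
// charTrigrams is a hypothetical helper name, not part of any library.
function charTrigrams(text) {
  const counts = new Map();
  // Array.from splits by code point, so characters outside the
  // Basic Multilingual Plane are not cut in half.
  const chars = Array.from(text);
  for (let i = 0; i + 3 <= chars.length; i++) {
    const gram = chars.slice(i, i + 3).join('');
    counts.set(gram, (counts.get(gram) || 0) + 1);
  }
  return counts;
}

// The example sentence from the paragraph above.
const grams = charTrigrams('I drove.');
console.log([...grams.keys()]);
// lists: 'I d', ' dr', 'dro', 'rov', 'ove', 've.'
```

A real language identifier would build one such frequency table per known language and score an unknown text against each; that scoring step is left out here.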
After thinking about it for a while, I became curious about combinations of words and authorship or author style. Many studies have examined word frequencies in the Pastoral Epistles and compared them to the so-called “genuine” Paulines, the Apostolic Fathers, and other things. P.N. Harrison did the definitive work in this area, analyzing the Pastoral Epistles in 1922 or so (The Problem of the Pastoral Epistles, see my Bibliography). Donald Guthrie responded to Harrison's work in a monograph published in 1956. But this all involved word frequencies. To my knowledge, nobody has really thought about phrase frequencies (NOTE: see Update III below). It would've been tough to do in the past, but with available electronic texts (see both James Tauber's site and Dr. Maurice Robinson's ByzTxt.com site; nb: byztxt.com no longer exists and now links to indecent and rude material. I prefer Tauber's data as it has casing, breathing marks, accents and lexemes), high-power processors and some programming skill, it seems like these sorts of things are coming into the realm of possibility.
So, I spent today writing some JavaScript (run via the Windows Script Host) to process James Tauber's MorphGNT data. Keeping track of all the possible combos takes a lot of memory and processing power, so for now I'm limiting myself to the Pastoral Epistles. I compared the three-word combinations on the basis of the lexeme (or “dictionary” form), not on the inflections. Each individual listing does have the actual inflected phrase provided separately so that one can see exactly what the match is.
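The general shape of that approach can be sketched as follows. This is not the original script: it is modern JavaScript rather than Windows Script Host JScript, it does not parse MorphGNT files, and the token shape `{ form, lemma }`, the names `lexemeTrigrams` and `repeatedPhrases`, and the toy input are all assumptions made for illustration.

```javascript
// Group three-word windows by their lexemes, keeping each inflected
// phrase so the actual match can be inspected afterwards.
// Assumed (hypothetical) token shape: { form: 'inflected', lemma: 'lexeme' }.
function lexemeTrigrams(tokens) {
  const index = new Map();
  for (let i = 0; i + 3 <= tokens.length; i++) {
    const window = tokens.slice(i, i + 3);
    const key = window.map(t => t.lemma).join(' ');     // compared on lexemes
    const surface = window.map(t => t.form).join(' ');  // kept for display
    if (!index.has(key)) index.set(key, []);
    index.get(key).push(surface);
  }
  return index;
}

// Keep only the lexeme phrases that occur more than once.
function repeatedPhrases(index) {
  return [...index].filter(([, hits]) => hits.length > 1);
}

// Toy input, NOT a real verse: one phrase deliberately repeated.
const toyTokens = [
  { form: 'πιστὸς', lemma: 'πιστός' },
  { form: 'ὁ', lemma: 'ὁ' },
  { form: 'λόγος', lemma: 'λόγος' },
  { form: 'καὶ', lemma: 'καί' },
  { form: 'πιστὸς', lemma: 'πιστός' },
  { form: 'ὁ', lemma: 'ὁ' },
  { form: 'λόγος', lemma: 'λόγος' },
];

const phraseIndex = lexemeTrigrams(toyTokens);
console.log(phraseIndex.size);             // distinct lexeme phrases
console.log(repeatedPhrases(phraseIndex)); // phrases seen more than once
```

Keying on the lexeme while storing the inflected form means differently inflected instances of the same phrase still land in one entry, and the stored forms show what was actually matched.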
The outcome? Of 3269 possible three-word combinations of adjacent words in the Pastoral Epistles, there are 55 that occur more than once. Some of them are meaningful (e.g. πιστὸς ὁ λόγος, “Faithful is the word”), others aren't. Who knows if this is significant; I'll need to get data from other books and devise a methodology to compare before I'm able to even think about conclusions.
After generating the data (I munged it into XML, of course) I whipped up a quick stylesheet to render the concordance as HTML so I could post it, as it seems like the sort of thing that might be handy for some folks. So, without further ado:
A Concordance of Three-Word Phrases in the Pastoral Epistles
There are some problems/caveats mentioned in the introductory note; please read it over. Also, for some reason I've not yet figured out, Firefox doesn't like my CSS stylesheet but IE does. So it'll look better in IE, at least for now.
If you have any ideas or feedback on the data, on the idea of examining phrase frequencies, suggestions for methodology once the data is compiled, or anything else to do with this I'm very interested to hear from you. Please feel free to drop a comment, post about it in your blog & trackback here, or just drop me an email.
Update: I noticed another small bug; it seems I didn't clear my phrase cache at the end of each book. So the phrase μεθ' ὑμῶν Παῦλος really doesn't occur; μεθ' ὑμῶν is at the end of one book, Παῦλος is at the start of another. Whoops.
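One way to avoid that kind of boundary bug is to empty the sliding window whenever the book changes. The sketch below is modern JavaScript, not the original script; the token shape `{ book, word }`, the function name `trigramsPerBook`, the placeholder book labels 'A' and 'B', and the toy data are assumptions for illustration.

```javascript
// Build three-word phrases without letting a window span two books.
// Assumed (hypothetical) token shape: { book: 'label', word: 'text' }.
function trigramsPerBook(tokens) {
  const phrases = [];
  let window = [];
  let currentBook = null;
  for (const { book, word } of tokens) {
    if (book !== currentBook) {
      window = [];          // clear the phrase cache at each book boundary
      currentBook = book;
    }
    window.push(word);
    if (window.length > 3) window.shift();
    if (window.length === 3) phrases.push(window.join(' '));
  }
  return phrases;
}

// Toy data: book 'A' ends where book 'B' begins.
const sampleTokens = [
  { book: 'A', word: 'χάρις' },
  { book: 'A', word: "μεθ'" },
  { book: 'A', word: 'ὑμῶν' },
  { book: 'B', word: 'Παῦλος' },
  { book: 'B', word: 'ἀπόστολος' },
  { book: 'B', word: 'Χριστοῦ' },
];

// Only phrases wholly inside one book are produced, so the
// cross-boundary phrase never appears.
console.log(trigramsPerBook(sampleTokens));
```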
Update II: Thanks for the clarification on the trigram stuff, Bob. (Now corrected above.) I remember that now that you say it. I think it's obvious that I was thinking about adjacent words since about the time you told me about the concept.
Update III: Stephen Carlson of Hypotyposeis fame links to me with a recent blog post. Apparently he did some similar work 9-10 years ago, and has had his results posted for a while in the form of a short article (complete with ASCII art!): Authorial Style in the New Testament. I'll have to go over his stuff and see if I can grok it, but I greatly appreciate the pointer. Thanks!
Update IV: More background on previous phrase studies. I checked my copy of Harrison's book (it's been a while since I read it) and note that his appendices on pp. 166-178 list phrases held in common between the Pastorals and other groups of books ('genuine' Paulines, Petrines, 1 Clement). He discusses them on pp. 87-93, though in his typical dismissive style, and the method isn't nearly as systematic as his examination of words.