Questions for statisticians

– Can literary fiction (say, a Shakespeare play) be distinguished from commercial fiction (say, a John Grisham novel) on the basis of their word distributions, that is, which words are repeated, in what ways, and how many times in each genre?

– A related question: will a work of literary fiction (a “classic”) have more “repetitions” than a work of commercial fiction?

– How are we to distinguish words that are repeated thematically (‘nothing’ in Shakespeare) from words that are repeated out of poor writing or for some other reason? Is there a need to make such a distinction?

– Given x words (the “author’s vocabulary”, or “all the vocabulary the author is known to have used in print”) and y words (“the book”, that is, the number of words in the author’s book), can we make an informed guess about how many repetitions that work might contain?

– Do people with larger vocabularies repeat words more or less often than people with smaller vocabularies, or about the same?

– Do early English literary writers (Shakespeare) repeat themselves more than late English literary writers (Joyce)? How does that compare to the trend in, say, non-literary epistolary writing over the same period?

– How about across cultures as well as times? Who makes more use of repetition: Virgil, Homer, or Shakespeare? How do the repetitions in a literary work compare to those in a legal document, or to those in a collection of letters by a college-aged student?

– How about with respect to speech? Do we repeat ourselves more when we speak or when we write? Does Philip Roth repeat words more frequently when he speaks or when he writes?

– Suppose literary word repetitions (‘Nothing’ in King Lear) don’t indicate a ‘deeper meaning’. What else might such repetitions indicate? Is repetition a rhetorical device, a natural consequence of writing with some purpose in mind, or something else? If I were to write down eighty words at random, would they contain more repetitions than a sonnet of Petrarch with around the same word count?
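
Several of these questions can at least be made concrete with a small experiment. The sketch below is only one way to do it, and every definition in it is my own assumption: a “repetition” is taken to be any occurrence of a word after its first, the random baseline draws each word uniformly from the vocabulary (real language is far from uniform, so this is a crude lower bound rather than a model of writing), and the vocabulary size of 20000 is a placeholder, not an estimate for any author.

```python
import random
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into words, keeping internal apostrophes."""
    return re.findall(r"[a-z]+(?:['’][a-z]+)*", text.lower())


def repetition_stats(words):
    """Count tokens, distinct words (types), and repetitions in a word list."""
    counts = Counter(words)
    n_tokens = len(words)
    n_types = len(counts)
    return {
        "tokens": n_tokens,
        "types": n_types,
        # Assumption: every occurrence after a word's first is one repetition.
        "repetitions": n_tokens - n_types,
        # Up to three most frequent words, keeping only those used more than once.
        "most_repeated": [(w, c) for w, c in counts.most_common(3) if c > 1],
    }


def expected_repetitions_uniform(vocab_size, n_words):
    """Expected repetitions if n_words are drawn uniformly from vocab_size words.

    Each vocabulary word is missed by all draws with probability
    (1 - 1/vocab_size) ** n_words, which gives the expected number of types.
    """
    expected_types = vocab_size * (1 - (1 - 1 / vocab_size) ** n_words)
    return n_words - expected_types


def random_baseline(vocabulary, n_words, seed=0):
    """Draw n_words uniformly at random (with replacement) from a vocabulary."""
    rng = random.Random(seed)
    return rng.choices(vocabulary, k=n_words)


line = "Nothing will come of nothing"
stats = repetition_stats(tokenize(line))
print(stats)

# Hypothetical numbers: 80 words drawn from a 20000-word vocabulary.
print(expected_repetitions_uniform(vocab_size=20000, n_words=80))
```

Running `repetition_stats` on a sonnet and on the output of `random_baseline` for the same word count would give a first, rough answer to the eighty-words question. A fairer baseline would draw words in proportion to how often they actually occur in the language, which this sketch does not attempt.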

