Strike A Pose: Topic Modeling

For anyone who has taken an English, literature, or poetry course, the concept of close reading is nothing new. I certainly recall a class discussion of a passage from Shakespeare’s Othello, in which the professor pressed, “What do you see?”

Every student would look at the page and try to squeeze every last meaning from the words, the imagery, the sound. And still the professor would ask us again, “and what ELSE do you see?” Perspiration would form on our foreheads and make its way down to our aching, pressing eyes. When our voices rose from our parched throats, what came out was more of a question than a confident, affirmative answer.

“Is it about sex?”

[Image: quotation from Othello, “Even now, now, very now, an old black ram is tupping your white ewe.”]

It is always the go-to question when every other line of reasoning seems preposterous; at least we can count on one universal. But this familiar practice has some scholars wondering whether close reading is the only productive way to study literature. In fact, scholars such as Franco Moretti question whether this type of reading is practical or useful at all. Moretti advocates for what he calls distant reading: “understanding literature not by studying particular texts, but by aggregating and analyzing massive amounts of data” (Schulz). Moretti argues that close reading limits the scope of what can be learned because humans can only read so many books over a lifetime. The problem is compounded by the fact that certain books are considered classics while others fall into the literary abyss. If I want to study British women novelists over a span of 200 years, I might read four or five books on the subject and come to some conclusion based on that small sample. Moretti would say that the “true scope and nature of literature” (Schulz) cannot be determined in this fashion. On the other hand, if I were to take the text of EVERY novel by a British woman and run it through an algorithm, so that patterns start to emerge from the word sets I deem important, then I might get a more accurate picture.


This is the idea behind topic modeling, which uses distant reading platforms to produce data sets that can then be analyzed to further literary study. Robert K. Nelson in his “Introduction to Mining the Dispatch” writes, “Topic modeling… allows us to step back from individual documents and look at larger patterns among all the documents, to practice not close but distant reading” (1). What patterns emerge when looking at large data sets? What can the digital age uncover that was previously incomprehensible? Once we step back from the text and look not at one passage but at one passage of time, what new ideas start to emerge?

Take a distant reading of five different texts on distant reading: Kathryn Schulz’s “What is Distant Reading?”; Joshua Rothman’s “An Attempt to Discover the Laws of Literature”; Patricia Cohen’s “Analyzing Literature by Words and Numbers”; Ted Underwood’s “How Not to Do Things with Words”; and Mae Capozzi’s “Reading from a Distance.” By plugging the texts into distant reading and topic modeling tools, we can analyze the visual results to see what they tell us about these texts. Below is the visualization of these texts in Wordle:

wordle

The more often a word is used, the bigger it appears in the Wordle visualization. According to our data, the word reading is used most frequently, with Moretti coming in a close second. The other frequently used words are topic, books, literature, distant, literary, words, Humanities and India. It seems fitting that texts about distant reading should use the word reading quite frequently, so that doesn’t come as a surprise. The name Moretti appears many times, which tells us that he must be a prominent figure in the discourse around distant reading, and as we have discussed, he is more or less the grandfather of the term itself. If we were not familiar with the name Moretti, this visualization would point us in a direction of study, since so much of the discussion surrounds this man. Such quantitative analysis matters because the more often a topic comes up in whatever is being studied, the more likely it is that the topic held value for the writers. The inference, therefore, is that if the topic was important to the writers, then it was probably important to the people of that place and time.
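To make the counting behind a word cloud concrete, here is a minimal sketch in Python. It is not how Wordle itself is implemented; the stop-word list, the function name word_frequencies, and the sample sentences are all invented for illustration. It simply tallies how often each word appears across a set of texts, which is the number a word cloud turns into font size.

```python
import re
from collections import Counter

# Hypothetical stop-word list; real tools ship much longer ones.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "that", "it", "by"}

def word_frequencies(texts, stop_words=STOP_WORDS):
    """Count how often each word appears across all texts, skipping stop words."""
    counts = Counter()
    for text in texts:
        # Lower-case the text and keep runs of letters (and apostrophes) as words.
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in stop_words)
    return counts

# Toy stand-ins for the five articles (invented sentences, not quotations).
sample_texts = [
    "Moretti argues that distant reading reveals patterns.",
    "Distant reading is the study of literature by data.",
    "Reading closely and reading distantly both matter.",
]

frequencies = word_frequencies(sample_texts)
# The most frequent words are the ones a word cloud would draw largest.
for word, count in frequencies.most_common(3):
    print(word, count)
```

In this toy sample, reading comes out on top, just as it does in the visualization above, though the real articles would of course give different counts.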

Furthermore, it is no surprise that the terms distant, literary, digital, humanities, words, texts and literature came up frequently, since they are all words used in the discussion of distant reading, a product of the digital humanities, which is transforming the way we view words, texts and literature. The one word that might throw off the research is India. If we hadn’t taken such a small sample, the word might have thrown us a curveball, and we might have deemed it significant based on our results. The truth, however, is not very exciting. Mae Capozzi’s “Reading from a Distance” happens to contain an entire section on topic modeling that uses the British East India Company as its example. It does, however, shine a light on how big-picture data still needs a close reading to be certain that the information uncovered is valid.
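One way to catch a word like India before it misleads us is to check how many of the texts actually contain it. The sketch below is only an illustration under my own assumptions: the function name document_spread, the threshold of three uses, and the sample texts are all made up. It flags words that are frequent overall but confined to a single document.

```python
import re
from collections import Counter

def document_spread(texts):
    """For each word, count how many of the texts contain it at least once."""
    spread = Counter()
    for text in texts:
        # A set keeps each word once per document, however often it repeats.
        spread.update(set(re.findall(r"[a-z]+", text.lower())))
    return spread

# Invented sample texts; the first one repeats a single word many times.
sample_texts = [
    "India India India trade company India",
    "reading literature from a distance",
    "reading topics in literature",
]

spread = document_spread(sample_texts)
total = Counter(w for t in sample_texts for w in re.findall(r"[a-z]+", t.lower()))

# Words used at least three times overall (an arbitrary cutoff) but found
# in only one document deserve a closer look.
suspects = [w for w, n in total.items() if n >= 3 and spread[w] == 1]
print(suspects)
```

A flagged word is not necessarily meaningless; the check only tells us where to go back and read closely.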

Now let’s plug the same information into another distant reading and topic modeling tool called Tagxedo. Here is the visualization of these texts:

Tagxedo

As you can see, this program creates a word cloud as well, and just as in Wordle, the more frequently a word is used, the larger it appears in the visual. The most frequent word was Reading without a doubt, with Moretti once again coming in second. This time, however, the word India is quite small. The other significantly used words are Topic, Literature, books, Humanities, Distant, and Digital. If both visualizations are word clouds built from the same texts, why are they so different? Why is Topic in this graph the same size as Moretti, and why are Literature and Humanities so large in comparison to India? One thing I’ve noticed is that Wordle counts a word that starts with a capital letter and the same word in lower case as two separate words. It also counts topic and topics as two different words. What a tool recognizes as a word therefore influences its results. It is important, then, when examining such data sets, to look for these details, which might significantly skew our findings.
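The effect of these counting rules is easy to demonstrate. The following sketch does not reproduce the internals of Wordle or Tagxedo, which I have not seen; the two functions and the token list are hypothetical. It counts the same words twice, once exactly as written and once after folding case and trimming a trailing s, and shows that the top word changes.

```python
from collections import Counter

def count_raw(tokens):
    """Count tokens exactly as written, so 'Topic' and 'topic' stay separate."""
    return Counter(tokens)

def count_normalized(tokens):
    """Fold case and strip a trailing 's' (a deliberately crude plural rule)."""
    normalized = []
    for token in tokens:
        word = token.lower()
        # Assumption: a naive rule; it would wrongly trim words like 'analysis'.
        if word.endswith("s") and len(word) > 3:
            word = word[:-1]
        normalized.append(word)
    return Counter(normalized)

# Invented token list for illustration.
tokens = ["Topic", "topic", "topics", "Moretti", "Moretti", "India"]

raw = count_raw(tokens)
merged = count_normalized(tokens)

print(raw.most_common(1))     # top word when every spelling is counted separately
print(merged.most_common(1))  # top word after case folding and plural trimming
```

Counted as written, Moretti leads because the three spellings of topic split the vote; once they are merged, topic takes first place. The same text can therefore yield two different-looking clouds.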

Another interesting tool is Google Ngrams, a distant reading program that graphs the words or phrases the researcher chooses. Here I chose the words Literature, Distant, Reading, Moretti, Digital, Humanities, Topic and Literacy. The graph shows how frequently the words have been used in Google’s own database of texts from 1800 to 2015. As you can see, reading and literature have been talked about in texts more and more frequently. Digital and humanities have only been on the rise since the 1960s. Also, the words Moretti and distant seem to follow a similar curve, suggesting that they are part of the same discourse. Word clouds are great for discovering which words are important; they then give you the words to use in other topic modeling tools, which can lead to a broader and much richer understanding of whatever is being researched.

Google Ngram
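The idea behind a graph like this can be sketched in a few lines. This is not Google’s code, and the corpus below is invented; I am assuming the simple approach of dividing the number of times a word appears in a given year by the total number of words from that year, so that years with more surviving text do not automatically look more important.

```python
import re
from collections import defaultdict

def relative_frequency_by_year(dated_texts, word):
    """Return {year: share of all words in that year that equal `word`}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for year, text in dated_texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        totals[year] += len(tokens)
        hits[year] += tokens.count(word.lower())
    # Skip years with no words at all to avoid dividing by zero.
    return {year: hits[year] / totals[year] for year in sorted(totals) if totals[year]}

# Invented mini-corpus of (year, text) pairs, for illustration only.
corpus = [
    (1950, "the humanities study literature and history"),
    (2010, "digital humanities study digital texts"),
    (2010, "digital tools count words"),
]

result = relative_frequency_by_year(corpus, "digital")
print(result)
```

Plotting those yearly shares for several words on one chart gives the kind of rising and falling curves seen in the graph.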

Integrating these tools in a classroom can aid in learning about literature in a broader sense. While close reading looks at a text under a microscope, distant reading takes a step back to look at the larger picture. Students can learn a great deal from projects that integrate both distant and close reading. It might even be interesting to make a claim based on one or two novels and then test that theory using a distant reading tool. Word clouds such as Wordle are also great for editing papers. If a word cloud of a student’s paper shows that like and really are the two most frequently used words, then some editing may be in order.

These tools tell us that distant reading is a type of analysis that is best done in conjunction with close reading. While it can uncover intriguing and thought-provoking information for analysis, it is in no way a replacement. Both have their value in our continued research and understanding of literature, history and human existence. Where one is weak the other is strong, and they complement each other to achieve the same end: knowledge. If we see the digital humanities as adding to our capacity to understand rather than as a sign of our humanistic decline, then maybe they aren’t so bad after all.

Works Cited:

Capozzi, Mae. “Reading from a Distance.” Blog. https://readingfromadistance.wordpress.com/.

Cohen, Patricia. “Analyzing Literature by Words and Numbers.” The New York Times. 3 Dec. 2010. Web.

Nelson, Robert K. “Introduction to Mining the Dispatch.”

Rothman, Joshua. “An Attempt to Discover the Laws of Literature.” The New Yorker. 20 Mar. 2014. Web.

Schulz, Kathryn. “What is Distant Reading?” The New York Times Sunday Book Review. 24 June 2011. Web.

Underwood, Ted. “How Not to Do Things with Words.” Blog. 25 Aug. 2012. Web.

One thought on “Strike A Pose: Topic Modeling”

  1. So what do you think is the ‘true scope and nature of literature’? How might distant reading reveal that in a way close reading does/can not? 🙂
