Written by: Spencer Greenhalgh
[This post originally appeared on the MSU Digital Humanities blog]
As anticipated on this very blog, I recently spent a week in Indianapolis attending a workshop on computational text analysis at HILT 2016. We spent our time surveying a number of different tools, techniques, and concepts related to text analysis, so I walked away with a greater appreciation for data cleaning, Weka, HathiTrust, metadata, Python, and much more. The most frustrating part of the workshop was that we visited each topic so briefly and that we had so few opportunities to apply these techniques to our own work. I can’t fault the workshop organizers for these decisions—helping participants take a dozen wildly different datasets through deep dives into a particular technique would have been difficult—but I was excited enough by a lot of the concepts we covered that I was itching to try them out myself.
This was especially true of topic modeling, a technique for identifying different “topics” (or themes, or discourses, or…) in the documents of a particular corpus. As we tried out this technique on a corpus of slave narratives, I was amazed at how an algorithm was able to tease out what seemed to be clearly distinct themes within and across these narratives. One of our instructors warned us against being too impressed, explaining that the underlying math was actually really simple. He certainly had a point, and I know the importance of not being blindly wowed by what an algorithm seems to do. Still, refusing to see topic modeling as amazing because it really comes down to conditional probabilities seemed to me akin to refusing to recognize the wonder of the French language because, at its roots, it’s an arbitrary collection of mouth sounds.
That said, neither French nor topic modeling can be really useful or truly amazing for me unless I spend some time figuring out how it works. I went to HILT hoping to learn a couple of neat tricks, but I came away convinced that topic modeling could have some real value for me. Over the past few weeks, I’ve added a number of references to topic modeling to my notebook full of dissertation brainstorming scribbles. Over the next few months, I hope to learn more about the process, dig into the details, and make this a part of the work that I do.