Using DfR and Python to Prep Topic Modeling Data

Written by: Bobby Smiley

Primary Source: Digital Scholarship Collaborative Sandbox

Probably one of the more au courant methods in digital humanities scholarship is topic modeling, explanations of which range from the math-y and abstruse to the pictures and simple language approach. Briefly, topic modeling is one way of algorithmically reading texts at scale, and the bigger the number of texts, the better. In my previous draft of this post, I labored over a tortured explanation of topic modeling, striving to be lucid, but failing mightily. Fortunately, an issue of the Journal of Digital Humanities dedicated to topic modeling makes this task much simpler. As one of the contributors, Megan Brett, punchily puts it:

Topic modeling is a form of text mining, a way of identifying patterns in a corpus. You take your corpus and run it through a tool which groups words across the corpus into “topics”

I had originally thought of a complete open-housing of the sausage making process, a sort of keen initiate’s explanation of topic modeling cum narration of my research using this method. But as that was becoming unwieldy and entirely tl;dr (even more so than what follows), I decided to truncate this post to focus on the most novel, but ostensibly least sexy aspect of my process: data preparation. As with so many things in life, the Pareto principle applies here in force, with the bulk of the time at the outset being hoovered up not by hermeneutic activity, but by data cleaning.

Outside library-landia, my research tracks changes in historiography in American religious/Church history (for me the solidus is telling) over the twentieth century. In the past, my approaches sourced material from the lecture notes of Sydney Ahlstrom, a Yale professor of Religious Studies, and arguably one of the most important figures in mid C20 American religious historiography (these notes were the basis of my M.A. thesis), or by nominating a clutch of articles by important authors from selected journals in the field. Both approaches illumined much, but this, at best, was anecdotal. And here’s where topic modeling becomes a helpful analytical alternative.

To start a topic modeling project, you begin with a corpus of material, and, as indicated earlier, the larger, the better. For me, this meant turning to the leading journal in the field, Church History, to get a diachronic sense of how historiographical changes could be gauged, or differences in approach could be descried. But how to get the data? Certainly, downloading hundreds of articles would be impractical, not to mention annoying and probably rage inducing. But thanks to Andrew Goldstone, an English professor at Rutgers who uses topic modeling in his research, I learned that the JSTOR service, DfR (Data for Research), would enable me to request specified articles from specified runs of specified journals. Indeed, all JSTOR’s principal metadata elements for material can be faceted out and used to construct elaborate searches. The image below is both easier than typing out those elements, and gives you a sense of the interface.

When requests are submitted, what’s retrieved and compiled is a series of either XML or CSV files for each article stipulated in the request. (I selected CSV, which is better for topic modeling.) DfR provides bigrams, trigrams, quadgrams, word frequencies, as well as keywords (which, I learned from JSTOR, are generated from topic modeling each article). In my case, the results furnished me with just under 1000 articles (DfR caps unrandomized requests at a thousand). Fantastic, right? But what to do with various types of information derived from 900 plus articles?

Ignoring syntax altogether, topic modeling works by treating each document in a corpus (it is often said) as a “bag of words,” which are sorted into a pre-determined number of buckets (topics) picked by the researcher. But what DfR returned me wasn’t a bag of words, but instead various iterations of word/term frequencies. Potentially useful at first blush, but somewhat useless for this exercise. Or maybe not? Here another HT to Andrew Goldstone, who, explaining the work he and Ted Underwood did for their research, planted an idea I decided to run with: DfR might not give me the article as it appeared originally, but it does give the article in toto, albeit obliquely.
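To make the “bag of words” idea concrete, here is a minimal Python sketch. The function name and the sample sentences are made up for illustration; the point is only that once a text is reduced to word counts, the order of the words no longer matters.

```python
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count each word.

    Word order is discarded; only the counts survive.
    """
    return Counter(text.lower().split())


# Two made-up "documents" containing the same words in a different order.
original = "the church shaped the history of the nation"
shuffled = "nation the of history the shaped church the"

# Both reduce to the same bag of words.
print(bag_of_words(original) == bag_of_words(shuffled))
print(bag_of_words(original)["the"])
```

This is why word frequencies alone can stand in for the article: a list of counts carries the same information as the bag.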

The file containing word frequencies is a list of all the words in the article, but in an entirely unusable arrangement:

The goal, then, is to transform those data into words that match those values (e.g., the, the, the x212). Because topic modeling (or more precisely the most popular algorithm used, LDA, or Latent Dirichlet Allocation) doesn’t need proper word order (that “bag of words”), it doesn’t matter that the transformed table would be basically unreadable, and resemble, charitably, aggressively boring found poetry; what matters here is that every word in that list for that article is present, repeated as many times as it occurs. But how to do this?

Here’s where Devin (who seriously deserves second author credit should I publish anything from this endeavor) was able to show me how such a thing could be done using Python. The first step was to order data into a legible form, removing “WORDCOUNTS,WEIGHT,” and placing the words and values in a format that could be more easily multiplied. The following bit of code accomplished that:
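What follows is not Devin's script, only a minimal sketch of what a cleaning step of this kind might look like. It assumes each DfR word-count file is a two-column CSV whose first row is the “WORDCOUNTS,WEIGHT” header and whose remaining rows are word,count pairs; the function name and the sample rows (apart from the 212 count for “the”) are hypothetical.

```python
import csv
import io


def read_word_counts(lines):
    """Return a list of (word, count) pairs from DfR-style CSV lines.

    Assumption: rows look like "word,count". The header row and any
    malformed rows are skipped because their second field is not a number.
    """
    pairs = []
    for row in csv.reader(lines):
        if len(row) != 2:
            continue  # not a word,count row
        word, weight = row[0].strip(), row[1].strip()
        if not weight.isdigit():
            continue  # drops the "WORDCOUNTS,WEIGHT" header
        pairs.append((word, int(weight)))
    return pairs


# Made-up sample standing in for one article's word-count file.
sample = io.StringIO("WORDCOUNTS,WEIGHT\nthe,212\nchurch,57\nhistory,43\n")
print(read_word_counts(sample))
```

For real files, the same function could be handed an open file object instead of the in-memory sample, for example one opened with `open(path, newline="")`.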

For someone relatively uninitiated in this area, working with Devin was a gentle education in the art of assembling code: the precision and inexorable logic enjoin a different way of thinking, something incredibly helpful not only for an exercise like this, but for a humanities person to appreciate and possibly emulate where appropriate.

Once the script is run, crawling over 900 articles, the CSV files are cleaned and sorted into a format that can be multiplied for topic modeling:

Using the following script, Devin showed me how the above could be transformed from its current configuration …

… into this
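Again as an illustration rather than the script Devin wrote, the “multiplication” step might be sketched like this in Python. It assumes the cleaned data are available as (word, count) pairs; the function name and the small counts in the example are hypothetical.

```python
def expand_counts(pairs):
    """Repeat each word as many times as its count and join with spaces.

    The result is unreadable as prose, but it is a complete bag of words
    for the article, which is all a topic model needs.
    """
    repeated = []
    for word, count in pairs:
        # Extend the running list with `count` copies of the word.
        repeated.extend([word] * count)
    return " ".join(repeated)


# Made-up counts, kept small so the output is easy to read.
print(expand_counts([("the", 3), ("church", 2), ("history", 1)]))
# prints: the the the church church history
```

Writing each expanded string out to its own plain-text file would then give one pseudo-document per article, ready to feed to a topic modeling tool.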

Success! Now, go topic model, Smiley.

Bobby Smiley
Bobby Smiley is the Digital Scholarship and American History Librarian at the Michigan State University Libraries. He received his library science degree and a certificate in digital humanities from the Pratt Institute, an M.A. in Religion from Yale, and an undergraduate degree from the University of Wisconsin-Madison. Before joining the MSUL in November 2013, Bobby was a reference and digital projects intern at the Columbia University Libraries, and had previously worked at the University of Chicago Libraries. At MSU, he collaborates on grant funded digital projects, works with campus partners to coordinate training and instruction for digital humanities workshops, helps build the library's collection of humanities data, and liaises with the History department. His research interests include tracking historiographical trends in American religious history, methods in digital humanities, and exploring how digital humanities and academic librarianship can be usefully conjoined.