Written by: Spencer Greenhalgh
Primary Source: Spencer Greenhalgh
I’ve written previously about my exploration of a dataset of tweets I collected during the April 2016 General Conference of the LDS (Mormon) Church. Over the past couple of months, I’ve been trying to see whether topic modeling—a form of automated text analysis that identifies topics (or themes) in a corpus of texts—could be a useful technique for discovering themes present in these tweets.
I’m tempted to compare my personal debate between automated analysis and hand coding to the digital humanities distinction between close reading and distant reading (see, for example, Ted Underwood’s comments here), but the more I think about that comparison, the less it holds up. Underwood (in the interview I just linked to) describes distant reading as “the new perspective literary historians get by considering thousands of volumes at a time,” and that’s arguably the perspective that researchers in my field gain through both hand coding and automated analysis. That is, the research I do, whether I use automated methods or not, is almost always dedicated to finding general trends across large numbers of data points rather than to the more traditional humanist practice (if I understand it correctly) of deep dives into a single work or text.
That said, another of Underwood’s comments did resonate with my own consideration of which methods to use to examine tweets. When asked to comment on the relationship between close and distant reading, Underwood explains that they supplement each other nicely, in that close readings “help readers understand a long trend.” This echoes my own experience with developing a topic model for #ldsconf tweets: interpreting individual topics nearly always required that I look at individual tweets, and often also that I read over some of the talks being referenced in those tweets. It’s not a perfect comparison, but it’s a useful reminder that human understanding is an important guide for computational analysis of data.
So, before getting to the fun stuff, it’s important that I acknowledge how I did things. I based most of my code on Jockers’s Text Analysis with R for Students of Literature, including using the mallet package for topic modeling. I used all of the #ldsconf tweets that I collected last time, though after eliminating retweets, I seem to have gotten 24,329 original tweets (as compared to the 24,958 I reported last time). That’s something I’ll have to figure out before going further with this. I also haven’t taken the time to figure out what to do about another problem I mentioned in my last post—in short, my Twitter tracker broke down during the General Women’s Session of the Conference, so that session is underrepresented in the current analysis (as are, consequently, female speakers and participants).
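To make the retweet-elimination step concrete, here is a minimal sketch in Python (the analysis itself was done in R). It assumes that retweets can be recognized by text beginning with "RT @", which is only a heuristic; the function names and sample tweets are hypothetical and not taken from the actual dataset or code.

```python
def is_retweet(text):
    """Heuristic (an assumption): a retweet's text starts with 'RT @'."""
    return text.lstrip().startswith("RT @")


def keep_originals(tweets):
    """Return only the tweets that the heuristic does not flag as retweets."""
    return [t for t in tweets if not is_retweet(t)]


# Made-up sample tweets, for illustration only.
sample = [
    "What a beautiful hymn #ldsconf",
    "RT @someone: What a beautiful hymn #ldsconf",
    "  RT @another: Great talk #ldsconf",
    "ART @ the museum today",
]
originals = keep_originals(sample)
```

A different detection rule (for example, a retweet flag supplied by the collection tool, if one exists) could yield a different count of original tweets, so it is worth recording which rule was used.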
I’ve also done some basic data cleaning, removing all instances of #ldsconf, all links, and all punctuation other than apostrophes, hash signs, and at-signs from the tweets. This is another area where I need to up my game. For example, despite a couple of different tries, I haven’t managed to remove #ldsconf from a tweet while leaving #ldsconference intact. As a result, there are a few orphaned “erence” fragments floating around in the text, and that’s a pain.
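One way the cleaning steps above might be sketched is with regular expressions, using a word boundary so that #ldsconf is removed only when it stands alone as a hashtag. This is a Python illustration rather than the R code behind this post; the function name `clean_tweet`, the specific patterns, and the sample tweet are all assumptions.

```python
import re


def clean_tweet(text):
    """Illustrative cleaning pass; the patterns are assumptions, not the original code."""
    # Remove links first, before their punctuation is stripped.
    text = re.sub(r"https?://\S+", "", text)
    # Remove #ldsconf only when it is a whole hashtag: \b requires that the
    # next character is not a letter, digit, or underscore, so
    # #ldsconference is left untouched.
    text = re.sub(r"#ldsconf\b", "", text, flags=re.IGNORECASE)
    # Drop every character except word characters, whitespace,
    # straight apostrophes, hash signs, and at-signs.
    text = re.sub(r"[^\w\s'#@]", "", text)
    # Collapse the runs of whitespace left behind by the removals.
    return " ".join(text.split())


# Made-up example tweet.
cleaned = clean_tweet(
    "Loved this talk! #ldsconf #LDSconference https://t.co/abc123 @user's"
)
```

Note that this pattern keeps only straight apostrophes; curly apostrophes would be stripped unless they were added to the character class.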
So, in short, I’ve gotten it working well enough to try out, but there’s a lot of work to do before this is really respectable. (I’m also leaving out a number of methodological considerations for brevity’s sake…)
For the time being, I’ve asked the code to come up with 50 topics. That might be overdoing it, but previous fiddling has shown that 10-15 (the original range I was anticipating) was definitely not enough. I’d like to highlight about 5 of these topics (again, limiting myself for brevity’s sake), suggest some interpretation, and comment on implications for this as a research method for my future Twitter work. I’ll introduce each topic with a wordcloud that I used to help with interpretation (Jockers’s idea, not mine).
Topic 1: Spanish
This topic largely consists of tweets in Spanish, and my knowledge of Spanish isn’t enough to know whether this could be several Spanish-language topics lumped together (i.e., that the code simply dumps all Spanish tweets in the same bucket rather than making distinctions between them).
Topic 2: ???
This is actually a tougher one to interpret. I’m pretty sure that “amp” refers to ampersands that are being represented as HTML code and are thus not being properly removed. This is another thing that I’ve tried—but so far failed—to fix. The rest of the key words seem vaguely similar, but if I look at specific tweets matching this topic, it’s a little hard to fit them together. I’ll pass on this one for now.
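If the “amp” tokens really do come from ampersands encoded as HTML entities, one possible fix is to decode the entities before stripping punctuation. The sketch below is in Python rather than R and uses the standard library’s `html.unescape`; the helper name `strip_punctuation` and the sample text are hypothetical.

```python
import html
import re


def strip_punctuation(text):
    """Keep word characters, whitespace, apostrophes, hash signs, and at-signs."""
    stripped = re.sub(r"[^\w\s'#@]", "", text)
    return " ".join(stripped.split())


# Made-up tweet text with an HTML-encoded ampersand.
raw = "Faith &amp; works #ldsconf"

# Stripping punctuation directly removes '&' and ';' but leaves 'amp' behind.
without_decoding = strip_punctuation(raw)

# Decoding first turns '&amp;' into '&', which is then removed as punctuation.
with_decoding = strip_punctuation(html.unescape(raw))
```

The order matters here: once the punctuation has been removed, the leftover “amp” can no longer be distinguished from an ordinary word.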
Topic 3: Womanhood
This topic appears to focus primarily on a talk on womanhood by Neill F. Marriott, with other parts of the General Women’s Session (where Marriott spoke) bleeding in as well.
Topic 4: Music
This topic is clearly related to music, including mentions of the Mormon Tabernacle Choir, other choirs that sang during the conference, and even specific songs.
Topic 5: Hashtag Soup
This is a really interesting topic in that it doesn’t pick up on a common subject so much as a common practice. I call this hashtag soup—the practice of appending as many hashtags as possible to a tweet in the hopes of expanding one’s audience. I’m guessing that a lot of these tweets broadcast their message via an attached picture and use the text space for the “soup.”
These five topics represent pretty well how I feel at this point about topic modeling for tweets. On the one hand, it’s identifying some really interesting stuff, including topics based on single talks (e.g., Marriott’s on womanhood), recurring themes (e.g., comments on the choir throughout the conference), and even distinct practices (e.g., using “hashtag soup” to amplify a message). These are compelling enough to me that I’m optimistic about the possibility of regularly using these methods for Twitter research.
That said, there is clearly a lot of work to be done. The second topic seems to be suffering from a lack of proper data cleaning (oops), and I’m sure there are other topics that won’t have a clear interpretation. So, I’m optimistic, but I also know there are questions to answer and problems to tackle before the real success stories come out.