Lessons learned when Web scraping #GorafiESR tweets

Written by: Spencer Greenhalgh

Primary Source:  Spencer Greenhalgh

I’ve posted in the past about Web scraping Twitter user profiles, but last week I took some time to tackle something else I’d been thinking about: scraping the tweets themselves. Web scraping tweets is a nifty trick, but its applications aren’t obvious right off the bat.

I wound up doing it because another French hashtag caught my eye: #GorafiESR. Le Gorafi (which I’ve posted about before) is a French satirical news source equivalent to The Onion. Recently, it launched the Madame Gorafi spinoff, which inspired one French academic to propose a Higher Ed and Research (Enseignement Supérieur et Recherche, or ESR) edition of the periodical instead. The hashtag went viral, with French academics weighing in to suggest increasingly preposterous headlines.

This was too good to pass up, so I’ve started working with my colleague Sarah Gretter to dive into these tweets and find out what people are saying. We have a Twitter Archiving Google Sheet (TAGS) set up, which makes it easy to collect tweets and gather some basic information on them.

However, there are a lot of tweets here! Even after limiting our collection to the first 24 hours of the hashtag and filtering out retweets, we still had over 2,800 distinct tweets to look at. It’s not uncommon to work with larger collections of tweets, but that usually relies on automated methods; since we’ll be reading these tweets ourselves, we agreed it would be nice to have an automated way of flagging the most important tweets so we could look at those first.

One way to do that would be to judge how many likes and retweets each of these original posts got: presumably, the tweets that got the most attention are the ones worth looking at first, and we could work through the rest afterward. Our TAGS collector doesn’t track likes and retweets. Agarwal’s Twitter Archiver does, but (as far as I know) only at the moment a tweet is logged, which is awkward timing that risks undercounting likes and retweets that arrive later. I have no doubt that you can get these numbers through the Twitter API, but the API limits the number of requests you can make in a given window of time.
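Once like and retweet counts are in hand (from whatever source), the triage itself is simple. Here’s a minimal sketch; the field names (“likes”, “retweets”) are illustrative, not actual TAGS or API column names:

```python
# A minimal sketch of triaging tweets once engagement counts are collected.
# The field names ("likes", "retweets") are illustrative, not real column names.

def rank_tweets(tweets):
    """Sort tweets so the highest-engagement ones come first."""
    return sorted(tweets, key=lambda t: t["likes"] + t["retweets"], reverse=True)

sample = [
    {"text": "headline A", "likes": 12, "retweets": 3},
    {"text": "headline B", "likes": 240, "retweets": 95},
    {"text": "headline C", "likes": 40, "retweets": 8},
]

for tweet in rank_tweets(sample):
    print(tweet["text"])  # headline B, then C, then A
```

Summing likes and retweets is the crudest possible importance score, but it’s enough to decide reading order.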

So, what do we do? Web scrape.

Here’s a link to the code that I used to Web scrape our tweets. The code skips any inaccessible tweets (which, I’ve learned, can happen in at least two ways: suspended accounts and deleted tweets), so there shouldn’t be any problems there. Plus, in addition to counting likes and retweets, I tweaked the code so that it would also grab:

  • the text of the tweet,
  • the Twitter handle of the user who sent the tweet, and
  • the UNIX timestamp for the tweet.
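I haven’t reproduced the linked code here, but a rough sketch of the parsing step might look like the following. The markup, class names, and attributes below are stand-ins; Twitter’s HTML changes often, so treat every pattern as an assumption to verify against a freshly saved tweet page:

```python
import re

# Hypothetical, simplified markup standing in for one saved tweet page;
# the class and attribute names are assumptions, not Twitter's real HTML.
page = """
<div class="tweet" data-screen-name="FakeAcademic" data-time="1482427800">
  <p class="tweet-text">Study finds study findings found wanting</p>
  <span class="action--favorite" data-count="240"></span>
  <span class="action--retweet" data-count="95"></span>
</div>
"""

def scrape_tweet(html):
    """Pull the low-hanging metadata out of one tweet's markup."""
    def grab(pattern):
        return re.search(pattern, html, re.S).group(1)
    return {
        "handle": grab(r'data-screen-name="([^"]+)"'),
        "timestamp": int(grab(r'data-time="(\d+)"')),
        "text": grab(r'<p class="tweet-text">(.*?)</p>').strip(),
        "likes": int(grab(r'favorite"\s+data-count="(\d+)"')),
        "retweets": int(grab(r'retweet"\s+data-count="(\d+)"')),
    }

print(scrape_tweet(page))
```

In real scraping code you’d want a proper HTML parser (BeautifulSoup, say) rather than regular expressions; the regexes here just keep the sketch dependency-free.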

Not all of this will be helpful in all situations, but I tried to grab most of the “low-hanging metadata.” I think the UNIX timestamp will be particularly helpful: the date and time that Twitter displays for a tweet depend on, among other things, the time zone you’re in, but UNIX time counts seconds since 1 January 1970 UTC and so pins down both date and time unambiguously, which helps keep numbers straight if you’re collecting across time zones. It may be possible to grab some more advanced stuff, like replies, but that’s for another day.
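To illustrate the time-zone point, here’s how one UNIX timestamp (made up for the example) resolves to different clock readings without any ambiguity about the underlying instant:

```python
from datetime import datetime, timedelta, timezone

ts = 1482427800  # an illustrative UNIX timestamp (seconds since 1970-01-01 UTC)

# The same instant rendered two ways: the clock reading differs by time
# zone, but the underlying number does not.
utc_time = datetime.fromtimestamp(ts, tz=timezone.utc)
offset_time = utc_time.astimezone(timezone(timedelta(hours=1)))  # e.g. Paris in winter

print(utc_time.isoformat())     # 2016-12-22T17:30:00+00:00
print(offset_time.isoformat())  # 2016-12-22T18:30:00+01:00
```

Storing the raw timestamp and converting only at display time is what keeps a multi-time-zone collection consistent.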

Like I said in the beginning, applications of tweet-scraping aren’t as obvious as applications of profile-scraping. However, I think there are some potential uses for this sort of thing. For example, why not download someone’s Twitter archive and find out which of their tweets have been the most popular? I’m working to figure out some other possible applications for this sort of thing and would love to hear any other ideas for how to put this to work!
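For the archive idea, the wrinkle is that (to my knowledge) the archive’s tweets.csv lists tweet IDs but not like or retweet counts, so you’d rebuild each tweet’s URL from its ID and then scrape it. A sketch, assuming a tweet_id column as in the older archive exports; check the column names against your own download:

```python
import csv
import io

# A stand-in for the archive's tweets.csv; the "tweet_id" column name is
# an assumption based on older archive layouts.
archive_csv = """tweet_id,text
815000000000000001,first tweet
815000000000000002,second tweet
"""

def tweet_urls(csv_text, screen_name):
    """Rebuild the public URL of each archived tweet from its ID."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [f"https://twitter.com/{screen_name}/status/{row['tweet_id']}"
            for row in rows]

for url in tweet_urls(archive_csv, "someuser"):
    print(url)  # each URL can then be fed to the scraper
```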
