Plotting Twitter users’ locations in R (part 2: geotags vs. Web scraping vs. API)

Written by: Spencer Greenhalgh

Yesterday, I mentioned discovering the French hashtag #educattentats that was created in the wake of the 13 November terrorist attacks. As far as I can tell, I discovered the hashtag shortly after it was created, so it’s been interesting to see how use of the hashtag has grown in the hours, days, and weeks since.

Inspired by a project in the class I'm taking on Internet research methods, I decided to see if I could plot locations for all of the Twitter users who have either included this hashtag in one of their own tweets or retweeted a tweet including it. My long-term goal is to split the tweets by units of time to see how use of the hashtag spread (and eventually shrank?) geographically over time. I haven't gotten that far yet, but I am happy with what I've done so far (you can find the code here). I'd like to highlight a couple of parts of this process, since they represent tricks that I plan to use in the future and that may be useful to others as well.

The easiest way to find someone’s location on Twitter is by identifying geotagged tweets, as my advisor, Dr. Matt Koehler, has done to great effect on his blog. However, of the 6000 tweets I’ve collected so far, there is just one that has a geotag (and it was collected after I started exploring the data), so geotags are of no use to me at all.

Fortunately, most Twitter users list a location in their profile. There are some obvious validity issues with this (tweeters can easily specify as their location a town they don't live in, or even say that they live on Mars), but for now, I'm choosing to assume that most Twitter users are honest and accurate when specifying a location. The twitteR package in R can easily retrieve a Twitter user's location given a username, but it does so through the Twitter API, which means that you have to be careful not to send too many requests in too short a time. I'm impatient, and this is also a pretty straightforward task, so I went another route.
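For reference, here's a rough sketch of the API route I decided against. It assumes you've already authenticated with twitteR's setup_twitter_oauth(); the pause length and the username are arbitrary placeholders, not values from my actual code:

```r
library(twitteR)

# Assumes authentication has already happened, e.g.:
# setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)

get_location_via_api <- function(username) {
  user <- getUser(username)  # one API request per user
  Sys.sleep(5)               # pause between requests to respect rate limits
  user$location              # the free-text location field from the profile
}

# Example (hypothetical username):
# get_location_via_api("some_user")
```

With thousands of usernames, those pauses add up quickly, which is part of why scraping the profile pages directly was more appealing here.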

Each Twitter user has a distinct URL associated with her profile page, so if you have a list of usernames (which is pretty easy to get from a TAGS archiver), you can easily access each of those profile pages. So, using the XML and rvest packages, I fed my code the URL for the profile page of each of the users involved with this hashtag. I used the read_html() function to get the HTML code for each of those pages, then used the html_nodes() function and XPath to find the parts of the page where the location of the user is stored. With a little bit of cleaning, I soon had myself a list of locations.
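To illustrate the extraction step without hitting Twitter itself, here's a sketch that runs the same rvest functions against a small inline HTML fragment. The span's class name mimics the markup Twitter used for the profile location field at the time; the exact class (and thus the XPath) depends on Twitter's current page source and may change:

```r
library(rvest)

# A minimal stand-in for a profile page (the class name is illustrative)
profile_html <- '
<html><body>
  <span class="ProfileHeaderCard-locationText u-dir">
    Paris, France
  </span>
</body></html>'

# read_html() parses the page (here a string; normally a profile URL)
page <- read_html(profile_html)

# html_nodes() plus XPath finds the part of the page holding the location
location_node <- html_nodes(
  page,
  xpath = "//span[contains(@class, 'ProfileHeaderCard-locationText')]"
)

# html_text() pulls out the raw string; trimws() does the cleaning
location <- trimws(html_text(location_node))
location  # "Paris, France"
```

Looping that over each profile URL in the username list produces the list of location strings described above.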

Obviously, that’s not the whole picture. All I had at this point was a list of character strings specifying some location, and that’s not enough to place them on a map unless I want to do it by hand. So, in tomorrow’s post, I’ll discuss converting those strings into mappable locations!

Hi there! My name is Spencer Greenhalgh, and I am a student in the Educational Psychology and Educational Technology doctoral program at Michigan State University. I came to Michigan State University with a strong belief in the importance of an education grounded in the humanities. As an undergraduate, I studied French and political science and worked as a teaching assistant in both fields. After graduation, I taught French, debate, and keyboarding in a Utah private school before coming to MSU, where I plan to study how technology can be used to help students connect the humanities with their lives. I have a particular interest in the use of games and simulations to promote ethical reasoning and explore moral dilemmas, but am eager to study any technology that can help students see the relevance of studying language, culture, history, and government.