Written by: Josh Rosenberg
Primary Source: Joshua M. Rosenberg
Comparing discussions on Twitter from two science education conferences
My friend recommended an article by Sherin (2013), which got me interested in a simple natural language processing (NLP) technique. I had used TAGS to archive tweets from the NARST conference and became interested in comparing them with tweets from the NSTA conference. NARST consists primarily of presentations from researchers, whereas NSTA consists primarily of presentations from teachers. I used Tweet Archivist to access tweets from the NSTA conference.
I wondered whether I could compare the two to understand what teachers and researchers discuss on Twitter at science education conferences and whether inferences about what is important to each can be drawn from them. First, I calculated some frequencies: It looks like #nsta15 was used 14,188 times and #narst15 was used 627 times. Because there were so many #nsta15 tweets, I randomly sampled 627 of them, exactly the number collected for #narst15, which made the analysis a lot less demanding on my computer.
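The down-sampling step can be sketched in a few lines. This is a minimal illustration in Python rather than the R code I actually ran; the function name `sample_to_match`, the seed, and the placeholder tweets are all assumptions made up for the example.

```python
import random

def sample_to_match(larger, target_size, seed=2015):
    """Draw target_size items from larger, without replacement."""
    rng = random.Random(seed)  # fixed seed (an arbitrary choice) so the draw is repeatable
    return rng.sample(larger, target_size)

# Placeholder strings standing in for the archived tweets (not the real data).
nsta_tweets = [f"placeholder nsta tweet {i}" for i in range(14188)]
narst_count = 627

nsta_sample = sample_to_match(nsta_tweets, narst_count)
print(len(nsta_sample))
```

Sampling without replacement means no tweet is counted twice, so the two hashtags end up with the same number of distinct tweets.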
The statistical NLP technique
Using the text mining, or tm, package in R, I processed the tweets by removing common words (called “stopwords” in the tm package), punctuation, numbers, URLs, and usernames, and by “stemming”, which reduces words that share a stem, such as “technology”, “technologies”, and “technological”, to a single term. For each hashtag, I then created a term document matrix, which has rows representing every word (or “term”) included in any tweet and columns representing every tweet (or “document”). Each cell, where a row and a column intersect, holds the number of times that word was included in that tweet. For example, a tweet that included the words “could”, “please”, “share”, “video”, “presentation”, “moderated”, and “argumentation” would occupy one column that contains ones for the rows for each of those words and zeroes for all the other rows.
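To make the shape of a term document matrix concrete, here is a rough sketch in Python rather than R. It is not the tm package's implementation: the stopword list is a tiny made-up one, the example tweets and the names `tokenize` and `term_document_matrix` are hypothetical, and stemming is left out entirely.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; real stopword lists are much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "on", "for", "you", "is"}

def tokenize(tweet):
    """Lowercase a tweet and drop URLs, usernames, numbers, punctuation, and stopwords."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    text = re.sub(r"@\w+", " ", text)          # usernames
    text = re.sub(r"[^a-z\s]", " ", text)      # numbers and punctuation
    return [word for word in text.split() if word not in STOPWORDS]

def term_document_matrix(tweets):
    """Return (terms, matrix): one row per term, one column per tweet."""
    counts = [Counter(tokenize(t)) for t in tweets]
    terms = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in terms]
    return terms, matrix

# Made-up example tweets, not taken from either archive.
tweets = [
    "Could you please share the video of the presentation? @someone http://example.com/1",
    "Great presentation on argumentation!",
]
terms, tdm = term_document_matrix(tweets)
for term, row in zip(terms, tdm):
    print(term, row)
```

Each printed row is one term and each position in the row is one tweet, so “presentation” gets a count in both columns while “video” only gets one in the first.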
Next, I combined the two term document matrices to create one matrix for all of the content from both hashtags. I removed “sparse” words, or words that occurred very infrequently, to create a term document matrix with 634 words and 1,254 tweets. I then used this matrix representing the content of all of the tweets from both hashtags to group together similar tweets using the k-means algorithm, which minimizes the differences among tweets in the same group and maximizes the differences between tweets in different groups. However, the k-means algorithm does not determine how many clusters are in the data. To determine the number of clusters, I compared the within-cluster sum of squares and found that it decreased with additional clusters until around the seventh. Beyond that point, adding clusters did not meaningfully reduce the within-cluster sum of squares, so I chose seven. I gave the clusters names based on their 10 most frequently included words. Cluster one had 18 tweets, two 71, three 74, four 332, five 222, six 168, and seven 369. Here they are as wordclouds:
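Separately from the wordclouds, the way the number of clusters was chosen can be sketched as follows. This is a plain-Python version of Lloyd's algorithm, which is a common k-means variant and not necessarily the one my R code used. The two-dimensional toy points, the number of restarts, and the function names are all assumptions for illustration; the real data had 634 dimensions.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def kmeans(points, k, seed=0, max_iter=100):
    """Lloyd's algorithm; returns (labels, centers, within-cluster sum of squares)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct data points
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster ends up empty
                centers[j] = mean(members)
    wcss = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, centers, wcss

def best_wcss(points, k, restarts=10):
    """Lowest within-cluster sum of squares over several random starts."""
    return min(kmeans(points, k, seed=s)[2] for s in range(restarts))

# Toy data: three visibly separate groups of points.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10], [20, 0], [20, 1], [21, 0]]
for k in range(1, 6):
    print(k, round(best_wcss(points, k), 2))
```

Reading down the printed values, the point where the within-cluster sum of squares stops dropping sharply is the “elbow”, which is the same judgment call I made when settling on seven clusters.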
Each cluster, then, represents a group of similar tweets. For each cluster, I calculated the average number of times each word was included across its tweets, so that each cluster was represented by a vector with 634 values, one per word. I did the same for each hashtag, so, like the clusters, each hashtag was represented as a vector with 634 values that represent the average number of times each word was included in a tweet, also as wordclouds:
Next, I compared each cluster of tweets to the tweets from each hashtag using cosine similarity, which seems similar in practice to correlation. Sherin described cosine similarity as a process of comparing two vectors that is similar to calculating the cosine of the angle between the adjacent side and the hypotenuse of a right triangle (I had to look up SOH-CAH-TOA). If the adjacent side and hypotenuse point in exactly the same direction, the value of their cosine is one. If, for instance, the first and second values of two vectors are the same, then they will point in the same direction when plotted and the value of their cosine is also one.
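A small sketch may help here. It is written in Python rather than R, the function names and the three-word toy vectors are made up for illustration, and treating an all-zero vector as having similarity zero is simply a convention chosen for this example.

```python
import math

def mean_vector(tweet_vectors):
    """Average the per-word counts over a group of tweets (one list of counts per tweet)."""
    n = len(tweet_vectors)
    return [sum(values) / n for values in zip(*tweet_vectors)]

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Toy word counts for three words; the real vectors had 634 values.
cluster_tweets = [[1, 0, 1], [1, 0, 0]]
hashtag_tweets = [[2, 0, 1], [0, 1, 0], [1, 0, 0]]

cluster_vector = mean_vector(cluster_tweets)
hashtag_vector = mean_vector(hashtag_tweets)
print(cosine_similarity(cluster_vector, hashtag_vector))
```

Because the word counts are never negative, the result always falls between zero and one. Two vectors that are multiples of each other, such as [1, 2] and [2, 4], point in the same direction and give a value of one, while vectors that share no words give zero.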
Comparison of each cluster of tweets to tweets from each hashtag
If the values compared between two vectors are highly similar, then the value of their cosine will get closer and closer to one, and if the values are not similar at all, then the value of their cosine will get closer and closer to zero. This comparison is extended to the number of dimensions of the vector, which in this case is 634, to compare each cluster to each hashtag:
As an important caveat, the names I gave the clusters are arbitrary and somewhat generous in interpretation; I called the sixth cluster “Enhancing interest” because of the words “fun”, “game”, and “excited”, but the word “fun” also occurs in “Research”, for example. Also, some of the stemmed words are unclear, and the differences in the cosines may or may not be statistically significant. I would like to look deeper, using this as a first step, into whether, for example, NGSS-related topics are more commonly discussed by researchers than teachers and whether NSTA presentations are more frequently about enhancing interest.