Written by: Josh Rosenberg
Primary Source: Joshua M. Rosenberg, November 23, 2015
Over the past week, my Facebook feed has been filled with word clouds that represent all the terms people have used in their Facebook status updates.
And there’s a problem with them.
The problem isn’t that they’re shaped sort of funny; it’s what they don’t do. This article says it best, although I’ll say a bit more, too:
At The New York Times, we strongly believe that visualization is reporting, with many of the same elements that would make a traditional story effective: a narrative that pares away extraneous information to find a story in the data; context to help the reader understand the basics of the subject; interviewing the data to find its flaws and be sure of our conclusions. Prettiness is a bonus; if it obliterates the ability to read the story of the visualization, it’s not worth adding some wild new visualization style or strange interface. Of course, word clouds throw all these principles out the window. (Harris, Nieman Lab)
Harris wrote about how word clouds can ignore the principles of data reporting. There is a further problem: they obscure the data’s underlying structure.
We are used to working with spreadsheets in which columns represent variables and rows represent the different things we measure. We are less used to working with text data, but text data commonly has a similar underlying structure: the document-term matrix. In this matrix, rows represent different chunks of text, like documents or Facebook status updates; columns represent the terms that occur across all documents; and cells represent how often each term occurs in each document. It’s just like a spreadsheet with variables and things we measure, only put to a different use.
Facebook word clouds group all of the status updates together, making it impossible to determine which terms occur with which, or how terms may bunch together to form topics. So they ignore this structure, from which you can make word clouds but also representations that carry more information, as in this analysis of the speech of political candidates. Understanding the underlying structure of text data is what makes such representations easy to build. Word clouds are pretty but may not tell a story, and understanding text, or any other data, means knowing how the data’s structure can help us tell one.
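To make the document-term matrix concrete, here is a minimal sketch in Python using only the standard library. The three status updates are invented for illustration, the tokenizer is deliberately naive, and the names `dtm`, `totals`, and `cooccur` are my own; this is not how Facebook's word cloud app or any particular text-mining library does it.

```python
from collections import Counter

# Hypothetical status updates, invented for illustration only.
updates = [
    "coffee before class",
    "grading papers with coffee",
    "class cancelled so more coffee",
]

# Naive tokenizer: lowercase and split on whitespace. A real analysis
# would also handle punctuation and common stop words.
tokens = [update.lower().split() for update in updates]

# Columns: every distinct term that occurs across all documents.
terms = sorted({term for doc in tokens for term in doc})

# Rows: one per document. Cells: how often each term occurs in it.
counts = [Counter(doc) for doc in tokens]
dtm = [[c[term] for term in terms] for c in counts]

# What a word cloud keeps: the column totals and nothing else.
totals = {term: sum(row[j] for row in dtm) for j, term in enumerate(terms)}


def cooccur(a, b):
    """Count the documents in which terms a and b both appear."""
    ia, ib = terms.index(a), terms.index(b)
    return sum(1 for row in dtm if row[ia] > 0 and row[ib] > 0)


print(totals)
print(cooccur("coffee", "class"))
```

The `totals` dictionary is all a word cloud has to work with. The `cooccur` function needs the rows, which is exactly the information lost when every status update is pooled into one bag of words.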