Written by: Spencer Greenhalgh
Primary Source: Spencer Greenhalgh
This morning, I learned about the DataBasic suite of easy-to-use data tools from an article on the FlowingData blog. The suite offers three basic tools that let beginners cut their teeth on data science by doing some initial exploration of a dataset (either one you provide or a default dataset on the website):
- WordCounter lets you generate a very basic word cloud as well as identify commonly used words, bigrams and trigrams.
- WTFcsv will take a CSV and break down the kinds of data that are contained in its columns, just to give you a basic idea of what it looks like (if, of course, you don’t already know).
- SameDiff will compare two text-based datasets and let you know how similar or different they are.
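For readers new to the terms, a bigram is a run of two consecutive words and a trigram is a run of three. The following is a minimal Python sketch of that kind of counting. It is not how WordCounter itself is implemented (its tokenization rules aren't described here); the function names `tokenize` and `ngram_counts` and the sample sentence are made up for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into words, keeping apostrophes."""
    # Assumption: words are runs of letters, digits, and apostrophes.
    return re.findall(r"[a-z0-9'’]+", text.lower())

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    # zip over n shifted copies of the list yields each n-word window.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)

# Made-up sample text, not taken from the review dataset.
sample = "This game is fun. This game is hard, but the graphics are great."
tokens = tokenize(sample)

for n in (1, 2, 3):
    print(n, ngram_counts(tokens, n).most_common(3))
```

Note that this sketch ignores sentence boundaries, so a bigram can span the end of one sentence and the start of the next.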
I decided to try this out by giving the WordCounter tool a whirl. I’ve been sitting on a collection of Amazon.com videogame reviews for a while and decided to feed those in to see what words, bigrams, and trigrams were frequently repeated. My hope was that I could figure out what reviewers are taking into consideration when they review games; if they mention story more often than graphics, for example, educational game designers might be more interested in focusing on the former than on the latter. This didn’t entirely work out the way I expected, though.
The first hiccup I ran into was testing the limits of the “basic” in DataBasic. The dataset I was working with had over a million reviews in it, and I had enough trouble copying the review text from a JSON file to plain text, the only file type that the WordCounter takes. So, I decided to work with a smaller dataset. Subsets of 22,000, 10,000, and 5,000 reviews all broke the tool, but it seemed to handle 2,000 reviews okay. This isn’t a criticism of the tool, of course. I’d been hoping that it would be powerful enough for me to do some initial exploratory work without needing to write an R script, but it’s meant to be a basic introduction to data science, so I can’t blame it for not handling a large dataset.
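The JSON-to-plain-text step and the cut down to 2,000 reviews can be sketched in a few lines of Python. This is only a sketch under assumptions: it assumes the file holds one JSON object per line and that the review text sits in a field called `reviewText`. Both are guesses about the file layout, and `sample_review_text` is a hypothetical helper name.

```python
import json
import random

def sample_review_text(json_lines, field="reviewText", k=2000, seed=0):
    """Return the text of up to k randomly chosen reviews, one per line."""
    texts = []
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        text = record.get(field)  # assumed field name; adjust to the real file
        if text:
            # Collapse internal newlines so each review stays on one line.
            texts.append(" ".join(text.split()))
    if len(texts) > k:
        # Fixed seed so the same subset comes back on every run.
        texts = random.Random(seed).sample(texts, k)
    return "\n".join(texts)

# Tiny made-up input standing in for the real file.
demo_lines = [
    '{"reviewText": "Great graphics.\\nFun story."}',
    '{"reviewText": ""}',
    '',
    '{"summary": "no text here"}',
    '{"reviewText": "Too hard."}',
]
result = sample_review_text(demo_lines, k=2)
print(result)
```

With a real file, an open file handle could be passed as `json_lines` and the returned string written to a `.txt` file for upload. Sampling at random, rather than taking the first 2,000 lines, avoids skewing the subset toward whatever happens to come first in the file.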
The second hiccup has less to do with the tool itself than with the words, bigrams, and trigrams that emerged from it. You can see the results of my exploration here (kudos to DataBasic for making a share-able results page part of the tool!). There wasn’t as much text on specific game features as I was expecting (though “graphics” appeared 407 times and the specific trigram “the graphics are” appeared 64 times). Instead, I got a lot of “stock phrases” for game reviews. Common patterns included words like “game” and “it’s,” bigrams like “this game” and “the game,” and trigrams like “this game is” or “you have to.” It’s actually pretty interesting to see how common some of these stock phrases are… it just doesn’t help with what I would want to learn from the text.
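One way around the stock phrases would be to count only a chosen list of feature words instead of everything. The sketch below shows the idea in Python. The term list, the function name `feature_mentions`, and the three sample reviews are all hypothetical, and nothing here reproduces the counts reported above.

```python
import re
from collections import Counter

# Hypothetical list of game features to look for.
FEATURE_TERMS = {"graphics", "story", "controls", "music"}

def feature_mentions(reviews, terms=FEATURE_TERMS):
    """Count total mentions of each term and how many reviews mention it."""
    totals = Counter()
    reviews_with = Counter()
    for review in reviews:
        words = re.findall(r"[a-z']+", review.lower())
        hits = Counter(word for word in words if word in terms)
        totals.update(hits)               # every occurrence
        reviews_with.update(hits.keys())  # at most once per review
    return totals, reviews_with

# Made-up reviews for illustration only.
reviews = [
    "The graphics are great, and the graphics hold up.",
    "Good story, weak graphics.",
    "You have to play this game.",
]
totals, reviews_with = feature_mentions(reviews)
print(totals)
print(reviews_with)
```

Tracking both numbers helps separate a feature that many reviewers bring up from one that a few reviewers mention repeatedly.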
Again, though, just because DataBasic doesn’t help me with what I want to do doesn’t mean that it isn’t worth exploring. The website makes it sound like they have educational materials designed for everyone from middle schoolers to adults, so this could be a great way to introduce some key data concepts to beginners.