Kludging: Web to TXT

Written by: Thomas Padilla

Primary Source : Thomas Padilla, August 3, 2015

Text analysis projects share in common 3 challenges. First, data of interest must be found. Second, data must be gettable. Third, if it’s not already formed according to wildest dreams, ways must be known of getting data into a state that they are readily usable with desired methods and tools. While surmounting these challenges are typically not a favored part of the process, more experienced Digital Humanists often have programmatic means of getting data and transforming it in such a way that it suits their needs. These means are not inaccessible to beginners but the path from DH interest to DH exploration is sometimes better wended via a route that poses the least resistance. In what follows, I will describe a method that kludges together a couple of different easy to use tools to download web pages en masse, remove markup, and convert them to .txt.


What you need

Firefox
DownThemAll! Tools – Firefox extension for bulk downloading content from the web
Terminal – OS X command line interpreter
Textutil – command line text utility


Use Case – British Women Romantic Poets, 1789-1832

British Women Romantic Poets, 1789-1832 is a wonderful collection of 169 texts, digitized and encoded by UC Davis General Library. From a temporal and thematic standpoint this collection has a measure of cohesion that lends itself well to text analysis. Accordingly, you’d like to begin exploring the data but first you need to get the data and move it into a state where it is usable.

Challenge 1: Getting data

Examination of the access point indicates that item by item interaction is privileged with no clearly marked mechanism for downloading content.

Screen Shot 2015-08-03 at 7.59.10 AM

We could right click and save each HTML link but who has time for that?
We get around this through utilization of DownThemAll! Tools.

– Open Firefox
– Open DownThemAll! Tools by following this path in the Tools menu: Tools > DownThemAll! Tools > DownThemAll

Screen Shot 2015-08-03 at 8.08.43 AM

You will then be presented with a window that displays 346 available links on the British Women Romantic Poets page. Of those 346, 169 link to a .htm (precursor to .html) version of a text.

In the DownThemAll window designate a folder to save the 169 .htm files in. Using fast filtering, designate .htm as a filter for the links. You should now see .htm files highlighted in the download section. When all looks good, click ‘Start’.

Screen Shot 2015-08-03 at 8.09.19 AM

 

After downloading these files you’ll have a folder with 169 .htm files. Each file corresponds to a text in the collection.

Screen Shot 2015-08-03 at 8.09.32 AM

Challenge 2: Making the data usable (markup removal and format conversion)

Eureka! Not so fast . . . opening one of the .htm files in the browser is misleading. If we open the .htm file in a text editor for comparison we see that there is a whole lot of markup to remove.

Screen Shot 2015-08-03 at 9.35.01 AM

 

Most computers running OS X come with a small but powerful tool called Textutil. We’ll use this to iterate over every document in our collection, creating 169 .txt files stripped of markup.

– Open Terminal
– Navigate to the folder that contains the .htm files, e.g. cd/Desktop/bwrp

Convert each .htm file in the directory to a .txt file, remove markup

textutil -convert txt *.htm

This will take a bit to run, but relatively soon after we’ll have our data.

Screen Shot 2015-08-03 at 10.24.41 AM

Side by side comparison shows we have successfully converted our .htm files to .txt with all markup removed.

Screen Shot 2015-08-03 at 10.28.43 AM


And there you have it, an easy method to bulk download webpages, remove markup, and convert to .txt.

The following two tabs change content below.
Thomas Padilla
Thomas Padilla is Digital Humanities Librarian at Michigan State University Libraries. Prior to his move to Michigan he was at the University of Illinois at Urbana Champaign working at the Scholarly Commons and the Preservation Unit of the University Library. Prior to that he was at the Library of Congress doing digital preservation outreach and education. Thomas maintains diverse interests in digital humanities, digital preservation, data curation, archives, History, and interdisciplinarity. His work and projects often map to these areas of interest.