YouTube Data for Research

Written by: Thomas Padilla

Primary Source: Thomas Padilla

Sometimes I interact with folks interested in digital projects that entail some form of video analysis. These noble hypothetical folk, whether they know it or not, join a quest to augment Digital Humanities discourse with a format that doesn’t get enough attention. Brave souls. Just for starters, video data sources can be tough to gain access to. For projects of this type Internet Archive (IA) is a shining star amidst a sea of negligible possibilities. Numerous (>1.9 million videos) and diverse content, combined with some excellent tutorials, makes it relatively easy to bulk download IA data. A boon for video research. Up until a couple of days ago I would not have thought to consider YouTube as a source worth anything but headaches, or perhaps a laugh at a fainting goat, but a neat command line tool called Youtube-dl definitely changed that. For research purposes, Youtube-dl is a game changer if you want all video content produced by The White House, video content that matches the search query ‘Dragons’, a single video, or a custom playlist of videos, perhaps with associated text transcripts. Knowledge of it may make you start spinning in circles right at this very moment.

Enumerated game changing qualities:

  1. Low barrier to use – no programming chops needed to get started
  2. Easy to scale – download one video, download all videos from a playlist, download one/some/all videos created by a user (e.g. The White House), download video(s) that match a search query
  3. Manage data – impose file naming conventions on collected data derived from various components of the file (e.g. dateuploaded_user_title.mp4)
  4. Granular control – specify video format and quality, control dataset size (e.g. download up to 1 GB of data and stop)
  5. More than video – download one or all available text transcripts, extract audio from video

In what follows I’ll work through how to install Youtube-dl and implement some of the awesome discussed above.


What You Need

Homebrew (brew) – package management system for macOS, basically makes it easier for you to install software
FFmpeg – lets you manipulate multimedia content, basically all the things of multimedia work
Youtube-dl – command line program for downloading YouTube content


Installing FFmpeg and Youtube-dl

– Open Terminal
– Enter the following commands

brew install ffmpeg
brew install youtube-dl
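
Before moving on, it can help to confirm that the installs landed somewhere Terminal can find them. What follows is a minimal sketch, assuming a POSIX-style shell; have_tool is a helper name I made up for illustration, not part of Homebrew or Youtube-dl.

```shell
# have_tool is a hypothetical helper: succeeds if the named command is on the PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Report on each of the three tools this walkthrough relies on.
for tool in brew ffmpeg youtube-dl; do
    if have_tool "$tool"; then
        echo "found:   $tool"
    else
        echo "missing: $tool"
    fi
done
```

Anything reported as missing probably needs its brew install command run again.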

Use Case – Building a White House dataset 

You want to capture video related to the Obama Presidency. Starting with the Inaugural Address is probably as good a place as any. Maybe you want to study characteristics of video composition (video data), perform some audio analysis (audio data), and maybe even consider a text analysis of the inaugural speech (text data). Eventually you might even decide you want video produced by The White House between a certain period of time. Perhaps you might also want to build a playlist related to White House coverage of Ferguson and download that – videos, video descriptions, audio, and subtitle text data. What follows should give you what you need to approach all of the above.

– Create a folder to contain files you capture
– Open Terminal
– In Terminal navigate to the folder you created, e.g. cd ~/Desktop/youtubedl/whitehouse

After making your way to the folder, you have a number of different ways to use Youtube-dl:

Single item, default to highest quality video

youtube-dl "https://www.youtube.com/watch?v=3PuHGKnboNY"

Single item, with file naming conventions imposed

youtube-dl --restrict-filenames -o "%(upload_date)s.%(uploader)s.%(title)s.%(ext)s" "https://www.youtube.com/watch?v=3PuHGKnboNY"
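
The -o template is the piece doing the naming work here: upload_date, uploader, title, and ext are fields Youtube-dl fills in for each video, and --restrict-filenames keeps the result to plain ASCII without spaces. If you would like to see the name before committing to a download, something like the sketch below may help. It leans on Youtube-dl's --get-filename option, which simulates the run and prints the file name. The names preview_filename, YTDL, and TEMPLATE are my own, chosen for illustration.

```shell
# YTDL is a hypothetical variable so the program name can be swapped out.
YTDL="${YTDL:-youtube-dl}"

# Same naming template used in the command above.
TEMPLATE='%(upload_date)s.%(uploader)s.%(title)s.%(ext)s'

# preview_filename URL: print the file name a download would use, without downloading.
preview_filename() {
    "$YTDL" --get-filename --restrict-filenames -o "$TEMPLATE" "$1"
}

# Example call (needs a network connection), left commented out:
# preview_filename "https://www.youtube.com/watch?v=3PuHGKnboNY"
```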

Single item, extract audio 

youtube-dl --restrict-filenames --extract-audio --audio-format "mp3" -o "%(upload_date)s.%(uploader)s.%(playlist)s.%(title)s.%(ext)s" "https://www.youtube.com/watch?v=3PuHGKnboNY"

Single item, extract subtitles

youtube-dl --restrict-filenames --all-subs -o "%(upload_date)s.%(uploader)s.%(playlist)s.%(title)s.%(ext)s" "https://www.youtube.com/watch?v=3PuHGKnboNY"
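
If text analysis is the goal, the subtitle files will likely need a little cleanup first. They typically arrive as WebVTT (.vtt) files, with a header and timestamps mixed in with the words. The sketch below is one rough way to strip those out using awk. vtt_to_text is a hypothetical helper of my own, and it assumes the spoken text only begins after the first line containing "-->". Auto-generated captions may repeat lines across cues, so they could need further deduplication.

```shell
# vtt_to_text FILE: print only the caption text from a WebVTT subtitle file.
vtt_to_text() {
    awk '
        /-->/            { seen = 1; next }      # timing line: marks the start of a cue
        !seen            { next }                # skip header lines before the first cue
        /^[[:space:]]*$/ { next }                # skip blank separator lines
        { gsub(/<[^>]*>/, ""); print }           # drop inline tags, keep the words
    ' "$1"
}

# Example call, assuming a subtitle file with this (hypothetical) name exists:
# vtt_to_text inaugural.en.vtt > inaugural.txt
```

To grab subtitles without the video, Youtube-dl's --skip-download option can be combined with --write-sub and --sub-lang en.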

Multiple items, download content from user between dates 

youtube-dl --dateafter 20150101 --datebefore 20150107 --restrict-filenames -o "%(upload_date)s.%(uploader)s.%(playlist)s.%(title)s.%(ext)s" https://www.youtube.com/user/whitehouse
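
Pulling everything from a channel can balloon quickly, so it may be worth putting a ceiling on things. Here is a rough sketch, assuming Youtube-dl's --max-downloads, --max-filesize, and -f options behave as documented. The function name capped_download, the 200m cap, and the 480 pixel height are arbitrary choices of mine, not recommendations. Note that --max-filesize applies to each file, not to the dataset as a whole.

```shell
# YTDL is a hypothetical variable so the program name can be swapped out.
YTDL="${YTDL:-youtube-dl}"

# capped_download COUNT URL: stop after COUNT downloads, skip any file over
# 200 MB, and prefer a single-file format no taller than 480 pixels.
capped_download() {
    "$YTDL" --max-downloads "$1" --max-filesize 200m \
        -f 'best[height<=480]' \
        --dateafter 20150101 --datebefore 20150107 \
        --restrict-filenames \
        -o '%(upload_date)s.%(uploader)s.%(playlist)s.%(title)s.%(ext)s' \
        "$2"
}

# Example call (needs a network connection), left commented out:
# capped_download 10 "https://www.youtube.com/user/whitehouse"
```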

Multiple items, search query ‘obama and ferguson’ – five videos

youtube-dl --restrict-filenames "ytsearch5:obama and ferguson"

Multiple items, build a playlist – download video, video descriptions, audio, and subtitles

youtube-dl --restrict-filenames --write-description --extract-audio --audio-format "mp3" -k --all-subs -o "%(upload_date)s.%(uploader)s.%(playlist)s.%(title)s.%(ext)s" "https://www.youtube.com/playlist?list=PLf7yYLO8w1_lSVBqeZmp17dy7kvql6qTP"
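
For a collection you plan to revisit, one possible refinement is to wrap the playlist command in a small function that keeps a record of what has already been fetched. The sketch below assumes Youtube-dl's --download-archive option (skip anything listed in the archive file), --write-info-json (save metadata alongside each video), and --ignore-errors (carry on past unavailable videos). The names collect and downloaded.txt are hypothetical.

```shell
# YTDL is a hypothetical variable so the program name can be swapped out.
YTDL="${YTDL:-youtube-dl}"

# collect URL...: download video, audio, descriptions, subtitles, and metadata,
# recording finished items in downloaded.txt so a rerun skips them.
collect() {
    "$YTDL" --restrict-filenames --ignore-errors \
        --download-archive downloaded.txt \
        --write-info-json --write-description --all-subs \
        --extract-audio --audio-format mp3 -k \
        -o '%(upload_date)s.%(uploader)s.%(playlist)s.%(title)s.%(ext)s' \
        "$@"
}

# Example call (needs a network connection), left commented out:
# collect "https://www.youtube.com/playlist?list=PLf7yYLO8w1_lSVBqeZmp17dy7kvql6qTP"
```

Rerunning the same call later should then only fetch items added to the playlist since the last run.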


And there you have it. YouTube data for research. After working through the above, consider some of Youtube-dl’s more advanced features.

Thomas Padilla
Thomas Padilla is Digital Humanities Librarian at Michigan State University Libraries. Prior to his move to Michigan he was at the University of Illinois at Urbana-Champaign working at the Scholarly Commons and the Preservation Unit of the University Library. Prior to that he was at the Library of Congress doing digital preservation outreach and education. Thomas maintains diverse interests in digital humanities, digital preservation, data curation, archives, history, and interdisciplinarity. His work and projects often map to these areas of interest.