A Clunky Data Collection Process and Viz Update on a Recreational Project
After seeing Randy Rainbow’s live comedy show, I was curious about how he got started. Exploring his videos on YouTube inspired me to create a data visualization of his work’s evolution. It’s a work in progress.
This article covers my current data collection process for this project, which includes both computational and ‘by-hand’ steps.
Note: I’m writing this for my own reference and sharing it in case it might be useful or entertaining to anyone else. The process is a bit clunky, but this is a recreational project.
My current data set update process:
1. Randy Rainbow Releases a New Video
→ Find the new video on YouTube.
Note: Recently, he’s been linking to his videos on Facebook, but because my initial data collection started with YouTube as the venue, I’m using the YouTube link in my data set.
2. Add New Video to Data Set
→ In the spreadsheet: add a new row for the video, and copy the video title and link into it.
The data set isn’t very big, so I just keep it in a spreadsheet.
3. Collect Updated Views Data
→ Add a new column in the spreadsheet for views as of today’s date
→ Record the current view count in this column for all the videos
I check each video’s view count on YouTube and save it to the spreadsheet under ‘views_as_of_’ plus the current date.
I get the current view counts for all the videos I’ve been tracking. While I only use the most recent count in the visualization, I do keep a record of the values from past updates to the data set.
This is a clunky by-hand process: visiting each video’s page and copying the “views” into the spreadsheet. I’ve recently discovered a great command-line tool (youtube-dl) and developed some Python code that uses it to gather this data for me automatically. I’m currently working on integrating this into my data collection workflow on this project.
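My integration is still in progress, but the core idea can be sketched roughly like this: youtube-dl’s `--dump-json` flag prints a video’s metadata (including a `view_count` field) as JSON without downloading anything. The function names and the date-stamped column label below are my own, not part of any finished workflow.

```python
import json
import subprocess
from datetime import date


def parse_view_count(dump_json_text):
    """Pull the view count out of youtube-dl's --dump-json output."""
    return json.loads(dump_json_text)["view_count"]


def fetch_view_count(url):
    """Run youtube-dl for one video URL and return its current view count."""
    result = subprocess.run(
        ["youtube-dl", "--dump-json", url],
        capture_output=True, text=True, check=True,
    )
    return parse_view_count(result.stdout)


def views_column(urls):
    """Build a date-stamped column name and a {url: views} mapping,
    ready to paste into the spreadsheet."""
    label = "views_as_of_" + date.today().isoformat()
    return label, {url: fetch_view_count(url) for url in urls}
```

Looping over every tracked URL this way replaces the visit-each-page step; the date-stamped label matches the ‘views_as_of_’ column convention above.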
4. Video Transcripts
Many of the videos have transcripts, and I’ve incorporated this data into the viz as little word-bursts when the user mouses over the bubble for a video.
If there’s a transcript of the video, I save it as a text file and process it with an NLP Python script that I run from the command line. I’m planning to write more about this another time.
I keep all of the transcript files in a single folder, and the script loops over them and generates a json file with the resulting analysis: the most common words in each transcript. These are used in the visualization’s mouse-triggered word-bursts.
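My actual NLP script is a topic for another post, but the loop-over-a-folder → JSON shape it follows can be sketched like this (file names, stopword list, and function names here are illustrative, not my real script):

```python
import json
import re
from collections import Counter
from pathlib import Path

# A tiny illustrative stopword list; a real script would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "on", "is", "it", "i", "you"}


def top_words(text, n=10):
    """Return the n most common non-stopword words in a transcript."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [word for word, _ in counts.most_common(n)]


def build_wordburst_json(transcript_dir, out_path, n=10):
    """Loop over transcript .txt files and write one JSON file mapping
    each transcript name to its most common words."""
    result = {
        path.stem: top_words(path.read_text(encoding="utf-8"), n)
        for path in sorted(Path(transcript_dir).glob("*.txt"))
    }
    Path(out_path).write_text(json.dumps(result, indent=2), encoding="utf-8")
```

The resulting JSON keys are the transcript file names, which is convenient because (as noted below in step 5) the transcript name doubles as an id in the viz code.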
A video that doesn’t have a transcript when I add it to my data set might have one added later. So when a new video is released, I go back and check whether any transcripts have been added since I last checked.
5. Default Annotation
In the viz, I have a little annotation that displays the title and some info on the video, with a line connecting it to the bubble for that video. By default, I set the outer-left bubble as the one the annotation points to, because I think it looks better than pointing to a random bubble. Yes, I could write code to select the farthest-left node… someday. If I get stuck on every little detail of a playful project, I’ll never get to the fun parts.
Every time I add a new video to the data set, the arrangement of bubbles in the chart changes a bit. A bubble that was on the edge of the cluster may end up in the middle. So I need to update which bubble the viz automatically shows as the annotated bubble. In addition to being positioned on the outer left of the cluster, I also want the default bubble to have some extra data (a transcript).
So, when I’ve updated all the data, I test the viz in the browser via a local server and look for the bubble that is best positioned and also has a transcript. This is important, because I use the transcript name as an id for the bubble in my code.
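For the record, the “someday” automation above is a one-liner kind of problem. A minimal sketch, assuming each bubble is a dict with an `x` position, an `id`, and a `has_transcript` flag (all hypothetical field names, since my viz doesn’t expose them this way yet):

```python
def pick_default_bubble(nodes):
    """Pick the left-most bubble that has a transcript.

    `nodes` is a list of dicts like
    {"id": ..., "x": ..., "has_transcript": ...}.
    Returns the chosen bubble's id, or None if no bubble qualifies.
    """
    candidates = [n for n in nodes if n.get("has_transcript")]
    if not candidates:
        return None
    return min(candidates, key=lambda n: n["x"])["id"]
```

Filtering on the transcript flag first means the selected bubble always satisfies both criteria: it has the extra data, and among those that do, it sits farthest left.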
6. What’s next?