A Clunky Data Collection Process and Viz Update on a Recreational Project

After seeing Randy Rainbow’s live comedy show, I was curious about how he got started. Exploring his videos on YouTube inspired me to create a data visualization of how his work has evolved. It’s a work in progress.

This article covers some of my current data collection process for this project. It includes both computational and ‘by-hand’ steps.

Note: I’m writing this for my own reference and sharing it in case it’s useful or entertaining to anyone else. The process is a bit clunky, but this is a recreational project.

Image: Whimsical Data Visualization of Randy Rainbow’s Rising Popularity

My current data set update process:

1. Randy Rainbow releases a new video.

→ Find the new video on YouTube.

Note: Recently, he’s been linking to his videos on Facebook, but because my data collection started with YouTube as the venue, I use the YouTube link in my data set.

2. Add new video to data set

→ In spreadsheet: add a new row for the video, and copy video title and link into it.

The data set isn’t very big, so I just keep it in a spreadsheet.

Image: Keeping track of videos in a spreadsheet.

3. Collect Updated Views Data

→ Add a new column in the spreadsheet for views as of today’s date

→ Record the current view count in this column for all the videos

→ Download the updated spreadsheet as a CSV file, and save it in the same folder as the JavaScript for the viz.

I check the view count of each video on YouTube and save it to the spreadsheet in a column named ‘views_as_of_’ plus the current date.

I get the current view counts for all the videos I’ve been tracking. While I only use the most recent count in the visualization, I keep a record of what the values were in past updates to the data set.
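The column-adding step could be scripted along these lines. This is only a sketch under assumptions: the ‘link’ column name, the ISO date suffix on the heading, and the function name are all hypothetical, since the actual spreadsheet layout isn’t shown here.

```python
import csv
from datetime import date


def add_views_column(in_path, out_path, views_by_link, as_of=None):
    """Copy a CSV, appending a 'views_as_of_<date>' column.

    views_by_link maps each video link to its current view count.
    Assumes the spreadsheet has a 'link' column (a hypothetical name).
    """
    as_of = as_of or date.today()
    new_col = f"views_as_of_{as_of.isoformat()}"

    with open(in_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        fieldnames = list(reader.fieldnames) + [new_col]
        rows = list(reader)

    for row in rows:
        # Leave the cell blank when no count was collected for this video.
        row[new_col] = views_by_link.get(row["link"], "")

    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return new_col
```

Older ‘views_as_of_’ columns are copied through untouched, so the history of past counts stays in the file.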

This is a clunky, by-hand process: visiting each video’s page and copying the “views” to the spreadsheet. I recently discovered a great command-line tool (youtube-dl) and wrote some Python code that uses it to gather this data for me automatically. I’m currently working on integrating this into my data collection workflow for this project.
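For reference, here is a rough sketch of what gathering view counts with youtube-dl’s Python interface can look like. It is not the script mentioned above. The function names are made up for illustration, and it assumes the youtube_dl package is installed and that the metadata it returns includes a ‘view_count’ field for the video.

```python
def fetch_view_count(url, extract_info=None):
    """Return the view count youtube-dl reports for a video URL, or None."""
    if extract_info is None:
        # Imported lazily so the function can be exercised without the package.
        import youtube_dl

        ydl = youtube_dl.YoutubeDL({"quiet": True})

        def extract_info(u):
            # download=False asks for metadata only, not the video file.
            return ydl.extract_info(u, download=False)

    info = extract_info(url)
    return info.get("view_count")


def fetch_all(urls, extract_info=None):
    """Map each URL to its view count, skipping videos that fail to load."""
    counts = {}
    for url in urls:
        try:
            counts[url] = fetch_view_count(url, extract_info)
        except Exception as err:  # e.g. a removed or private video
            print(f"Skipping {url}: {err}")
    return counts
```

The optional `extract_info` argument is only there so the logic can be tried with a stand-in function instead of a live network call.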

→ In viz JavaScript: update the value of ‘views_date’ with this new column heading.

Adding the new view-count column requires that I update the JavaScript to use its heading from the CSV. Eventually, I’m thinking of showing the view counts over time, but for now the history is just in the data, and the viz shows only the most recent view counts.
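If the column headings end in a sortable date, the newest one can be found automatically instead of being edited in by hand. The sketch below shows the idea in Python, as a preprocessing step. It assumes headings like ‘views_as_of_2020-03-01’ (the ISO date format is an assumption), and the same logic would need to be ported to JavaScript to run inside the viz.

```python
from datetime import date

PREFIX = "views_as_of_"


def latest_views_column(fieldnames):
    """Return the 'views_as_of_' heading with the most recent date, or None."""
    dated = []
    for name in fieldnames:
        if not name.startswith(PREFIX):
            continue
        try:
            parsed = date.fromisoformat(name[len(PREFIX):])
        except ValueError:
            continue  # heading does not end in a YYYY-MM-DD date
        dated.append((parsed, name))
    return max(dated)[1] if dated else None
```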

4. Video Transcripts

Many of the videos have transcripts, and I’ve incorporated this data into the viz as little word-bursts when the user mouses over the bubble for a video.

If there’s a transcript of the video, I save it as a text file and process it with a Python NLP script that I run from the command line. I’m planning to write more about this another time.

I keep all of the transcript files in a single folder, and the script loops over them and generates a JSON file with the resulting analysis: the most common words in each transcript. These are used in the visualization’s mouse-triggered word-bursts.
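A minimal sketch of that kind of loop, using only the standard library, is below. It is not the actual script. The tokenizing regex, the tiny stop-word list, and the output shape (file name mapped to word and count pairs) are all assumptions made for illustration.

```python
import json
import re
from collections import Counter
from pathlib import Path

# A tiny illustrative stop-word list; a real script would likely use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "it", "i", "you", "that"}


def top_words(text, n=10):
    """Return the n most common non-stop-words in text as [word, count] pairs."""
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [[word, count] for word, count in counts.most_common(n)]


def summarize_folder(folder, out_path, n=10):
    """Write a JSON file mapping each transcript file name to its top words."""
    summary = {}
    for path in sorted(Path(folder).glob("*.txt")):
        summary[path.name] = top_words(path.read_text(encoding="utf-8"), n)
    Path(out_path).write_text(json.dumps(summary, indent=2), encoding="utf-8")
    return summary
```

Keying the JSON by transcript file name keeps it easy to look up the words for a given bubble.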

If there isn’t a transcript of the new video, I enter “none” as the “transcript” for that video in the spreadsheet. In the JavaScript code for the visualization, I check whether there is a transcript, and if there isn’t one, I use “Oops! No transcript yet” in the word-burst.

A video that doesn’t have a transcript when I add it to my data set might get one later. So when a new video is released, I go back and check whether any transcripts have been added since I last looked.

5. Default annotation

In the viz, I have a little annotation that displays the title and some info on the video, and a line connects it to the bubble for that video. By default, I set the outer-left bubble as the one it points to, because I think it looks better than pointing to a random bubble. Yes, I could write code to select the farthest-left node… someday. If I get stuck on every little detail of a playful project, I’ll never get to the fun parts.

Every time I add a new video to the data set, the arrangement of bubbles in the chart changes a bit. A bubble that was on the edge of the cluster may end up in the middle. So I need to update which bubble the viz automatically shows as the annotated bubble. In addition to being positioned on the outer left of the cluster, I also want the default bubble to have some extra data (a transcript).

So, when I’ve updated all the data, I test the viz in the browser with a local server and look for the bubble that is best positioned and also has a transcript. The transcript matters because I use the transcript file name as an id for the bubble in my code.

→ In viz JavaScript: update the “first_transcript” variable to the transcript text file for the selected video.
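The “someday” version of this step comes down to a small selection rule: among bubbles that have a transcript, take the one with the smallest x position. The sketch below shows that logic in Python. It assumes each bubble is a record with hypothetical ‘x’ and ‘transcript’ fields, and in the viz itself it would have to be written in JavaScript and run after the layout has settled.

```python
def pick_default_bubble(nodes):
    """Return the transcript file name of the left-most bubble that has one.

    Each node is a dict with an 'x' position and a 'transcript' value,
    where "none" marks a video without a transcript.
    """
    candidates = [n for n in nodes if n.get("transcript", "none") != "none"]
    if not candidates:
        return None
    return min(candidates, key=lambda n: n["x"])["transcript"]
```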

6. What’s next?

…Coming soon.

Written by

Data Visualization Consultant. Generative and Data Artist. Creative Coder. Founder of GalaxyGoo. http://kristinhenry.github.io/
