Tracking the Advertisers: ‘ScreenDatify’ Data Collection and Project Evolution
Ads constantly track us, and advertisers learn everything they can about us. What if I tracked the ads they showed me? What would I find? I started collecting data in January 2019.
It all started while I was watching Samantha Bee and Randy Rainbow comedy videos on YouTube and the advertisements seemed really unusual for the content of the videos. At first, I thought I was imagining how often I was seeing these ads. So, I decided to start keeping track of them. As I started capturing data, I saw that I needed to build some tools to help with this. The ‘ScreenDatify’ project grew out of those initial observations.
As the project grew and evolved, it became obvious that it could be useful to anyone interested in tracking advertisements. Journalists, in particular, might use these tools to track political advertising in the coming campaign cycle.
ScreenDatify? What is that?
I wanted to track the ads I was shown on YouTube. Since they appear on my browser screen, and I was collecting data about them, ‘ScreenDatify’ seemed like a reasonable project name, at least to start with. It became the name of the web extension I’ll describe later.
Why am I doing this with screenshots?
I’ve been using youtube-dl to pull other data from YouTube. It’s a command-line tool that lets users download videos and metadata from YouTube and other video services. I wrote a little Python script that runs the tool for me. This saves me a lot of command-line typing and lets me format and save the returned data however I want. If there’s interest, I’ll write more about it later.
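A wrapper like that can be quite short. Here’s a minimal sketch of the idea, not my actual script: the function names, the metadata fields kept, and the `metadata/` output directory are all assumptions for illustration. The only youtube-dl-specific piece is its real `-j` flag, which prints a video’s metadata as JSON without downloading anything.

```python
import json
import subprocess
from pathlib import Path

def build_command(video_id):
    """Argv list asking youtube-dl to print a video's metadata as JSON."""
    # -j dumps metadata as a single JSON object without downloading the video
    return ["youtube-dl", "-j", f"https://www.youtube.com/watch?v={video_id}"]

def select_fields(info, fields=("id", "title", "uploader", "upload_date")):
    """Keep only the metadata fields we care about (missing ones become None)."""
    return {k: info.get(k) for k in fields}

def fetch_video_metadata(video_id, out_dir="metadata"):
    """Run youtube-dl and save the selected metadata as <video_id>.json."""
    result = subprocess.run(build_command(video_id),
                            capture_output=True, text=True, check=True)
    record = select_fields(json.loads(result.stdout))
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    (out / f"{video_id}.json").write_text(json.dumps(record, indent=2))
    return record
```

Splitting out `build_command` and `select_fields` keeps the formatting logic separate from the subprocess call, so the saved fields are easy to change.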
My little Python script works great! But when I tried to collect advertiser data, I found that YouTube doesn’t return ad data through the API that youtube-dl uses.
Around the same time, I saw that Facebook had blocked ProPublica’s ad-tracking software by changing how they deliver the code that loads ads into their pages. I’ll have to find a screenshot of what they did; that’s going on my to-do list. What Facebook did was wild, and really, really bad code.
So I started taking screenshots. These can’t be blocked without refusing to display the website at all. At first, I did this manually using my laptop’s ‘PrtSc’ key. Super easy. However, these screenshots required a lot of work to use as data: cropping each one to the browser window, then adding the video id and date to the filename. I did this for a few weeks and noticed some trends in the captured ads that seemed worth more effort to track. For example, I was seeing a lot of Trump ads demanding that we ‘Stand with ICE’ and asking ‘Do you support Trump?’ on Randy Rainbow videos. For context, Randy Rainbow is a video satirist, a flamboyantly gay and campy New York comedian who writes and performs song parodies criticizing Trump and others. His fans would not reasonably be thought to be fans of Trump.
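The renaming half of that chore is easy to script. A small sketch of one way to build such filenames — the `videoid_YYYYMMDD-HHMMSS.png` naming scheme here is my illustration, not necessarily the format I settled on:

```python
from datetime import datetime

def screenshot_name(video_id, when=None, ext="png"):
    """Build a filename that encodes the video id and capture time,
    e.g. dQw4w9WgXcQ_20190115-093000.png."""
    # Default to "now" when no capture time is supplied
    when = when or datetime.now()
    return f"{video_id}_{when:%Y%m%d-%H%M%S}.{ext}"
```

Encoding both facts in the filename means every screenshot stays self-describing even when it’s moved between folders.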
Back to the data… I started developing browser extensions to collect screenshots tagged with date and video-id data, and ended up with two versions. The first saves a single screenshot when I click a browser button while watching videos. The second is also triggered with a browser button, but then automatically loads videos from a list every few minutes. Once a video loads, it captures and saves three screenshots (named with video id and date, like the single-shot version), with a few seconds between each so there is time to save one before the next is captured.
I’ve been collecting data like this since January 2019, and I’m still collecting and processing it. The screenshots carry two pieces of data that code can read easily: the video id and the date-time of the capture.
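Reading those two facts back out of a batch of screenshots is a one-regex job. This sketch assumes a `videoid_YYYYMMDD-HHMMSS.png` naming pattern (my illustration — the post doesn’t pin down the exact format):

```python
import re
from datetime import datetime

# Assumed filename pattern: <video-id>_<YYYYMMDD-HHMMSS>.png
NAME_RE = re.compile(r"^(?P<vid>[\w-]+)_(?P<ts>\d{8}-\d{6})\.png$")

def parse_screenshot_name(filename):
    """Return (video_id, capture_datetime), or None if the name doesn't match."""
    m = NAME_RE.match(filename)
    if m is None:
        return None
    return m.group("vid"), datetime.strptime(m.group("ts"), "%Y%m%d-%H%M%S")
```

Anything that doesn’t match the pattern comes back as `None`, so stray files in the screenshot folder are skipped rather than crashing a processing run.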
But the data I’m aiming for also includes the ads in these screenshots. What type are they: banners, text, something else? What is the ad for? Trump’s ego? ICE? Shoes? Whose name is on the ad: right-wing political groups, shoe brands, insurance companies? This requires annotation of the original visual data.
For now, I’m working with some Python scripts I’ve developed to assist and speed annotation at a small-ish scale. Eventually, when I have massive amounts of data, I plan to use image analysis and clustering algorithms to assist and speed this annotation.
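To give a flavor of what a small-scale annotation helper can look like, here is a hypothetical sketch, not my actual script: it walks a folder of screenshots, prompts for an ad type and advertiser name, and appends each answer to a CSV. The column names and prompts are assumptions for illustration.

```python
import csv
from pathlib import Path

FIELDS = ["filename", "ad_type", "advertiser"]

def record_annotation(csv_path, filename, ad_type, advertiser):
    """Append one annotation row, writing a header if the file is new."""
    path = Path(csv_path)
    is_new = not path.exists()
    with path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow({"filename": filename, "ad_type": ad_type,
                         "advertiser": advertiser})

def annotate_all(image_dir, csv_path):
    """Interactive loop: prompt for labels on each screenshot in a folder."""
    for img in sorted(Path(image_dir).glob("*.png")):
        ad_type = input(f"{img.name} - ad type (banner/text/video/none): ")
        advertiser = input("advertiser name: ")
        record_annotation(csv_path, img.name, ad_type, advertiser)
```

Appending one row at a time means progress survives if the session is interrupted, and the resulting CSV is ready for counting and clustering later.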
What future plans do I have? If I get enough support and interest, I’ll start sharing my data collection tools and develop more user-friendly features for them. Hopefully, this will enable more crowdsourcing for data collection and annotation. When I get enough data, I’ll start developing some human-directed AI tools to assist and speed ad identification and labeling. I’m thinking that a blend of human labeling and machine learning will be the way to go on this.
If you’d like to participate or otherwise support this project, comment or reach out on the Patreon version of this post. Why through Patreon? There are tools there to help me manage access to my posts and moderate comments.