My tools for analysing YouTube Super Chats
Over the past months I’ve been posting analyses (like this one) on the r/HoloStatistics subreddit about the contents, the total monetary value & the number of Super Chat donations sent through the YouTube live stream chats of selected Virtual YouTubers, together with the geographic location of the donors (inferred from the currency). Here I’ll explain the technology and methods behind the data collection and analyses.
The Python scripts used for this can be downloaded from my GitHub.
To collect the data I use two Python scripts and a PostgreSQL database. The first one, channel_monitor.py, polls the YouTube Data API every ~75 minutes (when monitoring 5 channels; with more channels you have to wait longer) for planned streams from the channels you specify. Once it detects one, it starts recording it using async_record_running_livestream_superchats.py. Once a stream ends, it re-records the super chats from the archive of the live stream to retrieve super chats that may have been missed during the broadcast. This is necessary because I’ve observed that some super chats which are present in the archived live stream chat were not included in the chat at the time of the live broadcast.
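To give an idea of what that polling step involves, here’s a minimal sketch. This is not the actual channel_monitor.py; the API key and channel IDs are placeholders, and the real script does considerably more bookkeeping:

```python
# Minimal sketch of the polling step, not the actual channel_monitor.py.
# Requires google-api-python-client; API key and channel IDs are placeholders.
import time
from googleapiclient.discovery import build

API_KEY = "YOUR_API_KEY"
CHANNEL_IDS = ["UCxxxxxxxxxxxxxxxxxxxxxx"]  # channels to monitor
POLL_INTERVAL = 75 * 60                     # seconds between polling rounds

youtube = build("youtube", "v3", developerKey=API_KEY)

while True:
    for channel_id in CHANNEL_IDS:
        # search.list costs 100 quota points per call
        response = youtube.search().list(
            part="snippet",
            channelId=channel_id,
            eventType="upcoming",  # planned live streams only
            type="video",
        ).execute()
        for item in response["items"]:
            video_id = item["id"]["videoId"]
            print(f"planned stream found: {video_id}")
            # here the monitor would hand the video ID over to the
            # recording script (async_record_running_livestream_superchats.py)
    time.sleep(POLL_INTERVAL)
```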
The 75-minute pause is a downside of having to use the YouTube Data API, since it increases the chance of the script missing a spontaneously posted stream. Every request costs API quota points, of which I only have 10000 per API key per day, and each lookup of planned live streams costs 100 points per channel. With 5 channels that is 500 points per polling round, so the daily quota only allows about 20 rounds, i.e. one every ~72 minutes. To address this, I am currently working on a version that uses the Holodex API instead to fetch planned streams, which allows for much shorter pauses between requests.
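For reference, the arithmetic behind the ~75 minutes:

```python
# Quota arithmetic behind the ~75 minute pause (values from the text)
DAILY_QUOTA = 10_000   # API points per key per day
LOOKUP_COST = 100      # points per planned-stream lookup, per channel
CHANNELS = 5

points_per_round = CHANNELS * LOOKUP_COST         # 500 points
rounds_per_day = DAILY_QUOTA // points_per_round  # 20 rounds
print(24 * 60 / rounds_per_day)                   # 72.0 minutes, padded to ~75
```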
The script relies on taizan-hokuto’s Python module pytchat to fetch the chat data. Unfortunately, they seem to have stopped maintaining it. Since it’s open source, I started modifying it to better fit my needs, and I’ll occasionally update it when the YouTube live stream chat changes.
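For illustration, fetching super chats with pytchat looks roughly like this (a sketch; the actual recorder runs asynchronously and writes everything to the database instead of printing):

```python
# Rough sketch of pulling super chats from a live chat with pytchat;
# the real async_record_running_livestream_superchats.py runs
# asynchronously and persists the results instead of printing them.
import pytchat

chat = pytchat.create(video_id="VIDEO_ID")  # placeholder video ID
while chat.is_alive():
    for c in chat.get().sync_items():
        if c.type == "superChat":
            print(c.author.name, c.amountString, c.currency, c.message)
```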
How is the data stored, and which data gets collected? Currently, the scripts store the data in two places. Super chats, metadata about the live stream (like channel, title, start time & end time), the streamer and the super-chatting viewers are stored in a PostgreSQL database; super chats are saved there as soon as they arrive in the chat. The scripts also create a folder on your drive for each YouTube channel they come across, using the channel ID as the folder name. Inside these folders the script saves the super chat logs and some metadata about the stream (like title, channel, start & end time, and the total sum of donations split by currency). However, these files only become available after the re-recording process finishes.
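To make the database side more concrete, here is what storing a single super chat could look like. The table and column names below are illustrative; the actual schema is defined in db_structure.sql:

```python
# Illustrative only: table and column names are made up here,
# the real schema lives in db_structure.sql.
import psycopg2

conn = psycopg2.connect("dbname=superchats user=postgres")
with conn, conn.cursor() as cur:
    cur.execute(
        """
        INSERT INTO superchats (video_id, author_channel_id, currency,
                                value, message, sent_at)
        VALUES (%s, %s, %s, %s, %s, %s)
        """,
        ("VIDEO_ID", "AUTHOR_CHANNEL_ID", "JPY", 1000,
         "example message", "2021-01-01T12:00:00+00:00"),
    )
conn.close()
```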
Important: the scripts also save membership anniversary messages. They are treated like super chats, using the imaginary currency MON (short for months), with the membership duration in months saved as the donation value.
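In other words (a hypothetical helper just to show the mapping):

```python
# Hypothetical helper illustrating the mapping: a membership anniversary
# message is stored in the same shape as a super chat, with the imaginary
# currency "MON" and the membership duration in months as the value.
def membership_as_superchat(author_channel_id, months, message, sent_at):
    return {
        "author_channel_id": author_channel_id,
        "currency": "MON",  # MON = months of membership
        "value": months,    # e.g. a 12-month anniversary -> value 12
        "message": message,
        "sent_at": sent_at,
    }
```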
Included in my GitHub repository are some tools to visualise the collected data, among them my script to generate a word cloud from the super chat & membership anniversary messages. To create the analyses I post on Reddit, I rely on my Postgres data & Google Spreadsheets: I query my Postgres database to generate a summary of the donations for a stream, split by currency. The summary then gets manually copied & pasted into a Google Spreadsheet to generate the nice graphs and to convert the donations into USD and EUR.
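The per-stream summary query looks something like this (again with illustrative table and column names; see db_structure.sql for the real ones):

```python
# Sketch of the per-stream, per-currency donation summary;
# table and column names are illustrative.
import psycopg2

conn = psycopg2.connect("dbname=superchats user=postgres")
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT currency, COUNT(*) AS donations, SUM(value) AS total
        FROM superchats
        WHERE video_id = %s
        GROUP BY currency
        ORDER BY total DESC
        """,
        ("VIDEO_ID",),
    )
    for currency, donations, total in cur.fetchall():
        print(currency, donations, total)
conn.close()
```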
More information on how to use my scripts, and on which data is stored and how, can be found in the README and the db_structure.sql in my GitHub repository.