YouTube Caption Games
YouTube was meant to be a video hosting site, not an interactive platform. Despite this, YouTubers have long exploited aspects of the YouTube player to create interactive experiences. Those of you who used YouTube back in the day may remember annotations, a feature that let creators add comments and corrections to their videos after uploading them. Crucially, annotations could contain links to other YouTube videos, and some creators took advantage of this to make "choose your own adventure" games on YouTube. In 2019, annotations were removed from all YouTube videos, marking the end of an era. That hasn't stopped creators from repurposing other existing features for things they were never designed to do.
Captions, unlike annotations, aren't meant to be interacted with. However, the caption formats are surprisingly flexible, and some YouTubers have taken advantage of this. The YouTube player also lets you drag and drop captions, in case a caption is blocking part of the video. If you're creative, you can put this to use. For example, one user has created a drag-and-drop "build your own emoji" game by providing captions with emojis that you can move on top of a static image. That's already an unexpected use of captions, but some creators have taken it to the limit.
This video by Firerama seems to defy explanation: an interactive YouTube game. There's a paintbrush emoji in the middle, and on the right side you can see some colors with an arrow next to them. If you drag the paintbrush emoji around, little balls appear trailing behind it, as if it were leaving ink. Each ball takes the color the arrow is pointing to at that moment. The arrow moves through the colors in sequence, and you can only paint with a color while the arrow is pointing at it. You can also pause the video and move the little balls around if you don't like where they landed.
We know off the bat that captions are involved because the creator tells us to turn on captions, but this goes beyond any use of captions I've seen. I was so charmed by this that I wanted to find out how they did it. How is it possible to do something like this? Surely there can't be any scripting involved, so there must be some trick they're using to make this happen. Let's investigate and see how far we can get.
Investigation
If you click 'Show transcript', you'll see which items are captions. The paintbrush is a caption. The little balls are captions. Even the colors in the 'palette' are captions. Since YouTube lets you drag captions, every one of these can be repositioned. The transcript also shows that new balls (captions) are added to the video every few seconds, which is why you can 'paint' with them. They appear at the bottom of the screen, where captions are placed by default. So the "painting" comes from a continual stream of balls that appear and are repositioned to wherever your dragged caption is. When the arrow in the video moves to the next color, the newly added captions take on that color. The arrow itself doesn't do anything; it's just a visual cue telling the viewer which color is coming.
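To make the timing concrete, here is a minimal Python sketch of the schedule described above. The spawn interval, the arrow's step time, and the palette are hypothetical values (the real ones in Firerama's video aren't known); the point is only that each ball's color depends on which palette entry the arrow is on at the moment that ball appears, not on anything the viewer does.

```python
from dataclasses import dataclass

# Hypothetical values: the real spawn rate, arrow timing and palette of
# Firerama's video are not known.
SPAWN_INTERVAL_MS = 500   # a new ball caption appears this often
ARROW_STEP_MS = 3000      # the arrow moves to the next color this often
PALETTE = ["red", "yellow", "green", "blue"]


@dataclass
class Spawn:
    start_ms: int
    color: str


def spawn_schedule(duration_ms):
    """List every ball caption that appears before duration_ms, with the
    palette color the arrow is on at the moment it appears."""
    schedule = []
    for t in range(0, duration_ms, SPAWN_INTERVAL_MS):
        slot = (t // ARROW_STEP_MS) % len(PALETTE)  # arrow position at time t
        schedule.append(Spawn(t, PALETTE[slot]))
    return schedule


for spawn in spawn_schedule(7000):
    print(spawn.start_ms, spawn.color)
```

Note that nothing in this schedule says where a ball ends up; in the video, that part is decided entirely by the drag.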
Here's the question: why does dragging a caption cause newly added captions to appear at its position?
You don't need to drag the brush: dragging ANY caption (including the palette colors on the right or the balls you have already painted) will cause the captions currently appearing at the bottom to move to the dragged caption's position. So there's nothing special about the paintbrush emoji; what matters is the act of dragging a caption. The question, then, is how dragging one caption affects another. How do you make one caption's position depend on another's, but only while a caption is being dragged?
If you inspect the video using dev tools, you can see that each caption is a series of nested spans. Curiously, you can also see that the captions have colors and fills even though you cannot do this with standard YouTube captions. This gives us a hint: they are probably not using YouTube’s caption maker but uploading a format that supports more styling options.
Let's drag a caption to see whether anything changes in the styling. While you drag a caption, the element with the class 'caption-window' gains a new class, 'ytp-dragging', and some inline styling is added to it:
top: 40.444%; left: 40.227%; width: 38%; height: 25px; margin-left: -19px; margin-top: -12.5px;
If you can catch one of the new captions as it is added, you'll see it has the same top and left. My theory: new captions are added continuously, and each one somehow copies the position of whatever element has the ytp-dragging class. But how could that work? There shouldn't be any way to find the currently dragged caption's position, or to set a new caption's position to another caption's. We're going to have to look deeper into how YouTube subtitles work.
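If you want to check the top/left observation yourself, you can copy the style attribute of the dragged window and of a freshly added caption out of dev tools and compare them. A small Python helper for that is sketched below; the second style string is a hypothetical example, not one captured from the video.

```python
def parse_inline_style(style):
    """Turn an inline CSS string like 'top: 40.444%; left: 40.227%' into a dict."""
    props = {}
    for decl in style.split(";"):
        if ":" not in decl:
            continue  # skip empty segments, e.g. after a trailing semicolon
        name, value = decl.split(":", 1)
        props[name.strip()] = value.strip()
    return props


def same_position(style_a, style_b):
    """True if both styles set top and left, and the values are identical."""
    a, b = parse_inline_style(style_a), parse_inline_style(style_b)
    return all(a.get(k) is not None and a.get(k) == b.get(k) for k in ("top", "left"))


# Style of the window being dragged, as shown in dev tools.
dragged = "top: 40.444%; left: 40.227%; width: 38%; height: 25px; margin-left: -19px; margin-top: -12.5px;"
# Hypothetical style copied from a caption that appeared mid-drag.
new_caption = "top: 40.444%; left: 40.227%;"

print(same_position(dragged, new_caption))
```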
YouTube subtitles
Some googling shows that subtitles can be styled and even positioned:
“In addition, positioning can be used so as to not obscure important information on the screen such as text, or furthering identification of speakers. Captionfy is working on a positioning feature at this moment.”
Some further reading on the formats: https://web.archive.org/web/20230726193814/https://jacobstar.medium.com/the-first-complete-guide-to-youtube-captions-f886e06f7d9d
https://stackoverflow.com/questions/46494279/what-is-this-subtitle-format-called
https://www.reddit.com/r/youtubedl/comments/197jvmj/any1_know_how_edit_srv3_subtitles_without/
So it looks like the creator is using srv3 to do the magic. Luckily, we can use yt-dlp to download both the video and the creator's srv3 subs:
yt-dlp https://www.youtube.com/embed/LuangEd48wI --write-sub --sub-format srv3 --sub-lang "en.*"
Open the srv3 files in a text editor and you'll see something interesting. In the captions you can find the paintbrush, the colored balls, and the movable balls (wp id=10).
The colored balls that appear when you drag the cursor have their wp set to 10. The data for this wp appears in the head:
<wp id="10" ap="4"/>
<wp id="3" ap="4" ah="70" av="32"/>
<wp id="4" ap="4" ah="70" av="39"/>
<wp id="5" ap="4" ah="70" av="47"/>
<wp id="6" ap="4" ah="70" av="55"/>
<wp id="7" ap="4" ah="85" av="32"/>
<wp id="8" ap="4" ah="85" av="39"/>
<wp id="9" ap="4" ah="85" av="47"/>
<wp id="1" ap="7" ah="0" av="0"/>
<wp id="2" ap="7" ah="31" av="67"/>
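These are easier to compare once they're parsed. The Python sketch below uses the standard library's xml.etree.ElementTree to read the wp elements above (wrapped in a head element so the fragment is valid XML on its own) and reports which ones specify both ah and av.

```python
import xml.etree.ElementTree as ET

# The <wp> elements copied from the head of the downloaded srv3 file,
# wrapped in a <head> element so the fragment parses on its own.
HEAD = """<head>
<wp id="10" ap="4"/>
<wp id="3" ap="4" ah="70" av="32"/>
<wp id="4" ap="4" ah="70" av="39"/>
<wp id="5" ap="4" ah="70" av="47"/>
<wp id="6" ap="4" ah="70" av="55"/>
<wp id="7" ap="4" ah="85" av="32"/>
<wp id="8" ap="4" ah="85" av="39"/>
<wp id="9" ap="4" ah="85" av="47"/>
<wp id="1" ap="7" ah="0" av="0"/>
<wp id="2" ap="7" ah="31" av="67"/>
</head>"""


def window_positions(xml_text):
    """Map each wp id to its ap/ah/av attributes (None when an attribute is absent)."""
    root = ET.fromstring(xml_text)
    return {
        wp.get("id"): {key: wp.get(key) for key in ("ap", "ah", "av")}
        for wp in root.iter("wp")
    }


windows = window_positions(HEAD)
for wp_id, attrs in windows.items():
    fixed = attrs["ah"] is not None and attrs["av"] is not None
    status = "fixed position" if fixed else "no ah/av"
    print(f"wp {wp_id}: ap={attrs['ap']} ah={attrs['ah']} av={attrs['av']} ({status})")
```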
Here is some information on what these mean:
wp: window point (?). From the guide: "The wp and ws tags are meant for the uppermost point of the caption hierarchy, applying to the window the text is contained in."
ah="#": Align Horizontal (X from left)
av="#": Align Vertical (Y from top)
ap="#": Anchor Point:
0 - Top Left | 1 - Top Center | 2 - Top Right
3 - Center Left | 4 - True Center | 5 - Center Right
6 - Bottom Left | 7 - Bottom Center | 8 - Bottom Right
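Taken together, ah/av and ap describe where a window's box lands. The sketch below assumes ah and av are percentages of the video's width and height that locate the anchor point, and that ap chooses which of the box's nine points sits on it, following the grid above. The function name and the pixel sizes in the example are hypothetical.

```python
def box_top_left(ah, av, ap, box_w, box_h, video_w, video_h):
    """Return the (x, y) pixel position of a caption box's top-left corner.

    Assumptions: ah/av are percentages of the video width/height that give
    the anchor point's location, and ap (0-8) selects which point of the box
    sits on it (0 = top left, 4 = true center, 8 = bottom right).
    """
    if not 0 <= ap <= 8:
        raise ValueError("ap must be between 0 and 8")
    col, row = ap % 3, ap // 3  # col: left/center/right, row: top/center/bottom
    anchor_x = ah * video_w / 100
    anchor_y = av * video_h / 100
    # Shift left by 0, half, or all of the box width (and likewise upward).
    return anchor_x - col * box_w / 2, anchor_y - row * box_h / 2


# wp id 3 (ap=4, ah=70, av=32) in a hypothetical 1280x720 player with a 38x25 px box
print(box_top_left(70, 32, 4, 38, 25, 1280, 720))
```

For ap="4" (true center) the box is shifted up and left by half its size, which is consistent with the margin-top: -12.5px seen on the 25px-tall dragged window earlier.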
Notice that the wp with an id of 10 has an anchor point (ap="4") but no horizontal or vertical position (no ah or av). My speculation: without ah and av, the window is free to move. So perhaps if you are dragging a caption when a new one appears on screen, YouTube simply sets the new caption's ah and av to wherever the dragged caption currently is. The guide suggests that wp positioning is fiddly like that:
You might’ve seen the wp ap="#" attribute pass by, but that’s one you have to master in some respects and give up some control in others. The simple fact is that they can move. The control bar at the bottom pushes up against caption windows that are aligned to the bottom. For embedded videos, the title bar on top pushes down on the top-aligned others. If you’re trying to use positioning of your captions with accuracy, then you should prepare for that.
This sounds like an odd side effect: I can't imagine that setting a new caption's left and top to the currently dragged item's position is intended behavior. Still, it seems likely that this is how the game was made. By exploiting this behavior, you can create the illusion that you are 'painting'.
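To state the speculation precisely, here is a toy Python model of it. This is a hypothesis, not YouTube's actual player logic: windows with explicit ah/av keep their position, while a window without them takes the position of whatever caption is being dragged, falling back to a hypothetical default near the bottom of the screen. Positions are (left %, top %) pairs, so the dragged position below reuses the left and top values from dev tools.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Position = Tuple[float, float]  # (left %, top %)

# Hypothetical default for windows with nowhere else to go: bottom center.
DEFAULT_POSITION: Position = (50.0, 90.0)


@dataclass
class Window:
    wp_id: str
    ah: Optional[float] = None  # None when the <wp> has no ah attribute
    av: Optional[float] = None  # None when the <wp> has no av attribute


def position_for_new_caption(window: Window, dragged: Optional[Position]) -> Position:
    """Where a caption would appear under the speculated behavior (not YouTube's real logic)."""
    if window.ah is not None and window.av is not None:
        return (window.ah, window.av)  # explicitly positioned: stays put
    if dragged is not None:
        return dragged  # unanchored: inherit the dragged caption's position
    return DEFAULT_POSITION  # nothing being dragged: default spot at the bottom


ball = Window("10")                 # like wp id 10: no ah/av
swatch = Window("3", ah=70, av=32)  # like wp id 3: fixed position
print(position_for_new_caption(ball, (40.227, 40.444)))
print(position_for_new_caption(ball, None))
print(position_for_new_caption(swatch, (40.227, 40.444)))
```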
I tried to recreate it myself by (privately) reuploading Firerama's video with the downloaded srv3 subs. Unfortunately, it did not work. I suspect that some of the information in the subs is dropped by YouTube: when I downloaded the subs back from my private copy of Firerama's video, they were not the same as the original, which suggests that some information is lost somewhere in the upload/download process. I also tried editing the srv3 file by hand to recreate the effect, but to no avail. As such, I can't confirm that the wp is the key, but it does seem to be an important part of this effect.
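For anyone who wants to experiment further, here is a minimal Python sketch that writes an srv3 file in the same spirit: one wp with no ah/av (copying wp id 10), a pen per color, and a stream of ball captions. The element and attribute names (timedtext with format="3", pen with fc, and p with t, d, wp and p) are my understanding of the format from the guides linked earlier, and the ball glyph and colors are hypothetical. Given the failed reupload, there is no guarantee YouTube will honor a file like this.

```python
import xml.etree.ElementTree as ET


def build_srv3(balls, duration_ms, colors):
    """Return a minimal srv3 document as a string.

    balls: list of (start_ms, color) pairs, one per ball caption.
    duration_ms: how long each ball stays on screen.
    colors: hex colors; each gets its own pen, numbered from 1.
    """
    pen_ids = {color: str(i) for i, color in enumerate(colors, start=1)}
    root = ET.Element("timedtext", format="3")
    head = ET.SubElement(root, "head")
    for color, pen_id in pen_ids.items():
        ET.SubElement(head, "pen", id=pen_id, fc=color)
    # Like Firerama's wp id 10: an anchor point but no ah/av.
    ET.SubElement(head, "wp", id="10", ap="4")
    body = ET.SubElement(root, "body")
    for start_ms, color in balls:
        p = ET.SubElement(body, "p", t=str(start_ms), d=str(duration_ms),
                          wp="10", p=pen_ids[color])
        p.text = "\u25cf"  # hypothetical "ball" glyph: a filled circle
    return ET.tostring(root, encoding="unicode")


srv3 = build_srv3([(0, "#FF0000"), (500, "#FF0000"), (1000, "#0000FF")],
                  duration_ms=5000, colors=["#FF0000", "#0000FF"])
print(srv3)
```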