Tag: YouTube


I made youtube-dl faster for archivists…and solved a worst-case programming problem elegantly in the process

Update: there is an addendum at the end of this article; I mention it because yes, in the end, I switched over to Python sets. I don’t want any Python experts cringing too hard, after all. Welcome to the joys of a C programmer adapting to Python.

For those who haven't heard of it, youtube-dlc is a much more actively maintained fork of the venerable but stagnant youtube-dl project, announced on the DataHoarder subreddit. I have been archiving YouTube channels for many months now, trying to make sure that the exponential boost in censorship leading up to the 2020 U.S. Presidential election doesn't cause important discussions to be lost forever.

Unfortunately, this process has led me to have a youtube-dl archive file containing well over 70,000 entries, and an otherwise minor performance flaw in the software had become a catastrophic slowdown for me. (Side note: a youtube-dl archive file contains a list of videos that were already completely downloaded and that lets you prevent re-downloading things you’ve already snagged.) Even on a brand new Ryzen 7 3700X, scanning the archive file for each individual video would sometimes progress at only a few rejections per second, which is very bad when several of the channels I archive have video counts in the multiple thousands. The computer would often spend multiple minutes just deciding not to download all of the videos on a channel, and that got painful to watch. That’s time that could be spent checking another channel or downloading a video.

When youtube-dlc popped up and offered goodwill to the people who were treated less than favorably by the youtube-dl project maintainers, I realized that I had a chance to get this fixed. I opened an issue for it, but it became clear that the budding fork didn't have the resources to dedicate to the necessary improvement; the existing code technically worked, and performance wasn't as bad for people using much smaller archive files. The youtube-dlc maintainer (blackjack4494) and I both quickly identified the chunk of code behind the problem, so I decided to take it on myself.

Discussion about the code causing the problem
There’s our problem!

The troublesome code is cut off in the discussion screenshot, so here’s a better look at it:

Problematic code from youtube-dl
Good programmers probably know why this is slow just from this image.

The code outlined in the red box above is opening the archive file for reading, consuming it line by line, and comparing each line read from the archive to the line associated with the candidate video to be downloaded. If a matching line exists, the video is in the archive and shouldn't be downloaded, so in_download_archive() returns True. This works and is a very simple way to implement the archive file feature, but there's a problem: in_download_archive() is invoked for every single possible video download, so the file is opened and read repeatedly.
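In rough form, the pattern amounts to something like this (a simplified sketch with assumed names, not the literal youtube-dl source):

```python
# Simplified sketch of the original pattern (assumed names, not the literal
# youtube-dl source): the archive file is re-opened and re-scanned from the
# top for every single download candidate.
def in_download_archive(archive_path, archive_id):
    try:
        with open(archive_path, 'r', encoding='utf-8') as archive:
            for line in archive:
                if line.strip() == archive_id:
                    return True        # already downloaded, skip it
    except FileNotFoundError:
        pass                           # no archive file yet, so nothing is archived
    return False
```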

Some programmers may see this and ask “why is this a problem? The archive file is read for every video, so there’s a good chance it will remain in the OS disk cache, so reading the file over and over becomes an in-memory operation after the first time it’s read.” Given that my archive of over 70,000 lines is only 1.6 MiB in size, that seems to make some sense. What is being missed in all of this is the massive overhead of reading and processing a file, especially in a high-level language like Python.


An aside for programmers not so familiar with more "bare metal" programming languages: in C, you can use a lot of low-level trickery to work with raw file data more quickly. If I were implementing this archive file check in C (some low-level steps will be omitted here), I'd repeatedly scan the buffer for a newline, reverse the scan direction until extra white space was jumped over (doing what strip() does in Python), convert the last byte of newline/white space after the text to a null byte, and do a strcmp(buffer, video_id) to check for a match. This still incurs all of the OS and call overhead of buffered file reads, but it uses a minimal amount of memory and performs extremely fast comparisons directly on the file data.

In a language like Python, Java, or PHP that does a lot of abstraction and memory management work for us, a lot more CPU-wasting activity goes on under the hood to read a file line by line. Sure, Python does it in 4 lines of code where C would take more like 15-20 lines, but unpack what Python is doing for you within those lines of code:

  1. Allocating several variables
  2. Opening the file
  3. Allocating a read buffer
  4. Reading the file into the buffer
  5. Scanning the buffer for newlines
  6. Copying each line into the “line” variable one at a time
  7. Trimming leading and trailing white space on the line, which means:
    • Temporarily allocating another buffer to strip() into
    • Copying the string being stripped into that buffer…
    • …while checking for and skipping the white space…
    • …and copying the string back out of that buffer
  8. Finally doing the string comparison
  9. All while maintaining internal reference counts for every allocated item and periodically checking for garbage collection opportunities.

Multiply the above by 2,000 video candidates and run it against an archive file with 70,000 entries, and you can easily end up with steps 6, 7, and 8 being executed almost 140,000,000 times if the matching strings are at the end of the archive. Python and other high-level languages make coding this stuff a lot easier than C, but they also make it dead simple to slaughter your program's performance, since so many low-level details are hidden from the programmer.


I immediately recognized that the way to go was to read the archive file one time at program startup rather than reading it over and over for every single download candidate. I also recognized that this is exactly the kind of lookup a binary search tree (BST) is designed to speed up. Thus, my plan was to do the same line-by-line read-and-strip as the existing code, but only once, storing each processed line in a BST; then, instead of reading the file within in_download_archive(), I'd search the BST for the string. The neat thing about a BST is that if it were perfectly balanced, 70,000 entries would only be 17 levels deep, meaning each string check would perform at most 17 string comparisons, a much better number than the 70,000-comparison worst case of scanning a flat text file line by line.

So, I set out to make it happen, and my first commit in pursuit of the goal was dropped.

Binary search tree Python code
The workhorse code; a classic binary search tree
Archive preload code that feeds the binary search tree
Archive preload code that feeds the binary search tree
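Boiled down, that first version amounted to something like this (a simplified sketch, not the verbatim commit):

```python
# Simplified sketch of the idea behind the first commit (not the verbatim
# code): a plain binary search tree holding the stripped archive lines.
class ArchiveTreeNode:
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None

def tree_insert(node, value):
    # Recursive insert; this recursion is what blows up shortly.
    if value < node.value:
        if node.left is None:
            node.left = ArchiveTreeNode(value)
        else:
            tree_insert(node.left, value)
    elif value > node.value:
        if node.right is None:
            node.right = ArchiveTreeNode(value)
        else:
            tree_insert(node.right, value)
    # an equal value is already in the tree, so nothing needs to be done

def tree_contains(node, value):
    # Walk down the tree: at most one comparison per level instead of one
    # comparison per line of the archive file.
    while node is not None:
        if value == node.value:
            return True
        node = node.left if value < node.value else node.right
    return False
```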

This actually worked nicely!…until I tried to use my huge archive instead of a tiny test archive file, and then I was hit with the dreaded “RuntimeError: maximum recursion depth exceeded while calling a Python object.” Okay, so the recursion is going way too deep, right? Let’s remove it…and thus, my second commit dropped (red is removed, green is added).

Python code change from recursion to a loop
Let’s do it in a loop instead!
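In sketch form, the insert converted to a loop looks something like this (again simplified, not the verbatim diff):

```python
# Simplified sketch of the loop-based insert (not the verbatim diff), using
# the ArchiveTreeNode class sketched above: walking down the tree iteratively
# sidesteps Python's recursion depth limit no matter how deep the tree gets.
def tree_insert(root, value):
    node = root
    while True:
        if value == node.value:
            return                              # already in the tree
        if value < node.value:
            if node.left is None:
                node.left = ArchiveTreeNode(value)
                return
            node = node.left
        else:
            if node.right is None:
                node.right = ArchiveTreeNode(value)
                return
            node = node.right
```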

With the recursion swapped out for a loop, the error was gone…but a new problem surfaced, and unfortunately, it was a problem that comes with no helpful error messages. When fed my archive file, the program seemed to basically just…hang. Performance was so terrible that I thought the program had completely hung. I put some debug print statements in the code to see what was going on, and immediately noticed that every insert operation would make the decision to GO RIGHT when searching the tree for the correct place to add the next string. There was not one single LEFT decision in the entire flood of debug output. That’s when I finally realized the true horror that I had stepped into.

I had sorted my archive…which KILLED the performance.

Binary search trees are great for a lot of data, but there is a rare but very real worst-case scenario where the tree ends up becoming nothing more than a bloated linked list. This doesn’t happen with random or unordered data, but it often happens when the tree is populated in-order with data that is already sorted. Every line in my sorted file was “greater than” the line before it, so when fed my sorted archive, the tree became an overly complex linked list. The good news is that most people will not have a sorted archive file because of the randomness of the video strings, but the bad news is that I had sorted mine because it boosted overall archive checking performance. (Since new video downloads are appended to the archive, the most recent stuff is always at the end, meaning rejecting those newer downloads under the original archive code always required the worst-case amount of time.) It is entirely possible that someone else would sort their archive at some point, so I had accidentally set myself up in the worst-case scenario and I couldn’t just ignore it and assume no one else made the same mistake. I had to fix it.
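Using the sketch helpers from earlier, you can watch the degeneration happen with just a handful of pre-sorted values:

```python
# Illustration using the sketch helpers above: pre-sorted input never takes
# the left branch, so the "tree" is really a linked list hanging off the root.
root = ArchiveTreeNode('aaa')
for video_id in ['bbb', 'ccc', 'ddd', 'eee']:
    tree_insert(root, video_id)    # every single insert lands to the right
# root.right.right.right.right.value == 'eee', and every .left is None
```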

I got about three-quarters of the way through changing over to a weighted BST before realizing that it would not improve the load times and would only help the checking behavior later. That code was scrapped without a commit. I had previously added weighted trees with rebalancing to jdupes, but removed the feature when all tests over time showed it didn't improve performance.

How do you optimally feed sorted data into a BST? Ideally, you'd plop the data into a list, add the middle piece of data, then continue taking midpoints to the left and right alternately until you ran out of data to add (see this excellent tutorial for a much better explanation with pictures). Unfortunately, this requires a lot of bookkeeping: you have to track which pieces you have already added, and the number of sections to track grows exponentially. Getting that right would have taken some time. I decided to try something a little simpler: split the list into halves, then alternate between consuming each half from both ends until the pointers met. Thus, the third commit dropped.

Python code to attempt to add a sorted list to a binary tree
This didn’t work.

This commit had two problems: the pointer checks meant two list elements never got copied, and the improvement in behavior was insufficient. It was faster, but we're talking about an improvement that can be described as "it seemed to just hang before, but now it completes before I get mad and hit CTRL-C to abort." Unacceptable, to be sure. I stepped away from the computer for a while and thought about the problem. The data ideally needs to be more random than sorted. Is there an easier way? Then it hit me.

I used ‘sort’ to randomize my archive contents. Problem: gone.

All I needed to do was randomize the list order, then add the randomized list the easy way (in-order). Could it really be that simple?! Yes, yes it can! And so it was that the fourth commit dropped (red is removed, green is added).

Python code with an elegant solution
Surprisingly simple solution, isn’t it?

This worked wonderfully in testing…until I tried to use it to download multiple videos. I made a simple mistake in the code because it was getting late and I was excited to get things finished up. See if you can find the mistake before looking at my slightly embarrassing final commit below.

Python code mistake
Oops.

As I wrote this article, I realized that there was probably a cleaner way to randomize the list in Python. Sure enough, all of the code seen in that last commit can be replaced with just one thing: random.shuffle(lines), and thus dropped my new final commit.

Python randomization loop replaced with one-liner
One-line built-in is better!
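Put together, the archive preload ends up looking roughly like this (a sketch with assumed names, not the code that was merged):

```python
# Rough sketch of the final preload logic (assumed names, not the merged
# code), reusing the ArchiveTreeNode and tree_insert sketches above: read the
# archive once, shuffle the lines so a sorted file can't degenerate the tree,
# then insert everything in the now-random order.
import random

def load_download_archive(archive_path):
    try:
        with open(archive_path, 'r', encoding='utf-8') as archive:
            lines = [line.strip() for line in archive if line.strip()]
    except FileNotFoundError:
        return None
    if not lines:
        return None
    random.shuffle(lines)                 # defuses the sorted worst case
    root = ArchiveTreeNode(lines[0])
    for archive_id in lines[1:]:
        tree_insert(root, archive_id)
    return root
```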

I think the code speaks for itself, but if you’d like to see the end result, make sure you watch the video at the top of this article.

Addendum

I posted this article to Reddit and got some helpful feedback from some lovely people. It's obvious that I am not first and foremost a Python programmer: I didn't even think about using Python sets to do the job. (Of course a person who favors C shows up to Python and implements low-level data structures unnecessarily!) Multiple people suggested that I replace the binary search tree with Python sets, so…I did, fairly immediately. Here's what that looked like.

Using Python sets instead of a binary search tree
Code go bye bye

The Python set-based implementation is definitely easier to write and seems to work well. If I did this again, I'd probably skip the BST-with-shuffle approach and just use sets. The performance is almost as good as the BST's: I ran a test using OBS Studio to capture the output, then stepped frame by frame to find the beginning and end of the archive checking process. The set version took 9 seconds; the BST version took 8 seconds. While the set version looks prettier, the fact that the BST had already been merged upstream and is a bit faster means that the BST (despite probably offending some Python lovers) is here to stay. Actually, it turns out that I made a mistake: I tested the set version with 'python' while the BST version was compiled into an executable. After compiling the set version into an executable as well, it takes about 6-7 seconds instead. Excuse me while I send yet another pull request!
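For comparison, the set-based version boils down to something like this (a sketch with assumed names, not the exact pull request):

```python
# Sketch of the set-based approach (assumed names, not the exact pull
# request): load the archive into a set once; membership tests are then
# average O(1) hash lookups instead of tree walks or file scans.
def load_download_archive(archive_path):
    archive_ids = set()
    try:
        with open(archive_path, 'r', encoding='utf-8') as archive:
            for line in archive:
                stripped = line.strip()
                if stripped:
                    archive_ids.add(stripped)
    except FileNotFoundError:
        pass
    return archive_ids

def in_download_archive(archive_ids, archive_id):
    return archive_id in archive_ids
```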

If you have any feedback, feel free to leave a comment. The comment section is moderated, but fair.


Gillette: get woke, go broke? How Gillette’s “woke” advertisement stifled their parent company’s growth

Get woke, go broke? Some have argued that Gillette’s parent company benefited from the Gillette ad, citing the spike in stock value that took place over a week after the ad ran. Others argue that the ad harmed Gillette’s stock and was a net negative.

P&G growth before and after Gillette ad
P&G growth before and after the Gillette “The Best Men Can Be” ad. Yes, I know that I started a couple of the lines on the wrong data point, but the calculations and dates in the text are correct.

Here, we see the truth. Gillette’s parent company, P&G, was growing at a rate of $0.12 in value per day from May 31, 2018 to December 14, 2018. There was a big drop at Christmas (not unusual for a retail company’s value to drop after the busy holiday buying season ends) and there was a correction one month later. The Gillette ad ran near the end of the low period spanning the first three weeks in January.

Zoomed out, it may appear that the ad caused the spike in stock value, but zooming in shows that the entire week after the ad was put out represented a loss in value, NOT a gain. The facts speak for themselves: if the ad had any short-term effect at all, it was a negative effect.

The subsequent spike is also easily explained. Because the ad was controversial and the stock spent a week dropping, bullish investors could see that a correction was overdue from the December drop and took the opportunity to buy while the price was going down so they could ride the stock on its way back up. This kicked off the expected correction.

What is most interesting about the charts (and ignored by those who endorse fact-free “woke” political agendas) is the long-term growth rate. Before the drop and correction, they were gaining value at $0.12 per day, but after the Gillette ad ran (and the expected correction from the December drop was finished), this dropped to $0.10 per day, a loss of 1/6 (16.67%) of the entire company’s growth. P&G is a huge company and Gillette is only one of their many brands; 19 of their brands rake in over $1 billion annually, and Gillette is one of those brands.

1/19 of the company’s biggest brands killed 1/6 of the company’s growth with a single “woke” advertisement. Get woke, go broke, indeed.


This was written up because of a comment by “Old Blanco Rd Productions” on a Coffee Break video about “woke advertising” on YouTube. Notably, this commenter would repeatedly try to pull the conversation back to an advertisement by Nike featuring Colin Kaepernick, an ad which caused some controversy because Colin is the originator of the “take a knee during the national anthem” thing at football games. The same commenter didn’t want to talk about Gillette and P&G and was only interested in Nike and pushing the statement that “Nike added $6 billion in value after the Colin Kaepernick commercial!” Typical bullshit accusations of “sealioning” by a “Spencer Person” ensued, though sealioning is nothing more than a logically fallacious attempt to discredit people who demand that you support your arguments with facts. Quoting from that last link: “In other words, “sealioning” is a gag to be imposed upon people you disagree with if they argue with you for too long, too persistently, or in any fashion that you dislike.” To sum it up, these two people are keen to control the conversation so they don’t lose the argument. If you’ve read everything above, you can see why they’d rather accuse me of illegitimate tactics than to accept the cold, hard, high-low-close facts in the stock charts.

What that last paragraph leads up to is this: I watched both commercials. The Nike ad was only controversial because Colin was in it, but the ad itself is not actually a “woke” ad. It’s a typical Nike ad with a positive message that encourages you to get out there and be successful and stand up for yourself. It’s a well-done advertisement that does exactly what a major brand wants: to connect their brand with positive associations in the mind of the viewer. The Gillette ad, on the other hand, was a negative ad that stereotyped not only the entire male gender, but also visibly drew racial divides, with white males as mindless villainous rapists-in-waiting and black males as the only thing keeping them from raping everybody out here. Its “positive message” was nothing more than sprinkles on a racist, sexist, man-hating, race-baiting turd of a commercial that reinforces the premise that men are pieces of trash by default. The Nike ad worked out well for Nike because it was a good ad with a good message. The Gillette ad disproportionately hampered the growth of a company that has 18 other billion-dollar brands they derive value from because it was designed to press all of the controversial sociopolitical agenda buttons that it could.

Too many YouTube ads on mobile and TV? Bypass them with SaveTube and Plex!

Adblock Plus does a great job of blocking ads on desktops and laptops, but on “app” devices like phones and TVs, ad blocking simply isn’t available. This isn’t a solution for blocking ads during casual browsing of YouTube on mobile and TV apps, but it’s a great way to get videos you know you’ll want to watch later and watch them on “app” devices with no more ad breaks ruining everything. You need two things: Sebaro’s SaveTube (and a browser extension like Greasemonkey to enable running user scripts if your browser doesn’t have native user script support) and a Plex Media Server setup. I can’t explain in this post how to install browser extensions like Greasemonkey or Tampermonkey and add SaveTube to it, nor can I explain how to install and set up Plex, but your favorite search engine and the Plex website can tell you everything you need to know.

  1. Install SaveTube.
  2. Have Plex installed, set up, and working. You’ll need Plex Media Server on your computer and the Plex app on your phone, tablet, TV, Chromecast, Roku, or whatever you happen to own.
  3. Create a new folder for your YouTube videos and add it to your Plex Media Server. Call it “YouTube videos” or whatever you like and classify the folder in Plex as “other videos.”
  4. Open a video you’d like to watch without ads. SaveTube should appear at the bottom right corner of the screen if you have it installed properly.
  5. Click "Get" in SaveTube to save a copy of the video to your computer. This defaults to the highest-resolution MP4 that contains both audio and video in a single file. In general, you can't download 1080p or 4K content already in this "muxed" format, so it's best not to mess with the default choice. SaveTube's red-orange options are "DASH streams" that contain ONLY audio or ONLY video, meaning you'd have to download and remux separate streams yourself to get the qualities that aren't offered in blue. As you can tell by now, it's easiest to just hit "Get" and leave the other SaveTube options alone.
  6. Repeat steps 4 and 5 for every video you want.
  7. Move the YouTube video(s) you downloaded to the folder you made and added to Plex.
  8. Open the Plex Media Server web portal, click “…” to the right of your YouTube video library, and have it “scan library files.” This will refresh the list of available videos with your new additions.
  9. Open Plex on your device of choice and enjoy the videos you fetched 100% ad-free!

Legal disclaimer: I am not a lawyer, but for a layperson I have a fairly strong understanding of copyright law. It is my non-professional opinion that the actions taken here do not constitute copyright infringement or a violation of the non-circumvention provisions of the DMCA. YouTube's page code has a block of text called "var ytplayer" that contains unencrypted URL-encoded URLs to these video streams, and there are no access controls on the server that restrict you from accessing videos through the links in that block of code. As long as you follow standard copyright law restrictions for the content (no redistribution, no public exhibition, etc.) this should be 100% legal. If you're asking why this would be legal when "YouTube-to-MP3 sites" aren't, it's because those sites engage in redistribution of content, which is an act reserved to the copyright holder. In fact, you can do the MP3 conversion yourself if you use SaveTube to download a "medium bitrate audio MP4" and use a free conversion tool like ffmpeg, Audacity, or Format Factory to convert the .m4a file to .mp3 format instead.

If you’re an attorney (a real attorney, not a person like me that simply thinks they’re smart and probably isn’t) and you believe that this procedure is not legal, please feel free to contact me and explain.

A proposed bumper for The Basic Filmmaker

The Basic Filmmaker put out a call on YouTube for people to make a new bumper for his channel. This is what I came up with in a few hours. I think some aspects of it could be improved but I’m pretty happy with it as-is. The five-second time limit was a bit constraining and I would probably do it a little differently if I made it a second time around. Anyone who has watched my short films might recognize a couple of the clips in this video from the “Behind the Scenes” during the credits of The Old Man’s Pendant II.