Tag: programming


I made youtube-dl faster for archivists…and solved a worst-case programming problem elegantly in the process

Update: there is an addendum at the end of this article; I mention it because yes, in the end, I switched over to Python sets. I don’t want any Python experts cringing too hard, after all. Welcome to the joys of a C programmer adapting to Python.

For those who haven’t heard of it, youtube-dlc is a much more actively maintained fork of the venerable but stagnant youtube-dl project that was announced on the DataHoarder subreddit. I have been archiving YouTube channels for many months now, trying to make sure that the exponential boost in censorship leading up to the 2020 U.S. Presidential election doesn’t cause important discussions to be lost forever.

Unfortunately, this process has led me to have a youtube-dl archive file containing well over 70,000 entries, and an otherwise minor performance flaw in the software had become a catastrophic slowdown for me. (Side note: a youtube-dl archive file contains a list of videos that were already completely downloaded and that lets you prevent re-downloading things you’ve already snagged.) Even on a brand new Ryzen 7 3700X, scanning the archive file for each individual video would sometimes progress at only a few rejections per second, which is very bad when several of the channels I archive have video counts in the multiple thousands. The computer would often spend multiple minutes just deciding not to download all of the videos on a channel, and that got painful to watch. That’s time that could be spent checking another channel or downloading a video.

When youtube-dlc popped up and offered goodwill to the people who had been treated less than favorably by the youtube-dl project maintainers, I realized that I had a chance to get this fixed. I opened an issue for it, but it became clear that the budding fork didn’t have the resources to dedicate to the necessary improvement; the existing code technically worked, and performance wasn’t as bad for people using much smaller archive files. The youtube-dlc maintainer (blackjack4494) and I both quickly identified the chunk of code behind the problem, so I decided to take it on myself.

Discussion about the code causing the problem
There’s our problem!

The troublesome code is cut off in the discussion screenshot, so here’s a better look at it:

Problematic code from youtube-dl
Good programmers probably know why this is slow just from this image.

The code outlined in the red box above opens the archive file for reading, consumes it line by line, and compares each line against the ID string for the candidate video. If a matching line exists, the video is already in the archive and shouldn’t be downloaded, so in_download_archive() returns True. This works, and it’s a very simple way to implement the archive file feature, but there’s a problem: in_download_archive() is invoked for every single download candidate, so the file is opened and read over and over again.
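
Stripped of its surroundings, the original logic boils down to something like this (a simplified sketch with hypothetical names, not the verbatim youtube-dl source):

# Simplified sketch of the original archive check: the archive file is
# re-opened and re-read from the top for every single candidate video.
def in_download_archive(archive_path, video_id):
    try:
        with open(archive_path, 'r', encoding='utf-8') as archive:
            for line in archive:
                if line.strip() == video_id:
                    return True
    except FileNotFoundError:
        pass
    return False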

Some programmers may see this and ask “why is this a problem? The archive file is read for every video, so there’s a good chance it will remain in the OS disk cache, so reading the file over and over becomes an in-memory operation after the first time it’s read.” Given that my archive of over 70,000 lines is only 1.6 MiB in size, that seems to make some sense. What is being missed in all of this is the massive overhead of reading and processing a file, especially in a high-level language like Python.


An aside for programmers not so familiar with more “bare metal” programming languages: in C, you can use a lot of low-level trickery to work with raw file data more quickly. If I were implementing this archive check in C (omitting some low-level steps here), I’d repeatedly scan the buffer for a newline, reverse the scan direction until any extra white space was jumped over (doing what strip() does in Python), convert the last byte of newline/white space after the text to a null byte, and do a strcmp(buffer, video_id) to check for a match. This still incurs all of the OS and call overhead of buffered file reads, but it uses a minimal amount of memory and performs extremely fast comparisons directly on the file data.

In a language like Python, Java, or PHP that does a lot of abstraction and memory management work for us, a lot more CPU-wasting activity goes on under the hood to read the file line by line. Sure, Python does it in 4 lines of code where C would take more like 15-20 lines, but unpack what Python is doing for you within those lines:

  1. Allocating several variables
  2. Opening the file
  3. Allocating a read buffer
  4. Reading the file into the buffer
  5. Scanning the buffer for newlines
  6. Copying each line into the “line” variable one at a time
  7. Trimming leading and trailing white space on the line, which means
    • Temporarily allocating another buffer for strip() to work in
    • Copying the string being stripped into it…
    • …while checking for and skipping the white space…
    • …and copying the string back out of that buffer
  8. Finally doing the string comparison
  9. All while maintaining internal reference counts for every allocated item and periodically checking for garbage collection opportunities.

Multiply the above by 2,000 video candidates, run it against an archive file with 70,000 entries, and you can easily end up with steps 6, 7, and 8 being executed almost 140,000,000 times if the matching strings are at the end of the archive. Python and other high-level languages make coding this stuff a lot easier than C, but they also make it dead simple to slaughter your program’s performance, since so many low-level details are hidden from the programmer.


I immediately recognized that the way to go was to read the archive file one time at program startup rather than reading it over and over for every single download candidate. I also recognized that this is exactly the kind of lookup problem a binary search tree (BST) is designed to speed up. Thus, my plan was to do the same line-by-line read-and-strip as the existing code, store each processed line in a BST, and then, instead of reading the file within in_download_archive(), scan the BST for the string. The neat thing about a BST is that if it were perfectly balanced, 70,000 entries would only be 17 levels deep, meaning each check would perform at most 17 string comparisons, a much better number than the 70,000-comparison worst case of scanning a flat text file line by line.

So, I set out to make it happen, and thus my first commit in pursuit of the goal dropped.

Binary search tree Python code
The workhorse code; a classic binary search tree
Archive preload code that feeds the binary search tree
Archive preload code that feeds the binary search tree
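
Since the screenshots don’t reproduce well here, the first version amounted to roughly this (a simplified sketch of the idea, not the exact committed code):

# Rough sketch of the first attempt: a plain, unbalanced binary search tree
# holding the stripped archive lines, built once at startup.
class ArchiveNode:
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None

    def insert(self, value):
        if value < self.value:
            if self.left is None:
                self.left = ArchiveNode(value)
            else:
                self.left.insert(value)    # recursion: the source of the next problem
        elif value > self.value:
            if self.right is None:
                self.right = ArchiveNode(value)
            else:
                self.right.insert(value)
        # equal: the entry is already in the tree, so do nothing

    def contains(self, value):
        if value == self.value:
            return True
        if value < self.value and self.left is not None:
            return self.left.contains(value)
        if value > self.value and self.right is not None:
            return self.right.contains(value)
        return False

The preload step read the archive once, stripped each line, and fed it to insert(); in_download_archive() then became a single call to contains().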

This actually worked nicely!…until I tried to use my huge archive instead of a tiny test archive file, and then I was hit with the dreaded “RuntimeError: maximum recursion depth exceeded while calling a Python object.” Okay, so the recursion is going way too deep, right? Let’s remove it…and thus, my second commit dropped (red is removed, green is added).

Python code change from recursion to a loop
Let’s do it in a loop instead!

With the recursion swapped out for a loop, the error was gone…but a new problem surfaced, and unfortunately it was one that comes with no helpful error message. When fed my archive file, performance was so terrible that the program seemed to simply…hang. I put some debug print statements in the code to see what was going on and immediately noticed that every insert operation decided to GO RIGHT when searching the tree for the correct place to add the next string. There was not one single LEFT decision in the entire flood of debug output. That’s when I finally realized the true horror that I had stepped into.

I had sorted my archive…which KILLED the performance.

Binary search trees are great for a lot of data, but there is a rare yet very real worst-case scenario where the tree ends up becoming nothing more than a bloated linked list. This doesn’t happen with random or unordered data, but it often happens when the tree is populated in order with data that is already sorted. Every line in my sorted file was “greater than” the line before it, so when fed my sorted archive, the tree became an overly complex linked list. The good news is that most people will not have a sorted archive file, thanks to the randomness of the video ID strings; the bad news is that I had sorted mine because it boosted overall archive checking performance. (New video downloads are appended to the archive, so the most recent stuff is always at the end, meaning rejecting those newer downloads under the original archive code always required the worst-case amount of time.) It is entirely possible that someone else would sort their archive at some point, so I had accidentally set myself up in the worst-case scenario and I couldn’t just ignore it and assume no one else would make the same mistake. I had to fix it.
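
If you want to see the degeneration for yourself, here’s a quick standalone demonstration with made-up keys (not part of youtube-dlc): inserting pre-sorted keys one at a time produces a tree as deep as the number of keys, while shuffled keys stay only tens of levels deep.

# Measure the maximum depth reached while inserting keys into a naive BST.
import random

def max_bst_depth(keys):
    root = None
    deepest = 0
    for key in keys:
        depth = 1
        if root is None:
            root = [key, None, None]            # [value, left, right]
        else:
            node = root
            while True:
                depth += 1
                branch = 1 if key < node[0] else 2
                if node[branch] is None:
                    node[branch] = [key, None, None]
                    break
                node = node[branch]
        deepest = max(deepest, depth)
    return deepest

keys = ['video%05d' % i for i in range(10000)]
print(max_bst_depth(keys))     # sorted input: 10,000 levels (a glorified linked list)
random.shuffle(keys)
print(max_bst_depth(keys))     # shuffled input: roughly 30 levels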

I got three-quarters of the way through changing over to a weighted BST before realizing that it would not improve the load times and would only help the checking behavior later, so that code was scrapped without a commit. I had previously added weighted trees with rebalancing to jdupes, but removed them when all tests over time showed they didn’t improve performance.

How do you optimally feed sorted data into a BST? Ideally, you’d put the data into a list, add the middle element, then keep taking the midpoints of the left and right halves alternately until you run out of data to add (see this excellent tutorial for a much better explanation with pictures). Unfortunately, that requires a fair amount of bookkeeping: you have to track which ranges you’ve already added, the number of ranges to track doubles with every level, and it would have taken some time to implement. I decided to try something a little simpler: split the list into halves, then alternate between consuming each half from both ends until the pointers met. Thus, the third commit dropped.

Python code to attempt to add a sorted list to a binary tree
This didn’t work.

This commit had two problems: the pointer checks failed to copy two of the list elements, and the improvement in behavior was insufficient. It was faster, but we’re talking about an improvement best described as “it seemed to just hang before, but now it completes before I get mad and hit CTRL-C to abort.” Unacceptable, to be sure. I stepped away from the computer for a while and thought about the problem. The data ideally needs to be more random than sorted. Is there an easier way? Then it hit me.

I used ‘sort’ to randomize my archive contents. Problem: gone.

All I needed to do was randomize the list order, then add the randomized list the easy way (in-order). Could it really be that simple?! Yes, yes it can! And so it was that the fourth commit dropped (red is removed, green is added).

Python code with an elegant solution
Surprisingly simple solution, isn’t it?

This worked wonderfully in testing…until I tried to use it to download multiple videos. I made a simple mistake in the code because it was getting late and I was excited to get things finished up. See if you can find the mistake before looking at my slightly embarrassing final commit below.

Python code mistake
Oops.

As I wrote this article, I realized that there was probably a cleaner way to randomize the list in Python. Sure enough, all of the code seen in that last commit can be replaced with a single call: random.shuffle(lines). And thus dropped my new final commit.

Python randomization loop replaced with one-liner
One-line built-in is better!
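
Put together, the preload step conceptually ended up looking like this (a sketch built on the ArchiveNode class from the earlier sketch, not a verbatim copy of the merged code):

# Conceptual version of the final preload: read the archive once, strip each
# line, shuffle to defeat the sorted-input worst case, then insert in order.
import random

def preload_download_archive(archive_path):
    try:
        with open(archive_path, 'r', encoding='utf-8') as archive:
            lines = [line.strip() for line in archive if line.strip()]
    except FileNotFoundError:
        return None
    random.shuffle(lines)              # randomized insertion order keeps the tree shallow
    tree = None
    for line in lines:
        if tree is None:
            tree = ArchiveNode(line)   # ArchiveNode from the earlier sketch
        else:
            tree.insert(line)
    return tree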

I think the code speaks for itself, but if you’d like to see the end result, make sure you watch the video at the top of this article.

Addendum

I posted this article to Reddit and got some helpful feedback from some lovely people. It’s obvious that I am not first and foremost a Python programmer, and I didn’t even think about using Python sets to do the job. (Of course a person who favors C will show up to Python and implement low-level data structures unnecessarily!) Multiple people suggested that I replace the binary search tree with Python sets, so…I did, almost immediately. Here’s what that looked like.

Using Python sets instead of a binary search tree
Code go bye bye
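
Conceptually, the set-based replacement is about as short as it gets (again, a sketch rather than the exact committed code):

# Set-based version: load the archive once into a Python set and let hashing
# do the work; membership tests are O(1) on average.
def preload_download_archive(archive_path):
    archive = set()
    try:
        with open(archive_path, 'r', encoding='utf-8') as archive_file:
            for line in archive_file:
                line = line.strip()
                if line:
                    archive.add(line)
    except FileNotFoundError:
        pass
    return archive

def in_download_archive(archive, video_id):
    return video_id in archive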

The Python set-based implementation is definitely easier to write and works well; if I did this again, I’d skip the BST-plus-shuffle approach and just use sets. To compare performance, I used OBS Studio to capture the program’s output, then stepped through the recording frame by frame to find the beginning and end of the archive checking process: the set version took 9 seconds, while the BST version took 8 seconds. Since the BST version had already been merged upstream and appeared to be a bit faster, my plan was to keep it (despite probably offending some Python lovers). It turns out that I had made a testing mistake, though: I ran the set version with ‘python’ while the BST version had been compiled into an executable. After compiling the set version into an executable as well, it takes about 6-7 seconds instead. Excuse me while I send yet another pull request!

If you have any feedback, feel free to leave a comment. The comment section is moderated, but fair.

Cute cloned dogs

A CHALLENGER APPEARS: “fclones”…fastest duplicate scanner ever? It’s complicated.

While perusing my link referrals on GitHub, I noticed this thread where my duplicate scanner jdupes was mentioned. I then noticed the comment below it:

There is also a much faster modern alternative to fdupes and jdupes: fclones. It searches for files in parallel and uses a much faster hash function than md5.

My response comment pretty much says it all, so I’m making that the entire remainder of this post.

I noticed that fclones does not do the byte-for-byte safety check that jdupes (and fdupes) does. It also relies exclusively on a non-cryptographic hash for comparisons. It is unsafe to rely on a non-cryptographic hash as a substitute for the file data, and comparisons between duplicate finders running in full-file comparison mode and those running in hash-and-compare mode are not appropriate. The benchmark on the fclones page ran jdupes 1.14 without the -Q option that disables the final byte-for-byte confirmation, so jdupes was doing a lot of extra work to avoid potential data loss that fclones skipped entirely.

jdupes already uses a faster hash function than MD5 (xxHash64 as of this writing, previously jodyhash), and it is fairly trivial to switch to even faster hash functions if desired…but the fact is that once you switch from a cryptographic hash to any “fast hash” function, the hash function is rarely a bottleneck, especially compared to the I/O bottleneck represented by most consumer-grade hard drives and low-end SSDs. If everything to be checked is already in the buffer cache, it might be a bottleneck, but the vast majority of duplicate scans are performed on data that is not cached.

Searching for files in parallel is only an advantage if disk I/O is not a bottleneck, and you’ll notice that the fclones author performed the dedupe benchmarks on a 512GB NVMe SSD (presumably very fast, since it’s paired with a relatively recent Xeon) and an extremely fast multi-core, multi-threaded processor. Random read I/O carries a very small access time penalty on a fast NVMe SSD, but an extremely large one on a traditional rotating hard drive or a RAID array composed of several hard drives. Multiple threads firing off reads against the same RAID array at the same time will slow even most RAID arrays to a single-digit MB/sec death crawl. I understand that many people will be working with SSDs, and some duplicate scanner programs will be a better choice for SSDs, but the majority of computer systems have spinning rust rather than flash-based disks.

It is strongly advisable to (A) run your own benchmarks on your specific workload and hardware, and (B) understand how to use the program within your own personal acceptable level of risk. Both of these are different for every different person’s needs.

UPDATE: I found another instance of the fclones author claiming jdupes being single-threaded makes it slow; to quote directly:

Unfortunately these older programs are single-threaded, and on modern hardware (particularly on SSDs) they are order of magnitude slower than they could be. If you have many files, a better option is to use fclones (disclaimer: I’m the author), which uses multiple threads to process many files in parallel and offers additional filtering options to restrict the search.

The points I’ve made above still stand. Unless you’re running the author’s enterprise-grade high-end hardware, your disk’s random access latency is the major limiting factor. I’d love to see what fclones does on something like a 24TB disk array. I’d wager, exactly as stated above, that 8 or 32 simultaneous I/O threads bring the whole process to a death crawl. Perhaps I should bite the bullet and run the tests myself.

UPDATE 2: I was right. Benchmark article and analysis forthcoming.

Featured image licensed under CC-BY from Steve Jurvetson: https://www.flickr.com/photos/jurvetson/3327872958

The key to faster shell scripts: know your shell’s features and use them!

I have a cleanup program that I’ve written as a Bash shell script. Over the years, it has morphed from a thing that just deleted a few fixed directories if they existed at all (mostly temporary file directories found on Windows) to a very flexible cleanup tool that can take a set of rules and rewrite and modify them to apply to multiple versions of Windows, along with safeguards that check the rules and auto-rewritten rules to prevent the equivalent of an “rm -rf /*” from happening. It’s incredibly useful for me; when I back up a customer’s PC data, I run the cleaner script first to delete many gigabytes of unnecessary junk and speed up the backup and restore process significantly.

Unfortunately, the internal rewrite and safety check rules have the side effect of massively slowing down the process. I had been tolerating the slowness for a long time, but as the rule set grew over the past few years, the script took longer and longer to complete, so I finally decided to find out what was really going on and fix this speed problem.

Profiling shell scripts isn’t quite as easy as profiling C programs; with C, you can just use a tool like Valgrind to find out where all the effort is going, but shell scripts depend on the speed of the shell, the kernel, and the plethora of programs executed by the script, so it’s harder to follow what goes on and find the time sinks. However, I observed that a lot of time was spent in the steps between deleting items; since each rewrite and safety check is done on-the-fly as deletion rules are presented for processing, those were likely candidates. The first thing I wanted to know was how many times the script called an external program to do work; you can easily kill a shell script’s performance with unnecessary external program executions. To gather this info, I used the strace tool:

strace -f -o strace.txt tt_cleaner

This produced a file called “strace.txt” which contains every single system call issued by both the cleaner script and any forked programs. I then looked for the execve() system call and gathered the counts of the programs executed, excluding “execve resumed” events which aren’t actual execve() calls:

grep execve strace.txt | sed 's/.*execve/execve/' | cut -d\" -f2 | grep -v resumed | sort | uniq -c | sort -g

The resulting output consisted of numbers below 100 until the last two lines, and that’s when I realized where the bottleneck might be:

4157 /bin/sed
11227 /usr/bin/grep

That’s a LOT of calls to sed, but the number of calls to grep was almost three times bigger, so that’s where I started searching for ways to improve. As I’ve said, the rewrite code takes each rule for deletion and rewrites it for other possible interpretations: “Username\Application Data” on Windows XP was moved to “Username\AppData\Roaming” on Vista and up, “All Users\Application Data” was moved to “C:\ProgramData” in those same versions, plus there is a potential mirror of every single rule in “Username\AppData\Local\VirtualStore”. The rewrite code handles the expansion of the deletion rules to cover every one of these possible cases. The outer loop of the rewrite engine grabs each rewrite rule in order while the inner loop does the actual rewriting to the current rule AND all prior rewrites to ensure no possibilities are missed (VirtualStore is largely to blame for this double-loop architecture). This means that anything done within the inner loop is executed a huge number of times, and the very first command in the inner loop looked like this:

if echo "${RWNAMES[$RWNCNT]}" | grep -qi "${REWRITE0[$RWCNT]}"

This checks to see if the rewrite rule applies to the cleaner rule before doing the rewriting work. It calls grep once for every single iteration of the inner loop. I replaced this line with the following:

if [[ "${RWNAMES[$RWNCNT]}" =~ .*${REWRITE0[$RWCNT]}.* ]]

I also had to tack a “shopt -s nocasematch” onto the top of the shell script to make the comparison case-insensitive. The result was a 6x speed increase: testing on an existing data backup which had already been cleaned (no “work” to do) showed a consistent time reduction from 131 seconds to 22 seconds! The grep call count dropped massively, too:

97 /usr/bin/grep

Bash can do wildcard and regular expression matching of strings (the =~ comparison operator is a regex match), so anywhere your shell script uses the “echo | grep” combination in a loop stands to benefit greatly from these Bash features. Unfortunately, they are not POSIX shell features and using them will lead to non-portable scripts, but if you will never run the script on other shells and the performance boost is significant, why not use them?

The bigger lesson here is that you should take some time to learn about the features offered by your shell if you’re writing advanced shell scripts.

Update: After writing this article, I set forth to eliminate the thousands of calls to sed. I was able to change an “echo-sed” combination to a couple of Bash substring substitutions. Try it out:

FOO=${VARIABLE/string_to_replace/replacement}

It accepts $VARIABLES in both the search and replacement strings, so it’s quite powerful. Best of all, the total runtime dropped to 10.8 seconds, for a total speed boost of over 11x!

Finding Duplicates Faster: The story of ‘jdupes’, or how I unexpectedly became a better programmer

The problem of finding and handling duplicate files has been with us for a long time. Since the end of the year 1999, the de facto answer to “how can I find and delete duplicate files?” for Linux and BSD users has been a program called ‘fdupes’ by Adrian Lopez. This venerable staple of system administrators is extremely handy when you’re trying to eliminate redundant data to reclaim some disk space, clean up a code base full of copy-pasted files, or delete photos you’ve accidentally copied from your digital camera to your computer more than once. I’ve been quite grateful to have it around–particularly when dealing with customer data recovery scenarios where every possible copy of a file is recovered and the final set ultimately contains thousands of unnecessary duplicates.

Unfortunately, development on Adrian’s fdupes had, for all practical purposes, ground to a halt. From June 2014 to July 2015, the only significant functional changes to the code were modifications to make it compile on Mac OS X. The code’s stagnant nature has definitely shown itself in real-world tests: in February 2015, Eliseo Papa published “What is the fastest way to find duplicate pictures?”, which contains benchmarks of 15 duplicate file finders (including an early version of my fork, which we’ll ignore for the moment) and places the original fdupes dead last in operational speed, showing it to be heavily CPU-bound rather than I/O-bound. In fact, Eliseo’s tests say that fdupes takes a minimum of 11 times longer to run than 13 of the other duplicate file finders in the benchmark!

As a heavy user of the program on fairly large data sets, I had noticed the poor performance of the software and became curious as to why it was so slow for a tool that should simply be comparing pairs of files. After inspecting the code base, I found a number of huge performance killers:

  1. Tons of time was wasted waiting on progress to print to the terminal
  2. Many performance-boosting C features weren’t used (static, inline, etc)
  3. A couple of one-line functions were very “hot,” adding heavy call overhead
  4. Using MD5 for file hashes was slower than other hash functions
  5. Storing MD5 hashes as strings instead of binary data was inefficient
  6. A “secure” hash like MD5 isn’t needed; matches get checked byte-for-byte

I submitted a pull request to the fdupes repository which solved these problems in December 2014. Nothing from the pull request was discussed on Github and none of the fixes were incorporated into fdupes. I emailed Adrian to discuss my changes with him directly and there was some interest in certain changes, but in the end nothing was changed and my emails became one-way.

It seemed that fdupes development was doomed to stagnation.

In the venerable traditions of open source software, I forked it and gave my new development tree a new name to differentiate it from Adrian’s code: jdupes. I solved the six big problems outlined above with these changes:

  1. Rather than printing progress indication for every file examined, I added a delay counter to drastically reduce terminal printing. This was a much bigger deal when using SSH.
  2. I switched the code and compilation process to use C99 and added relevant keywords to improve overall performance.
  3. The “hot” one-line functions were changed to #define functions to chop function call overhead for them in half.
  4. (Also covers 5 and 6) I wrote my own hash function and replaced all of the MD5 code with it, resulting in a benchmarked speed boost of approximately 17%. The resulting hashes are passed around as a 64-bit unsigned integer, not an ASCII string, which (on 64-bit machines) reduces hash comparisons to a single compare instruction.

 

After forking all of these changes and enjoying the massive performance boost they brought about, I felt motivated to continue looking for potential improvements. I didn’t realize at the time that a simple need to eliminate duplicate files more quickly would morph into spending the next half-year ruthlessly digging through the code for ways to make things better. Between the initial pull request that led to the fork and Eliseo Papa’s article, I managed to get a lot done:

 

At this point, Eliseo published his February 19 article on the fastest way to find duplicates. I did not discover the article until July 8 of the same year (at which time jdupes was at least three versions higher than the one being tested), so I was initially disappointed with where jdupes stood in the benchmarks relative to some of the other tested programs, but even the early jdupes (version 1.51-jody2) code was much faster than the original fdupes code for the same job.

1.5 months into development, jdupes was 19 times faster in a third-party test than the code it was forked from.

Nothing will make your programming efforts feel more validated than seeing something like that from a total stranger.

Between the publishing of the article and finding the article, I had continued to make heavy improvements:

 

When I found Eliseo’s article from February, I sent him an email inviting him to try out jdupes again:

I have benchmarked jdupes 1.51-jody4 from March 27 against jdupes 1.51-jody6, the current code in the Git repo. The target is a post-compilation directory for linux-3.19.5 with 63,490 files and 664 duplicates in 152 sets. A “dry run” was performed first to ensure all files were cached in memory first and remove variances due to disk I/O. The benchmarking was as follows:

$ ./compare_fdupes.sh -nrq /usr/src/linux-3.19.5/
Installed fdupes:
real 0m1.532s
user 0m0.257s
sys 0m1.273s

Built fdupes:
real 0m0.581s
user 0m0.247s
sys 0m0.327s

Five sequential runs were consistently close (about ± 0.020s) to these times.

In half a year of casual spare-time coding, I had made fdupes 32 times faster.

There’s probably not a lot more performance to be squeezed out of jdupes today. Most of my work on the code has settled down into working on new features and improving Windows support. In particular, Windows has supported hard linked files for a long time, and I’ve taken full advantage of Windows hard link support. I’ve also made the progress indicator much more informative to the user. At this point in time, I consider the majority of my efforts complete. jdupes has even gained inclusion as an available program in Arch Linux.

Out of the efforts undertaken in jdupes, I have gained benefits for other projects as well. For example, I can see the potential for using the string_table allocator in other projects that don’t need to free() string memory until the program exits. Most importantly, my overall experience with working on jdupes has improved my overall programming skills tremendously and I have learned a lot more than I could have imagined would come from improving such a seemingly simple file management tool.

If you’d like to use jdupes, feel free to download one of my binary releases for Linux, Windows, and Mac OS X. You can find them here.

I hate Java. I hate Java. I hate Java.

I’ll readily admit, my programming experience is mostly limited to 6502/65816 assembler, some C, and a lot of PHP/MySQL, but I already know that I hate Java.  Why?  It’s simple, really: it doesn’t make any sense at all, and it’s extremely unhelpful when something goes wrong.

This rant stems from working on a Java IRC bot that was torn up and rebuilt by someone for a custom purpose.  I was hosting the bot until it simply stopped working.  It choked up and wouldn’t start after a certain revision, despite working on the guy’s Windows box.  I snagged a newer JRE, and instead of the horrid 12-line error when trying to start it, I get nothing but “IO exception occured.”  Thanks for the informative message, really.  I’m so glad to know that an “IO” (don’t you mean I/O?) exception occurred.  Previously, when I tried to manipulate the code myself, I couldn’t even change it to do the most basic things.  Why not?  Because Java doesn’t make sense at all, especially to someone used to working with C and PHP (you know, real programming languages).  A lot of Java-heads will moan about my opinion or offer up lame excuses for Java, but the truth is that it’s a garbage language that doesn’t make any sense, and from what I’ve read its “standards” change as the Sun JRE releases incrementally move up.  I won’t touch it with a ten-foot pole.