What WON’T speed up the jdupes duplicate file finder

Some of you are probably aware that I’m the person behind the jdupes duplicate file finder. It’s amazing how far it has spread over the past few years, especially considering that it started as nothing more than me speeding up fdupes because it was too slow, with no real plans to release my changes. Over the years, I’ve done pretty much everything that had a chance of significantly speeding up the program, but there are always novel ideas out there about what could be done to make things even better. I’ve received a lot of suggestions, both by email and on the jdupes issue tracker on GitHub, and while some of them have merit, there are quite a few that come up time and time again. It’s time to swat some of these down more publicly, so here is a list of things that people suggest to speed up jdupes but that won’t really do the job.

Switching hash algorithms

One way I sped up jdupes after forking the original fdupes code was to swap out the MD5 secure hash algorithm for my own custom “jodyhash” fast hash algorithm. This made a huge difference in program performance. MD5 is CPU-intensive to calculate, but jodyhash was explicitly written to use primitive CPU operations that translate directly into simple, fast, and compact machine language instructions. After I discovered that jodyhash had some potentially undesirable properties (though those properties had zero effect in practical testing on real-world data), I switched to the slightly faster xxHash64 fast hash algorithm. Still, there are those who suggest changing the hash algorithm yet again to improve performance further. Candidates such as t1ha are certainly a little faster than xxHash64, but switching to them has no real value. I chose xxHash64 in part because it is contained in a single .c/.h file pair, making it particularly easy to include with the program, while some replacement hash code bases are not so easily included. Even if they were, the hash algorithm won’t make enough of a difference to matter in any real-world workload. The problem is that the vast majority of the slowness in jdupes stems from waiting on I/O operations to complete, not from CPU usage. That isn’t true of fdupes, which still stubbornly uses MD5 as its hash algorithm, but jdupes spends a ridiculous amount of time waiting on the operating system to complete disk reads and a very tiny amount of time waiting on hash calculations to complete.
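If you want to put a number on that yourself, here’s a rough, hypothetical C sketch (not jdupes code; the chunk size and the timing approach are my own choices for illustration) that reads a file in chunks and separately times the read calls and the xxHash64 updates. On a cold cache and a mechanical disk, the read time dwarfs the hash time:

```c
/* io_vs_hash.c - a rough sketch (not jdupes code) that times how long is
 * spent reading a file versus hashing it with xxHash64.
 * Build (assuming libxxhash is installed): cc io_vs_hash.c -lxxhash -o io_vs_hash
 * Run: ./io_vs_hash somefile
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <xxhash.h>

#define CHUNK 65536

static double elapsed(struct timespec a, struct timespec b)
{
	return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(int argc, char **argv)
{
	static char buf[CHUNK];
	struct timespec t0, t1, t2;
	double read_time = 0.0, hash_time = 0.0;
	size_t len;
	FILE *fp;
	XXH64_state_t *state = XXH64_createState();

	if (argc != 2 || state == NULL) return 1;
	fp = fopen(argv[1], "rb");
	if (fp == NULL) return 1;
	XXH64_reset(state, 0);

	for (;;) {
		clock_gettime(CLOCK_MONOTONIC, &t0);
		len = fread(buf, 1, CHUNK, fp);   /* time spent waiting on I/O */
		clock_gettime(CLOCK_MONOTONIC, &t1);
		if (len == 0) break;
		XXH64_update(state, buf, len);    /* time spent hashing */
		clock_gettime(CLOCK_MONOTONIC, &t2);
		read_time += elapsed(t0, t1);
		hash_time += elapsed(t1, t2);
	}
	printf("hash %016llx: read %.3fs, hash %.3fs\n",
	       (unsigned long long)XXH64_digest(state), read_time, hash_time);
	XXH64_freeState(state);
	fclose(fp);
	return 0;
}
```

Run it on a large file that isn’t already sitting in the page cache and the point makes itself: swapping xxHash64 for an even faster hash only shrinks the number that was already tiny.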

Tree balancing

At one point, I wrote a spiffy bit of tree rebalancing code that would walk down the file tree and change the parent-child relationships to more fairly balance out the tree depth of any given branch. It turned out that a hash algorithm with even minimally decent randomization mostly balances things out from the start, so my concerns about excessive tree depth were unfounded; the rebalancing code did nothing to improve overall performance and was ultimately scrapped. fdupes tried to use red-black trees at one point but discarded the implementation for a similar reason: the gains were insufficient. The file tree built in jdupes tends to balance out reasonably well on its own.
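As a hedged illustration of why the rebalancing wasn’t needed (this is a toy, not the actual jdupes tree code, and the mixing function merely stands in for a fast hash like xxHash64), inserting reasonably well-randomized keys into a plain unbalanced binary tree keeps the depth close to the theoretical minimum all by itself:

```c
/* tree_sketch.c - a simplified illustration (not the jdupes tree code) of why
 * reasonably random hash keys keep an unbalanced binary tree shallow.
 * Build and run: cc tree_sketch.c -o tree_sketch && ./tree_sketch
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

struct node {
	uint64_t hash;                  /* stand-in for a file's partial hash */
	struct node *left, *right;
};

/* Plain insertion with no rebalancing at all; returns the depth reached */
static int insert(struct node **root, uint64_t hash)
{
	int depth = 1;
	while (*root != NULL) {
		root = (hash < (*root)->hash) ? &(*root)->left : &(*root)->right;
		depth++;
	}
	*root = calloc(1, sizeof(struct node));
	if (*root == NULL) exit(1);
	(*root)->hash = hash;
	return depth;
}

/* Crude 64-bit mixer standing in for a fast hash function */
static uint64_t mix(uint64_t x)
{
	x ^= x >> 33; x *= 0xff51afd7ed558ccdULL;
	x ^= x >> 33; x *= 0xc4ceb9fe1a85ec53ULL;
	x ^= x >> 33;
	return x;
}

int main(void)
{
	struct node *root = NULL;
	int max_depth = 0;

	for (uint64_t i = 0; i < 100000; i++) {
		int d = insert(&root, mix(i));
		if (d > max_depth) max_depth = d;
	}
	/* A perfectly balanced tree of 100,000 nodes is 17 levels deep;
	 * randomized keys typically land within a small factor of that. */
	printf("maximum depth: %d\n", max_depth);
	return 0;
}
```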

Delete during scanning

This seems like a good idea on paper (and indeed, fdupes has implemented it as an option), but it’s not a good idea in practice for most cases. It doesn’t necessarily speed things up very much, and it guarantees that options which work on full file sets (such as the file ordering/sorting options) can’t be used. The straightforward “delete any duplicates as fast as possible” case gets a little better, but anything much more complex becomes impossible. The performance boost is usually not worth it, because at best it only avoids a few extra file comparisons. It’s a tempting feature, but the risks and the added complexity for corner cases outweigh the benefits, so I’m never planning to do this.

Comparing final file blocks after first blocks

There are two reasons not to do this. The bigger one is that I’ve run tests on large data sets and found that the last blocks of a pair of files tend to match if the first blocks match, so this check won’t fast-exclude the vast majority of file pairs seen in the wild. The secondary reason is that jumping from the first block to the last block of a file (particularly a large file) on a mechanical disk or disk array incurs a big penalty in the form of three extra (and possibly very long) disk head seeks for every file pair being checked. This is less of an issue on a solid-state drive, but remember that bit about most files having identical end blocks if they have identical start blocks? It’s a good idea that only slows things down in practical application on real-world data. Just for an added sting, jdupes uses an optimization where the first block’s hash is not redone when hashing the full file, but the hash of a final block is not reusable in the same way, so that work would end up being done twice whenever a full-file match check is needed.
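Here’s a small sketch of that reuse property using xxHash64’s streaming interface (illustrative only; the block size and data are made up, and this is not the actual jdupes hashing code). The state that produced the first-block hash simply keeps consuming the rest of the file, while a last-block hash would have no state to continue from:

```c
/* partial_hash_sketch.c - a hedged illustration (not the actual jdupes code) of
 * why a first-block hash can be extended into a full-file hash while a
 * last-block hash cannot.
 * Build (assuming libxxhash is installed): cc partial_hash_sketch.c -lxxhash
 */
#include <stdio.h>
#include <string.h>
#include <xxhash.h>

#define BLOCK 4096

int main(void)
{
	static char data[BLOCK * 4];          /* pretend this is a whole file */
	XXH64_state_t *state = XXH64_createState();
	XXH64_hash_t partial, full;

	if (state == NULL) return 1;
	memset(data, 'x', sizeof(data));

	/* Pass 1: hash only the first block and remember the result. */
	XXH64_reset(state, 0);
	XXH64_update(state, data, BLOCK);
	partial = XXH64_digest(state);        /* digest does not disturb the state */

	/* Pass 2: because xxHash64 is a streaming hash, the same state can simply
	 * keep consuming the rest of the file; the first block is never re-read
	 * or re-hashed.  A hash of the *last* block has no such continuation:
	 * the bytes before it were never fed in, so a full-file check would have
	 * to read and hash that block all over again. */
	XXH64_update(state, data + BLOCK, sizeof(data) - BLOCK);
	full = XXH64_digest(state);

	printf("first-block hash: %016llx\nfull-file hash:   %016llx\n",
	       (unsigned long long)partial, (unsigned long long)full);
	XXH64_freeState(state);
	return 0;
}
```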

Comparing median blocks

The rationale is similar to comparing final blocks, but slightly different. The seeks are often shorter and the chances of rejection somewhat higher with median blocks, but all of the problems outlined for final blocks are still present. The other issue is that median blocks require a tiny bit of extra work to figure out which block is the median block for a given file. It’s added complexity with no real reward for the effort, just like final blocks.
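The bookkeeping itself is tiny; a hypothetical version of it (the block size and function name are made up for illustration) would look something like this, and the real cost is not the arithmetic but yet another seek target per file:

```c
/* median_block.c - a trivial sketch of the extra bookkeeping median-block
 * checking would need: figure out where the middle block starts before
 * seeking there.  Names and block size are illustrative only.
 */
#include <stdio.h>

#define BLOCK 4096

/* Offset of the block containing the byte at size/2 */
static long long median_block_offset(long long size)
{
	if (size <= BLOCK) return 0;           /* small files: just the one block */
	return ((size / 2) / BLOCK) * BLOCK;   /* round down to a block boundary */
}

int main(void)
{
	long long sizes[] = { 1000, 8192, 1000000, 5368709120LL };

	for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
		printf("file of %lld bytes: median block starts at %lld\n",
		       sizes[i], median_block_offset(sizes[i]));
	return 0;
}
```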

Cache hashes across runs

This is actually a planned feature, but there are massive pitfalls in caching file hashes. Every loaded hash would have to be checked against a list of files that actually exist, which takes considerable work. There is a risk that a file’s contents were modified without the cache being updated. File path relativity is an issue that can get ugly. Where do you store the database, and in what format? How do you decide to invalidate cache entries? The xxHash64 fast hash algorithm might also not be strong enough for persistent hashes to be reasonably safe to use, which would mean a return to the slowness of secure hash algorithms and the performance loss that comes with them. It’s low-hanging fruit and an extremely tempting way to speed things up, but the devil is in the details, and it’s a really mean devil. For now, it’s better to simply not have this around.
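To show how much validation even a single cached entry needs, here’s a purely hypothetical sketch (none of these structures or names exist in jdupes, and a real design would need far more than this): every entry costs at least a stat() call, and even then a content change that preserves size and mtime slips through.

```c
/* hashdb_sketch.c - a purely hypothetical sketch of the kind of validity check
 * a persistent hash cache would need for every entry; these structures and
 * names do not exist in jdupes.
 * Build: cc hashdb_sketch.c -o hashdb_sketch
 */
#include <stdio.h>
#include <stdint.h>
#include <sys/stat.h>

/* Hypothetical on-disk cache record */
struct cache_entry {
	char path[4096];     /* absolute?  relative to what?  (the ugly part) */
	uint64_t hash;       /* cached partial or full hash */
	int64_t size;        /* file size when the hash was taken */
	int64_t mtime;       /* modification time when the hash was taken */
};

/* Returns 1 if the cached hash may still be trusted, 0 if it must be redone.
 * Even this cheap check is a stat() call per entry, and it still cannot catch
 * a file whose contents changed without its size or mtime changing. */
static int cache_entry_valid(const struct cache_entry *e)
{
	struct stat st;

	if (stat(e->path, &st) != 0) return 0;            /* file is gone */
	if (st.st_size != e->size) return 0;              /* size changed */
	if ((int64_t)st.st_mtime != e->mtime) return 0;   /* likely modified */
	return 1;
}

int main(void)
{
	struct cache_entry e = { "/tmp/example.dat", 0, 0, 0 };

	printf("entry %s: %s\n", e.path,
	       cache_entry_valid(&e) ? "reusable" : "must re-hash");
	return 0;
}
```

Multiply that by every entry in the database, add the path and invalidation questions above, and the “free” speedup starts to look a lot less free.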

Those are just a few of the ways that jdupes can’t be so easily sped up. Do you have any more “bad” ideas that come to mind?
