Tag: Linux

Cute cloned dogs

A CHALLENGER APPEARS: “fclones”…fastest duplicate scanner ever? It’s complicated.

While perusing my link referrals on GitHub, I noticed this thread where my duplicate scanner jdupes was mentioned. I then noticed the comment below it:

There is also a much faster modern alternative to fdupes and jdupes: fclones. It searches for files in parallel and uses a much faster hash function than md5.

My response comment pretty much says it all, so I’m making that the entire remainder of this post.

I noticed that fclones does not do the byte-for-byte safety check that jdupes (and fdupes) does. It also relies exclusively on a non-cryptographic hash for comparisons. It is unsafe to rely on a non-cryptographic hash as a substitute for the file data, and comparisons between duplicate finders running in full-file comparison mode and those running in hash-and-compare mode are not appropriate. The benchmark on the fclones page ran jdupes 1.14 without the -Q option that disables the final byte-for-byte confirmation, so jdupes was doing a lot of extra work to avoid potential data loss while fclones skipped that work entirely.
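
If you want that safety net on top of a hash-only scanner, a final byte-for-byte pass can be bolted on with standard tools. Here’s a minimal sketch, assuming a hypothetical pairs.txt containing one tab-separated candidate pair per line (the actual output format of any given scanner will differ):

# Confirm each candidate pair with cmp, which compares the actual file contents byte for byte
while IFS=$'\t' read -r a b; do
    if cmp -s -- "$a" "$b"; then
        printf 'confirmed duplicate: %s and %s\n' "$a" "$b"
    else
        printf 'NOT identical (hash collision or changed file): %s and %s\n' "$a" "$b" >&2
    fi
done < pairs.txt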

jdupes already uses a faster hash function than MD5 (xxHash64 as of this writing, previously jodyhash), and it would be fairly trivial to switch to even faster hash functions if desired…but the fact is that once you switch from a cryptographic hash to any “fast hash” function, the hash is rarely the bottleneck, especially compared to the I/O bottleneck of most consumer-grade hard drives and low-end SSDs. If everything to be checked is already in the buffer cache then hashing speed might matter, but the vast majority of duplicate scans are performed on data that is not cached.
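
If you want to see whether the hash could ever be your bottleneck on a given machine, compare hashing throughput against raw disk read throughput. A rough sketch, assuming the xxhsum command-line tool from the xxHash package is installed and /dev/sda is the disk of interest (the raw-device read needs root):

# Create a 1 GiB test file; it stays in the page cache, so hashing it measures
# pure hash throughput rather than disk speed.
dd if=/dev/zero of=/tmp/hashtest bs=1M count=1024
time xxhsum /tmp/hashtest
time md5sum /tmp/hashtest

# Compare against uncached sequential read speed from the actual disk.
dd if=/dev/sda of=/dev/null bs=1M count=1024 iflag=direct

If the hashing time is a small fraction of the raw read time, a faster hash buys you essentially nothing once real I/O is involved.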

Searching for files in parallel is only an advantage if the disk I/O is not a bottleneck, and you’ll notice that the fclones author performed the dedupe benchmarks on a (presumably very fast, since it’s paired with a relatively recent Xeon) 512GB NVMe SSD with an extremely fast multi-core, multi-threaded processor. There is a very small access time penalty for random read I/O on a fast NVMe SSD, but there is an extremely large access time penalty for random read I/O on a traditional rotating hard drive or a RAID array composed of several hard drives. Multiple threads firing off reads against the same array at the same time will slow even most RAID arrays to a single-digit MB/sec death crawl. I understand that many people will be working with SSDs and some duplicate scanner programs will be a better choice for SSDs, but the majority of computer systems still have spinning rust instead of flash-based disks.
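
The effect is easy to demonstrate with fio if it happens to be installed; this is only a sketch, and /mnt/array is a placeholder for a directory on the spinning-disk array being tested:

# One sequential reader versus eight parallel random readers against the same array.
# On a fast SSD the second job scales nicely; on a single hard drive or a small
# RAID array it typically collapses to a fraction of the sequential figure.
fio --name=seq1  --directory=/mnt/array --size=1G --bs=1M --rw=read     --numjobs=1 --direct=1 --group_reporting
fio --name=rand8 --directory=/mnt/array --size=1G --bs=1M --rw=randread --numjobs=8 --direct=1 --group_reporting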

It is strongly advisable to (A) run your own benchmarks on your specific workload and hardware, and (B) understand how to use the program within your own personal acceptable level of risk. Both of these are different for every different person’s needs.

UPDATE: I found another instance of the fclones author claiming that jdupes being single-threaded makes it slow; to quote directly:

Unfortunately these older programs are single-threaded, and on modern hardware (particularly on SSDs) they are order of magnitude slower than they could be. If you have many files, a better option is to use fclones (disclaimer: I’m the author), which uses multiple threads to process many files in parallel and offers additional filtering options to restrict the search.

The points I’ve made above still stand. Unless you’re running the author’s enterprise-grade high-end hardware, your disk random access latency is your major limiting factor. I’d love to see what fclones does on something like a 24TB disk array. I’d wager, exactly as stated above, that 8 or 32 simultaneous I/O threads would bring the whole process to a death crawl. Perhaps I should bite the bullet and run the tests myself.

UPDATE 2: I was right. Benchmark article and analysis forthcoming.

Featured image Licensed under CC-BY from Steve Jurvetson, https://www.flickr.com/photos/jurvetson/3327872958

Is “dupd” really faster than my duplicate scanner “jdupes?”

UPDATE: I posted this and went to sleep. After I woke up, my inbox was full of issue report and pull request comments and closings. dupd now supports relative path arguments, has a larger I/O block size, and I patched a bug that caused total failure on XFS. That’s awesome!

Why am I picking on dupd?

I have written about my duplicate scanner “jdupes” in a previous post and have spent a lot of time enhancing it. I have run across two different blog posts about duplicate file finders that compare jdupes with another scanner called “dupd” and proclaim dupd the clear speed winner. One post is by Eliseo Papa, comparing jdupes (when it was briefly called “fdupes-jody”) against dupd and showing that dupd was insanely faster but used significantly more CPU and RAM. The other is by the dupd developer Jyri J. Virkki, using a much newer version of jdupes than Eliseo did, and finding that a home directory with over 160,000 files was scanned by jdupes in roughly double the time dupd needed. Eliseo was using what today would be considered a very old and slow version of my program. Eliseo also tested a lot of scanners and didn’t provide actual commands or methodology beyond a simple overview, though that’s completely understandable given how many programs were involved and how differently each one works. I’m more interested in Jyri’s results; Jyri provides actual command usage, so I can attempt to replicate those results on my own systems. If dupd is double the speed of jdupes, I’d love to know how it’s being done, but first I must run some tests to verify that this is true.

Please note that I’m not really “picking on” dupd. It has been declared better than my own duplicate scanner by a third party and by the developer. There’s some pretty brilliant programming work in dupd and I have thoroughly enjoyed reading the code. This is intended to be a respectful rebuttal due to a combination of my own pride in my work and some obvious issues with Jyri’s testing. With that said, let’s discuss!

Methodology

Here’s a brief overview of the rest of this post.

  1. Use dupd and understand how to use it
  2. Explain the problem with Jyri’s tests
  3. Compile the latest version of both programs (GitHub master source)
  4. Detail of test system and test data
  5. Attempt to replicate Jyri’s results with a large data set on one of my machines
  6. Run my own benchmark tests to see which one is actually faster for me
  7. Final thoughts

Use and understand dupd

I ran into some immediate problems and quirks with how dupd works. Within about 15 minutes, I realized that dupd is a very different beast from jdupes. Where jdupes operates on the specified directories and saves no state anywhere between program runs, dupd maintains an SQLite database in your home directory (by default) with the information it finds during its file scans. I tested dupd somewhere that had a directory with a backtick in the name and dupd skipped it; dupd uses the backtick as its SQLite path separator character and auto-ignores anything containing one to avoid potential problems with SQLite queries (--pathsep provides a workaround, but this is still not quite ideal). dupd also doesn’t provide a way to ignore hard-linked file pairs, so if that matters, it’s up to the user to check the device/inode number pairs of dupd’s reported duplicate files with a command like stat.
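
Until dupd grows such an option, weeding hard-linked pairs out of its results has to be done by hand; a crude sketch with stat, where the two paths stand in for files dupd reported as duplicates:

# Files that share the same device:inode pair are hard links, not duplicated data
stat -c '%d:%i  %n' /path/to/reported/dupe1 /path/to/reported/dupe2

If the first column matches for both paths, the “duplicates” are really the same file on disk.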

The most painful flaw I’ve found in dupd (and the one that gave me the most difficulty) is its restrictive file path specification. With jdupes I can arbitrarily work with multiple directories and use shell wildcard expansion to pick them out, but dupd only takes its file paths two ways: the implicit default of the current directory, or one or more absolute paths (starting from /) which must each be prefixed with the -p or --path option. I can type something like “cd /home/user/food; jdupes -nr A* B*” but dupd requires that I type out e.g. “dupd scan --path /home/user/food/Apple --path /home/user/food/Avocado --path /home/user/food/Bacon --path /home/user/food/Broccoli” to selectively scan the same data set. For this reason, I restricted my testing to a single directory, passing nothing to dupd and passing a single dot to jdupes.

Why Jyri’s tests are flawed

There are some flaws that make Jyri’s “dupd vs. jdupes” tests unfair, mostly due to mismatched overhead:

  • dupd’s use of SQLite adds overhead; to perform an apples-to-apples comparison, the --nodb switch is mandatory to disable this feature and dump the duplicate list to stdout just like jdupes (a matched pair of example commands follows this list)
  • Jyri’s test performed a duplicate scan with dupd which produced no output (no --nodb option and no run of “dupd report”) whereas jdupes produced full output
  • In addition to producing output, jdupes dumped that output to a file on disk
  • The jdupes progress indicator wasn’t disabled; progress indication blocks while printing to the terminal, slowing execution significantly for the sake of user-friendliness
  • Jyri’s tests were only run on cached data (high CPU, run times didn’t deviate) which tests algorithm speed but not performance when reading uncached data on a rotating disk
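
For reference, a closer apples-to-apples pairing would look something like the following (the path is a placeholder); both programs run with progress and database features disabled, both dump their output to /dev/null, and both start from a cold cache:

sync; echo 3 > /proc/sys/vm/drop_caches             # as root: start from a cold cache
time jdupes -r -q /home/user >/dev/null             # -q disables the progress indicator
time dupd scan --nodb -q -p /home/user >/dev/null   # --nodb skips the SQLite database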

Compile the latest versions of each

For the record, I used dupd at commit 8f6ea02 and jdupes at commit 14d92bc in this test. If you want to attempt to replicate my results, you’ll need these exact revisions.

Test system and test data set

The system in use has an AMD A8-7600 quad-core CPU, 8GB of DDR3-1600 RAM, and 3x3TB 7200RPM SATA-6G hard drives. The drives are configured as a Linux md raid5 using the XFS filesystem (mounted with ‘noatime’). The deadline I/O scheduler is used for all physical devices. Maximum raw streaming read (measured with ‘pv’) on this array tops out at 374 MiB/sec near the beginning of the array and is about half as fast at the end. The array contains an “aged filesystem” which has some file fragmentation and probably has significantly less than ideal distribution of files relative to one another, making it an excellent choice for testing the program differences in a less artificial manner.

The test system is a dedicated, isolated system. It is not running anything other than the benchmarks. The CPU scaling governor is set to “performance” to lock the CPU at maximum speed and remove deviations caused by CPU frequency changes.
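
For anyone replicating the setup, the streaming-read measurement and the governor lock were done with commands along these lines, run as root (the md device name is whatever your array happens to be):

# Measure raw sequential read throughput from the md array
pv < /dev/md0 > /dev/null

# Lock every core at maximum frequency for the duration of the tests
for g in /sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor; do
    echo performance > "$g"
done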

The first data set is a large collection of over 48,000 files that are stored in 135 subdirectories with no further subdirectories of their own. The average file size is 495 KB, while the smallest file is 925 bytes and the largest file is 3.3 MB. Total size is 23 GB. There are five hard-linked sets of files between them, but none are duplicates otherwise, so this is sort of a worst-case scenario because all actual duplicate scan comparison work is wasted effort.

The second data set is a smaller set of over 31,000 files in 12 subdirectories and 762 total directories (find -type d | wc -l). The average file size is 260 KB, the smallest file is 40 bytes while the largest file is 6.3 MB. Total size is 7.8 GB. There are no hard linked pairs and there are 208 duplicate files in 39 duplicate sets, occupying 23 MB.

Attempt to replicate Jyri’s tests

Jyri ran the following commands:

repeat 5 time ./jdupes -r $HOME -A  > out
repeat 5 time dupd scan -p $HOME -q

To normalize the results, I read all the files first so they would populate the disk cache, as Jyri seems to have done. I ran the Bash equivalent of these commands on my data sets as my first tests. I also ran

$ export TIMEFORMAT='%Us user %Ss system %P%% cpu %E total'

to get timing output similar to Jyri’s time measurement output. I have erased the jdupes progress output beyond the first run for aesthetic reasons.
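
Warming the cache was nothing fancy; something like the following, run from the top of the data set, reads every file once so that the subsequent runs are served from RAM:

find . -type f -exec cat {} + > /dev/null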

Data set 1

$ X=0; while [ $X -lt 5 ]; do time jdupes -r . -A > out; X=$((X + 1)); done
Examining 48271 files, 135 dirs (in 1 specified)
0.043s user 0.060s system 96.47% cpu 0.107 total
0.027s user 0.080s system 97.53% cpu 0.109 total
0.043s user 0.060s system 95.55% cpu 0.108 total
0.040s user 0.067s system 97.02% cpu 0.110 total
0.037s user 0.067s system 95.76% cpu 0.108 total

$ X=0; while [ $X -lt 5 ]; do time dupd scan -q; X=$((X + 1)); done
0.110s user 0.087s system 36.43% cpu 0.540 total
0.120s user 0.077s system 67.83% cpu 0.290 total
0.117s user 0.077s system 67.13% cpu 0.288 total
0.110s user 0.083s system 67.80% cpu 0.285 total
0.120s user 0.077s system 53.00% cpu 0.371 total

Even with Jyri’s switch choices, which are biased towards dupd, dupd is still 2.5 times slower (or worse) than jdupes here. Surprisingly, the wild deviation in total time seen above happened in nearly every run; the most consistent runs hovered around 0.300 seconds total. This test implies that when there are almost no duplicates, dupd wastes significantly more time than jdupes.

Data set 2

$ X=0; while [ $X -lt 5 ]; do time jdupes -r . -A > out; X=$((X + 1)); done
Examining 31663 files, 762 dirs (in 1 specified)
0.037s user 0.060s system 99.88% cpu 0.097 total
0.040s user 0.053s system 67.59% cpu 0.138 total
0.027s user 0.067s system 97.13% cpu 0.096 total
0.020s user 0.073s system 96.78% cpu 0.096 total
0.027s user 0.067s system 96.71% cpu 0.096 total
$ X=0; while [ $X -lt 5 ]; do time dupd scan -q; X=$((X + 1)); done
0.090s user 0.077s system 79.69% cpu 0.209 total
0.100s user 0.070s system 82.49% cpu 0.206 total
0.107s user 0.067s system 60.84% cpu 0.285 total
0.090s user 0.080s system 80.92% cpu 0.210 total
0.097s user 0.077s system 81.33% cpu 0.213 total

I chose the most favorable run for the dupd results; the highest total time reported was 0.561 but similar inconsistency to data set 1 was also present here. Regardless, dupd is still over two times slower than jdupes on this data set.

My benchmarks of jdupes vs. dupd

I wanted to run thorough tests on both programs to compare both algorithmic and real-world performance. I also wanted to see how the use of SQLite affects dupd’s performance since that is a major difference and could be adding some overhead. I ran one set of tests with all scanned file data already in the block cache to test the algorithm’s raw speed (no disk access other than possibly dupd’s SQLite accesses) and another with drives synced and disk caches dropped prior to each run using “sync; echo 3 > /proc/sys/vm/drop_caches”. I tested dupd with and without the --nodb option to see if it made any difference. I turned off all forms of jdupes output except printing the final duplicate list. All output from both programs was redirected to /dev/null to eliminate any waiting caused by I/O blocking on terminals or disks. The dupd SQLite database was purged before each run.

Cached (algorithmic) performance: data set 1

$ X=0; while [ $X -lt 5 ]; do time jdupes -Arq . >/dev/null; X=$((X + 1)); done
0.047s user 0.053s system 98.33% cpu 0.102 total
0.053s user 0.047s system 99.30% cpu 0.101 total
0.033s user 0.067s system 98.95% cpu 0.101 total
0.043s user 0.053s system 96.32% cpu 0.100 total
0.027s user 0.073s system 99.09% cpu 0.101 total
$ X=0; while [ $X -lt 5 ]; do rm -f ~/.dupd_sqlite; sync; time dupd scan -q >/dev/null 2>/dev/null; X=$((X + 1)); done
0.087s user 0.110s system 91.32% cpu 0.215 total
0.110s user 0.087s system 91.61% cpu 0.215 total
0.107s user 0.090s system 90.56% cpu 0.217 total
0.120s user 0.077s system 68.85% cpu 0.286 total
0.117s user 0.077s system 92.83% cpu 0.208 total
$ X=0; while [ $X -lt 5 ]; do rm -f ~/.dupd_sqlite; sync; time dupd scan -q --nodb >/dev/null 2>/dev/null; X=$((X + 1)); done
0.100s user 0.093s system 140.85% cpu 0.137 total
0.117s user 0.073s system 140.16% cpu 0.136 total
0.110s user 0.080s system 140.22% cpu 0.136 total
0.100s user 0.090s system 139.95% cpu 0.136 total
0.113s user 0.077s system 140.69% cpu 0.135 total

Wow! dupd’s performance went up significantly with the --nodb option! This seems to prove that SQLite is slowing things down for these one-shot runs. Even with the 30+% performance boost, the dupd algorithm still seems to be 35% slower than jdupes despite dupd having the obvious advantage of multi-threaded execution.

Cached (algorithmic) performance: data set 2

 $  X=0; while [ $X -lt 5 ]; do time jdupes -Arqm . >/dev/null; X=$((X + 1)); done
0.027s user 0.063s system 99.29% cpu 0.091 total
0.037s user 0.050s system 95.70% cpu 0.091 total
0.040s user 0.047s system 95.93% cpu 0.090 total
0.033s user 0.057s system 99.29% cpu 0.091 total
0.033s user 0.053s system 96.08% cpu 0.090 total
$ X=0; while [ $X -lt 5 ]; do rm ~/.dupd_sqlite; sync; time dupd scan -q >/dev/null 2>/dev/null; X=$((X + 1)); done
0.090s user 0.080s system 58.91% cpu 0.289 total
0.107s user 0.063s system 82.97% cpu 0.205 total
0.110s user 0.060s system 65.64% cpu 0.259 total
0.110s user 0.060s system 47.40% cpu 0.359 total
0.100s user 0.073s system 59.90% cpu 0.289 total
$ X=0; while [ $X -lt 5 ]; do rm -f ~/.dupd_sqlite; sync; time dupd scan -q --nodb >/dev/null 2>/dev/null; X=$((X + 1)); done
0.113s user 0.053s system 136.57% cpu 0.122 total
0.103s user 0.063s system 136.45% cpu 0.122 total
0.107s user 0.060s system 138.46% cpu 0.120 total
0.097s user 0.067s system 136.78% cpu 0.119 total
0.090s user 0.077s system 137.49% cpu 0.121 total

For some reason, this data set significantly slowed down the SQLite-enabled runs of dupd; they’re now about three times slower than jdupes. The major speedup with --nodb appears again, in line with the roughly one-third-slower-than-jdupes performance seen in the previous test.

Real-world performance: data set 1

# X=0; while [ $X -lt 5 ]; do sync; echo 3 > /proc/sys/vm/drop_caches; time jdupes -Arq . >/dev/null; X=$((X + 1)); done
 0.210s user 0.790s system 1.52% cpu 65.432 total
 0.190s user 0.730s system 1.39% cpu 66.036 total
 0.137s user 0.780s system 1.39% cpu 65.574 total
 0.230s user 0.813s system 1.59% cpu 65.361 total
 0.167s user 0.800s system 1.47% cpu 65.444 total
 # X=0; while [ $X -lt 5 ]; do rm -f ~/.dupd_sqlite; sync; echo 3 > /proc/sys/vm/drop_caches; time dupd scan -q >/dev/null 2>/dev/null; X=$((X + 1)); done
0.293s user 0.890s system 1.74% cpu 67.809 total
0.273s user 0.960s system 1.81% cpu 67.766 total
0.243s user 0.873s system 1.64% cpu 68.069 total
0.247s user 0.910s system 1.70% cpu 67.965 total
0.267s user 0.827s system 1.61% cpu 67.830 total
# X=0; while [ $X -lt 5 ]; do rm -f ~/.dupd_sqlite; sync; echo 3 > /proc/sys/vm/drop_caches; time dupd scan -q --nodb >/dev/null 2>/dev/null; X=$((X + 1)); done
0.273s user 0.767s system 1.54% cpu 67.403 total
0.253s user 0.930s system 1.75% cpu 67.448 total
0.233s user 0.817s system 1.55% cpu 67.409 total
0.223s user 0.890s system 1.63% cpu 67.908 total
0.273s user 0.987s system 1.83% cpu 68.511 total

AHA! Now that we’re testing on real disks with a large uncached workload, the story is very different. Neither jdupes nor dupd will make your hard drives faster. While dupd is still slower, the extra two or three seconds pale in comparison to the disk I/O time, bringing the performance gap down to an insignificant ~3.7%. Note that --nodb didn’t make a significant difference this time either; no one will ever notice a 0.3-0.4 second speedup.

Real-world performance: data set 2

# X=0; while [ $X -lt 5 ]; do sync; echo 3 > /proc/sys/vm/drop_caches; time jdupes -Arq . >/dev/null; X=$((X + 1)); done
 0.140s user 0.610s system 1.40% cpu 53.333 total
 0.147s user 0.717s system 1.56% cpu 55.284 total
 0.160s user 0.683s system 1.52% cpu 55.268 total
 0.173s user 0.710s system 1.60% cpu 54.890 total
 0.157s user 0.697s system 1.55% cpu 54.968 total
# X=0; while [ $X -lt 5 ]; do rm -f ~/.dupd_sqlite; sync; echo 3 > /proc/sys/vm/drop_caches; time dupd scan -q >/dev/null 2>/dev/null; X=$((X + 1)); done
0.227s user 0.730s system 1.69% cpu 56.389 total
0.220s user 0.817s system 1.82% cpu 56.755 total
0.250s user 0.760s system 1.77% cpu 56.785 total
0.230s user 0.773s system 1.79% cpu 56.044 total
0.223s user 0.750s system 1.73% cpu 56.241 total
# X=0; while [ $X -lt 5 ]; do rm -f ~/.dupd_sqlite; sync; echo 3 > /proc/sys/vm/drop_caches; time dupd scan -q --nodb >/dev/null 2>/dev/null; X=$((X + 1)); done
0.293s user 0.750s system 1.83% cpu 56.803 total
0.237s user 0.727s system 1.71% cpu 56.052 total
0.247s user 0.773s system 1.82% cpu 55.988 total
0.177s user 0.873s system 1.86% cpu 56.439 total
0.217s user 0.820s system 1.83% cpu 56.409 total

Data set 1 and data set 2 show the same pattern: when disk access overhead is included in testing, the algorithmic speed differences don’t amount to a significant difference overall.

Real-world performance: data set 3 (bonus round)

Based on the tight match with lots of files generally around 0.5 MB in size, I decided to do some extra real-world tests. Data set 3 is all of my personal photography, scanned family photos, and video clips from a combination of my various Android phones and my Canon T1i DSLR. Statistics: over 18,500 files in 395 directories, average size 2.6 MB, smallest file 0 bytes, largest 552 MB, 47 GB total, no duplicates or hard links at all.

# X=0; while [ $X -lt 5 ]; do sync; echo 3 > /proc/sys/vm/drop_caches; time jdupes -Arq . >/dev/null; X=$((X + 1)); done
0.030s user 0.243s system 2.02% cpu 13.483 total
0.043s user 0.247s system 2.14% cpu 13.542 total
0.040s user 0.230s system 2.03% cpu 13.245 total
0.030s user 0.230s system 1.95% cpu 13.265 total
0.040s user 0.223s system 1.97% cpu 13.308 total
# X=0; while [ $X -lt 5 ]; do rm -f ~/.dupd_sqlite; sync; echo 3 > /proc/sys/vm/drop_caches; time dupd scan -q >/dev/null 2>/dev/null; X=$((X + 1)); done
0.060s user 0.267s system 2.59% cpu 12.601 total
0.063s user 0.267s system 2.63% cpu 12.510 total
0.057s user 0.277s system 2.67% cpu 12.447 total
0.057s user 0.273s system 2.65% cpu 12.432 total
0.053s user 0.273s system 2.62% cpu 12.447 total
# X=0; while [ $X -lt 5 ]; do rm -f ~/.dupd_sqlite; sync; echo 3 > /proc/sys/vm/drop_caches; time dupd scan -q --nodb >/dev/null 2>/dev/null; X=$((X + 1)); done
0.080s user 0.257s system 2.67% cpu 12.566 total
0.080s user 0.253s system 2.68% cpu 12.428 total
0.053s user 0.253s system 2.47% cpu 12.371 total
0.053s user 0.250s system 2.43% cpu 12.445 total
0.093s user 0.240s system 2.68% cpu 12.432 total

The plot thickens. On this data set with no duplicates, no hard links, and a wide variety of file sizes, jdupes is slower than dupd by ~6.4%, a very different result than with the huge sets of smaller files. It’s still not a large real-world variance; no one will miss the 0.8 second delay, but it is a very interesting result because it contradicts every previous test that said dupd was the slower choice. I also find it interesting that the gap between dupd with and without --nodb closes up on this data set.

Real-world performance: data set 4 (bonus round)

This is a superset of data set 1. It’s the same kind of data, but it’s an absolutely massive set with much more variance and no shortage of duplicates, both hard-linked and still duplicated. Stats: over 900,000 files in 12,599 directories, average size 527 KB, smallest file 37 bytes, largest file 55 MB, lots of hard links and duplicates. After an entire day at work yielded only three finished runs of jdupes (not unusual for this data set), I decided that two days’ worth of not touching the machine to accommodate testing was too inconvenient, so I ultimately finished only one dupd test.

# X=0; while [ $X -lt 5 ]; do sync; echo 3 > /proc/sys/vm/drop_caches; time jdupes -Arq . >/dev/null; X=$((X + 1)); done
18.750s user 87.107s system 1.35% cpu 7818.862 total
18.047s user 86.247s system 1.32% cpu 7894.973 total
18.400s user 87.173s system 1.33% cpu 7890.817 total
# X=0; while [ $X -lt 5 ]; do rm -f ~/.dupd_sqlite; sync; echo 3 > /proc/sys/vm/drop_caches; time dupd scan -q >/dev/null 2>/dev/null; X=$((X + 1)); done
25.380s user 104.373s system 1.32% cpu 9804.224 total
^C

Why was dupd 2000 seconds slower than jdupes? I can only speculate. One possibility is that the multi-thread behavior of the program results in extra disk thrashing. My understanding is that dupd uses one thread to read metadata and another thread to perform hash and compare work; if directory reads and file reads happen at the same time (they reach the I/O queue in succession) the disk will have to thrash between those two locations to fulfill the I/O requests. Perhaps the deadline I/O scheduler and dupd don’t work well together. Perhaps SQLite overhead plays a big role. I know that 900,000 files are probably scattered across the disk from one another, so it could be that the always-sequential jdupes file access characteristics plus the large I/O block size (jdupes processes files in 1 MiB chunks at a time) reduce disk thrashing for this data; indeed, with an average file size of well under 1 MiB, an entire file is almost always read in one sequential shot.
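
The read-size part of that theory is easy to poke at in isolation with dd; a rough sketch, with fileA and fileB standing in for two large files that live far apart on the same array:

# Two readers interleaving 8 KiB requests force far more head movement than two
# readers pulling 1 MiB at a time; drop the caches before each pair and compare.
sync; echo 3 > /proc/sys/vm/drop_caches
time ( dd if=fileA of=/dev/null bs=8k & dd if=fileB of=/dev/null bs=8k & wait )
sync; echo 3 > /proc/sys/vm/drop_caches
time ( dd if=fileA of=/dev/null bs=1M & dd if=fileB of=/dev/null bs=1M & wait )

How dramatic the difference is depends entirely on where the files sit on the platters, which is exactly the point.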

Thoughts and conclusions

As far as bragging rights go, jdupes has a superior raw algorithm but dupd can still beat it marginally in real-world tests. Duplicate scanning is rarely performed on files that are 100% cached in RAM, so a program’s disk access characteristics relative to the data set being scanned can be more important than being double the speed in unrealistic algorithm benchmarks. In real-world testing with no pre-cached disk blocks available to artificially soften the huge performance hit of disk thrashing, jdupes and dupd are close enough in performance that neither is a “winner.” They generally seem to do their jobs at about the same speed on most of the data I’ve tested against…so which one should you use? The answer is very simple: use whichever tool you’re comfortable with that does what you want to do. I am partial to jdupes for obvious reasons and I have explained what I don’t like about dupd usage and behavior, but I’m also aware of some glaring flaws in how jdupes works as well. It’s a lot like deciding between two different claw hammers to bang on a nail: you can argue about the minor differences in appearance, dimensions, and weight all day long, but most users just have the need to hit nails and don’t have any reason to care which hammer they use to get the job done.

I’ve traced the terrible dupd performance on my huge data set down to dupd reading 8 KiB of each file at a time versus jdupes reading 1024 KiB (1 MiB) at a time, a difference that almost certainly reduces I/O slowdowns from disk thrashing. I’ve submitted a few bug fixes and pull requests for dupd to help Jyri keep track of the issues I’ve experienced and to improve dupd, including a mention of the bad performance due to the read block size. I also learned some tricks from the dupd source code that I hadn’t thought of before; they may not be useful in jdupes development, but I feel enriched as a programmer from my experience with the program and its code and I’ve tried to contribute back in exchange. Everyone wins!

Isn’t that what open source software is all about?

Finding Duplicates Faster: The story of ‘jdupes’, or how I unexpectedly became a better programmer

The problem of finding and handling duplicate files has been with us for a long time. Since the end of the year 1999, the de facto answer to “how can I find and delete duplicate files?” for Linux and BSD users has been a program called ‘fdupes’ by Adrian Lopez. This venerable staple of system administrators is extremely handy when you’re trying to eliminate redundant data to reclaim some disk space, clean up a code base full of copy-pasted files, or delete photos you’ve accidentally copied from your digital camera to your computer more than once. I’ve been quite grateful to have it around–particularly when dealing with customer data recovery scenarios where every possible copy of a file is recovered and the final set ultimately contains thousands of unnecessary duplicates.

Unfortunately, development on Adrian’s fdupes had, for all practical purposes, ground to a halt. From June 2014 to July 2015, the only significant functional change to the code was a modification to compile on Mac OS X. The code’s stagnant nature has definitely shown itself in real-world tests; in February 2015, Eliseo Papa published “What is the fastest way to find duplicate pictures?”, which contains benchmarks of 15 duplicate file finders (including an early version of my fork, which we’ll ignore for the moment), places the original fdupes dead last in operational speed, and shows it to be heavily CPU-bound rather than I/O-bound. In fact, Eliseo’s tests say that fdupes takes a minimum of 11 times longer to run than 13 of the other duplicate file finders in the benchmark!

As a heavy user of the program on fairly large data sets, I had noticed the poor performance of the software and became curious as to why it was so slow for a tool that should simply be comparing pairs of files. After inspecting the code base, I found a number of huge performance killers:

  1. Tons of time was wasted waiting on progress to print to the terminal
  2. Many performance-boosting C features weren’t used (static, inline, etc)
  3. A couple of one-line functions were very “hot,” adding heavy call overhead
  4. Using MD5 for file hashes was slower than other hash functions
  5. Storing MD5 hashes as strings instead of binary data was inefficient
  6. A “secure” hash like MD5 isn’t needed; matches get checked byte-for-byte

I submitted a pull request to the fdupes repository which solved these problems in December 2014. Nothing from the pull request was discussed on GitHub and none of the fixes were incorporated into fdupes. I emailed Adrian to discuss my changes with him directly and there was some interest in certain changes, but in the end nothing was changed and my emails became one-way.

It seemed that fdupes development was doomed to stagnation.

In the venerable tradition of open source software, I forked it and gave my new development tree a new name to differentiate it from Adrian’s code: jdupes. I solved the six big problems outlined above with these changes:

  1. Rather than printing progress indication for every file examined, I added a delay counter to drastically reduce terminal printing. This was a much bigger deal when using SSH.
  2. I switched the code and compilation process to use C99 and added relevant keywords to improve overall performance.
  3. The “hot” one-line functions were changed to #define functions to chop function call overhead for them in half.
  4. (Also covers 5 and 6) I wrote my own hash function and replaced all of the MD5 code with it, resulting in a benchmarked speed boost of approximately 17%. The resulting hashes are passed around as a 64-bit unsigned integer, not an ASCII string, which (on 64-bit machines) reduces hash comparisons to a single compare instruction.


After forking all of these changes and enjoying the massive performance boost they brought about, I felt motivated to continue looking for potential improvements. I didn’t realize at the time that a simple need to eliminate duplicate files more quickly would morph into spending the next half-year ruthlessly digging through the code for ways to make things better. Between the initial pull request that led to the fork and Eliseo Papa’s article, I managed to get a lot done.


At this point, Eliseo published his February 19 article on the fastest way to find duplicates. I did not discover the article until July 8 of the same year (at which time jdupes was at least three versions higher than the one being tested), so I was initially disappointed with where jdupes stood in the benchmarks relative to some of the other tested programs, but even the early jdupes (version 1.51-jody2) code was much faster than the original fdupes code for the same job.

1.5 months into development, jdupes was 19 times faster in a third-party test than the code it was forked from.

Nothing will make your programming efforts feel more validated than seeing something like that from a total stranger.

Between the publishing of the article and my finding it, I had continued to make heavy improvements.


When I found Eliseo’s article from February, I sent him an email inviting him to try out jdupes again:

I have benchmarked jdupes 1.51-jody4 from March 27 against jdupes 1.51-jody6, the current code in the Git repo. The target is a post-compilation directory for linux-3.19.5 with 63,490 files and 664 duplicates in 152 sets. A “dry run” was performed first to ensure all files were cached in memory first and remove variances due to disk I/O. The benchmarking was as follows:

$ ./compare_fdupes.sh -nrq /usr/src/linux-3.19.5/
Installed fdupes:
real 0m1.532s
user 0m0.257s
sys 0m1.273s

Built fdupes:
real 0m0.581s
user 0m0.247s
sys 0m0.327s

Five sequential runs were consistently close (about ± 0.020s) to these times.

In half a year of casual spare-time coding, I had made fdupes 32 times faster.

There’s probably not a lot more performance to be squeezed out of jdupes today. Most of my work on the code has settled down into working on new features and improving Windows support. In particular, Windows has supported hard linked files for a long time, and I’ve taken full advantage of Windows hard link support. I’ve also made the progress indicator much more informative to the user. At this point in time, I consider the majority of my efforts complete. jdupes has even gained inclusion as an available program in Arch Linux.

The efforts undertaken in jdupes have benefited other projects as well. For example, I can see the potential for using the string_table allocator in other projects that don’t need to free() string memory until the program exits. Most importantly, working on jdupes has improved my overall programming skills tremendously, and I have learned far more than I could have imagined from improving such a seemingly simple file management tool.

If you’d like to use jdupes, feel free to download one of my binary releases for Linux, Windows, and Mac OS X. You can find them here.

[SOLVED] X terminal emulator doesn’t read .bashrc or .profile or /etc/profile, shows ‘sh’ prompt

This is an ancient problem, but I seem to run into it frequently and even briefly forget it. When you open an X terminal emulator such as Xterm or rxvt, you might be frustrated by the fact that you can’t get a profile or startup script to run in the shell prior to displaying a prompt. Your shell prompt also defaults to something fairly useless like “sh-4.3$” instead of something more informative and fancier.

This occurs because your user account’s default shell is either not explicitly set or is set as /bin/sh rather than /bin/bash. To fix this you’ll need to run the following command as root (i.e. with su or sudo):

usermod -s /bin/bash your_user_name

You’ll need to log out and back in before the change will fully take effect; alternatively, if you’re starting X from a command prompt, you can drop out of X, export SHELL=/bin/bash, and start X again. This can also be done in your .profile or .bashrc as a workaround if you don’t have root access for some reason and don’t use a graphical login manager.
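
To check what your account is currently set to, and to change it yourself where chsh is available, something like this works:

getent passwd "$USER" | cut -d: -f7    # show the login shell currently on record
chsh -s /bin/bash                      # change your own login shell (prompts for your password)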

MiniTSS 2.9.0 released, plus c02ware site redesign

If you take a look at the page for the Tritech Service System distribution of Linux, you’ll notice a few new things. The most obvious is that I’m redoing the c02ware site design; there’s now a basic logo, proper site navigation, a mobile-friendly layout, and a cleaner-looking color scheme. Consistency across pages has been greatly improved, and lots of unnecessary old junk and confusing content has been completely tossed out.

This change is being driven by my push to release the Tritech Service System with all of our proprietary bits included as a commercial product, with regular updates, bug fixes, and support. I will continue to release TSS without any proprietary bits as a public and completely free system. For anyone in the PC repair business, though, the paid-for version can easily pay for itself in workflow acceleration and productivity boosts within a month, and we want to bring that advantage to other PC service shops and I.T. departments. If you are interested in being notified when the Tritech Service System becomes available for purchase, send me an email and I will keep you in the loop.

A major goal in TSS is keeping the system as small as possible without cutting out basic features. In the effort to move towards this goal, I have released MiniTSS 2.9.0! The download is a paltry eight megabytes in size and includes the following software packages:

  • busybox 1.21.1
  • chntpw 110511
  • cifsmount (mount.cifs helper)
  • dd-rescue 1.28
  • dropbear 0.52
  • fuse 2.8.3
  • glibc 2.10.1
  • libblkid 1.1.0
  • libuuid 1.3.0
  • ncurses 5.6
  • ntfs-3g-ntfsprogs 2011.4.12
  • pv 1.2.0
  • rsync 3.0.7
  • socat 1.7.2.1
  • sysfsutils 2.1.0
  • tar 1.22
  • tss-base-fs-mini 101
  • tss-bootstrap
  • udev 163
  • xz-utils 5.0.4
  • zlib 1.2.3

MiniTSS is not just a live CD/USB system. You can download the “source” archive, unpack it on your Linux system, add or remove packages in the initramfs as you see fit, and rebuild your own custom version with whatever software you actually need. The system provides only basic tools and a command line interface, so it is aimed at intermediate-level Linux users.
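
The rebuild step is nothing exotic; unpacking and repacking a gzip-compressed initramfs generally looks like the sketch below (the file names are examples, and the build scripts shipped in the MiniTSS source archive may automate this differently):

mkdir initramfs && cd initramfs
gzip -dc ../initramfs.gz | cpio -id                        # unpack the existing image
# ...add or remove files here...
find . | cpio -o -H newc | gzip -9 > ../initramfs.new.gz   # repack in newc format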

So, how do I handle repairing a RAID-5 in a server I can’t touch?

Two drives failed in a 5-disk RAID-5 array at a client who dropped our services; fortunately, I’d put in a backup system, so when they brought us back on, I had a full backup from midnight to restore from. Unfortunately, only one drive was on order for various reasons and they needed it back up as soon as possible…now, here’s the million dollar question:

How do you reconstruct a RAID-5 array that was once a 5-drive array when only 3 good drives remain? Consider the following:

  • The original array was 5x 500GB drives, of which only 3 remain
  • The backup data is about 500GB (so just RAID-1 for the data isn’t a choice)
  • The Linux boot/root filesystem sits on a RAID-1 across the first partitions of all five hard drives (resilient as long as at least one drive survives)

Since the boot/root RAID-1 is still resilient against further drive failures, concerns over it are pointless. The client doesn’t have time to wait on a spare drive or two, but we want to order a spare anyway, both for possible future capacity needs and to pull off the next step. What was the best compromise? Simple!

mdadm --create /dev/md1 -l 5 -n 4 /dev/sd[abc]2 missing

This creates a 4-drive RAID-5 from three of the old 5-drive array’s partitions, intentionally left degraded so that a fourth disk can be added later. From there, I wrote a new XFS filesystem, mounted it, and restored the data. This is what /proc/mdstat looks like now on that system (note that no reboot was needed for any of these repairs!):

# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md1 : active raid5 sdc2[2] sdb2[1] sda2[0]
      1406519040 blocks super 1.2 level 5, 128k chunk, algorithm 2 [4/3] [UUU_]

md0 : active raid1 sda1[0] sdb1[1] sda1[4] sdd1[5](F)
      19542976 blocks [5/3] [UU__U]

unused devices: <none>
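
Spelled out, the rest of the recovery and the eventual re-add of the replacement disk look roughly like this; the device names and mount point are examples from this particular box and will differ elsewhere:

mkfs.xfs /dev/md1                 # new filesystem on the degraded 4-drive array
mount /dev/md1 /mnt/data
# ...restore the backup into /mnt/data...
# Later, once the replacement drive arrives and is partitioned:
mdadm --add /dev/md1 /dev/sdd2    # the array rebuilds onto the new disk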

Manually copying a RAID-0 striped array to a single drive for data recovery

This question was posed on a forum:

I have a customer who has a computer, 2 SATA disk (striped in RAID config. Windows won’t load. Diag reports bad hard drive. When I disconnect one, it kills the stripe and the computer appears to not have a hard drive at all. Seems kind of silly to have it set this way as it increases the risk of failure. Other than putting each hard drive in another computer, I’d like to determine which of the disk are bad.

Also, not quite sure how to attack data recovery as they are a stripe set and plugging in to a SATA to USB does not appear to be a valid method. If I put a third hard drive in as a boot drive, do i have to reconfig the stripe set and if i do, will it kill the data.

I have reassembled two RAID-0 “striped” drives onto a single larger drive by hand before. It’s actually a programmatically simple operation, but it requires a lot of low-level knowledge and some guesswork. The specific pair I had appeared to store the RAID metadata somewhere other than the start of the disk, and I was able to discover through a hex editor that the array used a 64 KB stripe size. I also spotted which drive had a partition table and which didn’t, because the partition table only exists on the first drive, which contains the first stripe.
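
The detective work itself was done with ordinary tools; a sketch of the sort of commands involved, with the device names assumed:

fdisk -l /dev/sda /dev/sdb        # only the first member of the stripe set carries the partition table
dd if=/dev/sdb bs=512 count=256 2>/dev/null | xxd | less   # eyeball raw sectors for recognizable data boundaries

Guessing the stripe size still comes down to spotting where recognizable data breaks off and resumes.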

At a Linux command prompt, with the two RAID-0 disks (which were NOT being detected properly by the Linux LVM2 “FakeRAID” algorithms, by the way) and a disk of at least twice their size connected, I wrote a very simple script that looked something like this (sda/sdb are the RAID-0 members, sdc is the destination disk; this might work under ash or similar shells as well):

---- cut here ----

#!/bin/bash

# X=sda stripe index, Y=sdb stripe index, Z=sdc stripe index, STRIPE=stripe size in bytes
X=0; Y=0; Z=0; STRIPE=65536

# Retrieve the size of one RAID-0 member (in 1 KiB blocks) so we can terminate at the end
SIZE=$(grep 'sda$' /proc/partitions | awk '{print $3}')
# Convert the size to a stripe count, rounding up so any tail blocks are included
SIZE=$(( (SIZE * 1024 + STRIPE - 1) / STRIPE ))
while [ "$X" -lt "$SIZE" ]
do
dd if=/dev/sda of=/dev/sdc seek=$Z skip=$X bs=$STRIPE count=1
Z=$(( Z + 1 ))
dd if=/dev/sdb of=/dev/sdc seek=$Z skip=$Y bs=$STRIPE count=1
Z=$(( Z + 1 ))
X=$(( X + 1 ))
Y=$(( Y + 1 ))
done

---- cut here ----

Note that all it does is load 64 KB at a time from each disk, alternating between them, and save the chunks to the third disk in sequential order. This is untested, requires modification to suit your scenario, and is only written here as an example. It does not abort if a ‘dd’ command fails, so it will work okay for data recovery; you will lose any stripe that contains a bad block, though, and the algorithm could be improved to use dd_rescue (if you have it) or to copy smaller units within a stripe so that only a partial stripe is lost on bad data.

My take on Windows 8, Metro, touchscreens, and other desktop disasters

Windows 8 has this shiny new user interface that’s known as “Metro.” I hate Metro. LOTS of people hate Metro. Metro is supposed to be easier for touchscreen usage, but Windows is a desktop operating system. I don’t want to re-hash everything that other people have written about why Metro is garbage, so I’ll just drop a few points to get my ideas across.

  • Metro is designed specifically with touchscreens in mind. Some all-in-one desktop computers are now touch-capable, and Windows 8 is supposed to become available for ARM architectures so that Windows 8 can be used on new tablets. However, there are two major problems: MOST desktop and laptop computers DO NOT HAVE TOUCH CAPABILITY AT ALL (that’s the vast majority of what it runs on) and TOUCH IS NOT PRACTICAL FOR DESKTOP USE.
  • Touchscreens require holding your arm up to manipulate what we traditionally would use a mouse and pointer to work with. That’s fine for a minute, but if you think your arm is NOT going to get tired ten minutes into touchscreen-centric hell, you’re fooling yourself.
  • Have you seen the Explorer windows? They brought that awful, terrible “ribbon UI” from Office 2010 into Windows 8. Not only is it annoying as hell to use, it’s counterintuitive: with monitors trending towards widescreen displays, vertical screen space is in much shorter supply, while ironically still being the most needed type of screen space for office applications and for seeing more files at once in Explorer’s “details” file view. Yet somehow, Microsoft’s logic is to replace one toolbar with something that’s three toolbars in height. Way to go, you idiots. (If a ribbon popped out of the left or right, it’d make more sense, but ribbon organization is actually less efficient than toolbars, in my opinion.)
  • Start button in desktop mode: GONE. WHY?! The Start button paradigm was revolutionary. There’s a reason that it’s persisted since the introduction of Windows 95, and is often imitated in many Linux desktops: it gets the job done, and does so pretty well, as long as you didn’t have 100 folders inside it (and Vista fixed that with the introduction of a scrolling Start menu programs list that ACTUALLY HAS A FREAKING SCROLL BAR…what took so long to come up with that?!)

If I was to advocate for a radical UI change, I’d want to see something more like Fluxbox on Linux systems. I can right-click anywhere on the “desktop” to get a program menu, with no Start button required. If I use a Fluxbox theme with rounded top corners on the windows, I can launch my mouse to the upper-left or upper-right corners of the screen (two of the most prominent “hotspots” as any skilled UI designer will tell you) and right-click to get said menu as well. Right-clicking on the title bar brings up all of the window management functions I could ever need. Fluxbox isn’t the prettiest thing in the world, and it’s a little weird to someone who is used to choosing between “Start menu” and “Mac dock” ways of working with programs, but being able to call up a Start menu of sorts without even needing the button in the first place isn’t hard to get used to, and is much faster than having to aim for a button.

Honestly, I’ve gotten spoiled by Fluxbox and Linux. I can’t believe how fast a huge application like Firefox starts up under Fluxbox. Ubuntu and other distributions with heavy full-blown desktop environments are on par with Windows, but with a minimalist one like Fluxbox, the world just seems so  much faster, even with an unaccelerated VESA video driver.

I digressed a bit, but the moral of the story is this: simple is beautiful, fast, and functional. All this metro/ribbon/touchy crap wastes screen space, slows things down, and frustrates users. I knew things were going sour when Windows had keyboard shortcut accelerator underlining disabled by default, but I didn’t know we would end up with this Metro disaster. I’m making a call out to everyone to advocate for a simpler desktop that doesn’t need to change for the sake of change because it’s functionally sound and easy to work with, without the eye candy and bells and whistles and massive tool ribbons.

Linux PowerPC yaboot + initramfs/initrd woes solved; no more “unable to mount root fs” problem!

This one had me ripping my hair out for two days straight. Anyone who has tried to create a Linux bootable CD for a PowerPC system has either run into this problem, followed some kind of magic set of directions that don’t explain the details that could cause this problem, or done something crummy like using the CD itself as the root filesystem.

PowerPC systems are very different from i386/i686/x86_64 systems in how they boot, and because they are much less common, they garner less interest and also have less available documentation and Internet forum assistance. The specific problem that I ran into is this: using the Tritech Service System’s construction for x86 as a template, and gleaning information from other PPC bootable CD images, I was able to create a CD that would properly boot the iBook G3 I used for testing into yaboot, the PowerPC Linux loader. The process of figuring out how to pull this off took many hours of reading and dissection, and I could easily chalk a full day’s work up as wasted on this process due to the fact that it’s not well-documented. From there, yaboot was configured to load my kernel and initrd (in this case, an initramfs, not an initrd, but the loading process is the same.) However, I was greeted every single time with kernel output that showed no indication of any initrd/initramfs being loaded and handed off to the kernel. I was stumped. It seemed as if I had done everything that the others do, yet it didn’t work. I tried these things to resolve the problem, to no avail:

  • Copying the map.hfs file from another Linux distribution that seemed more complete
  • Editing yaboot.conf to add and remove things like ramdisk_size=16384 or device=cd: to the options
  • Recompiling the PPC32 kernel with initrd turned on (shouldn’t be needed for initramfs, but I was quite annoyed and desperate)
  • Playing with the ofboot.b text file to see if anything inside could make a difference (CHRP is becoming a dirty word in my book)
  • Booting the G3 to Open Firmware and typing excessively cryptic and obnoxious commands that make learning “sed” look like a cakewalk
  • Pondering the consumption of potent alcoholic beverages while at work to defer blame for not figuring this nonsense out

So, after two days of trying to go from a collection of packages and a kernel to a real-world bootable Tritech Service System 2.7.6 ISO for PowerPC Macs, and nearly losing my sanity in the process, I finally hit upon an obscure, nasty, rarely discussed, extremely STUPID, yet horribly important fact:

yaboot doesn’t load initrd or initramfs if the kernel image is compressed.

Yes, that’s it. That’s the source of my ills. The godforsaken bootloader will detect a compressed kernel and simply and quietly ignore the “initrd=/boot/initrd1.gz” parameter. The even simpler solution? Instead of using the compiled kernel at arch/powerpc/boot/zImage.pmac, one must use the compiled kernel at…well, you might not believe this…just plain vmlinux. The uncompressed raw kernel image produced immediately under the Linux kernel folder you build in. That’s all that I had to do, and I have never been so pissed off over such a small detail in my life.
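
For anyone fighting the same battle, the working yaboot.conf stanza ends up looking something like this; the paths and label are examples rather than the exact TSS configuration:

# zImage.pmac (compressed) makes yaboot silently ignore the initrd= line; plain vmlinux does not
image=/boot/vmlinux
        label=tss
        initrd=/boot/initrd1.gz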

All too often in the computer world, I see the “user” aspects of things documented repeatedly and done to death; entire volumes have been written just to explain how to perform basic functions or configure a program to the liking of the user. Even the process of compiling a Linux kernel is so thoroughly documented and explained that it’s fairly hard to fail to do it if you use a decent guide. Why is it, though, that these crucial points involving low-level details and bootloader quirks are overlooked and go largely undocumented? If I type “yaboot initramfs” into Google or Yahoo or Bing, why doesn’t the very first page that appears scream at me in bold text “YABOOT WILL NOT LOAD INITRD IMAGES IF THE KERNEL IMAGE IS COMPRESSED!” I know of approximately ZERO bootloaders that have this obnoxiously non-standard behavior. I’ve messed with LILO, SILO, GRUB, SYSLINUX, ISOLINUX, PXELINUX, BootX, and U-Boot on a $99 WM8650 ARM-based netbook, and not once have I run into this problem with ANY of those bootloaders AT ALL.

I hope that this information helps anyone trying to master a Linux on PowerPC bootable ISO to not waste two days and use their CD burner to create ten useless shiny silver coasters in the process. Also, could someone explain to me WHY the yaboot bootloader can’t load both images as compressed images?

Kernel panic – not syncing: Attempted to kill init! (glibc problem)

While working on the Tritech Service System, I made the mistake of using a glibc package compiled for i686 in the initramfs for an i586 kernel. I did some searching to figure out the source of the problem, since all of my kernels would crash with this message in the exact same place, and thought I’d share the answer with everyone. This could easily frustrate custom Linux distro attempts.

In short: if the kernel panics because something “attempted to kill init!” then make sure your C library (glibc, eglibc, uclibc, dietlibc, whatever) is not compiled for a CPU higher than the CPU you’re trying to run on.

What’s happening is the system is attempting to execute “init” which immediately terminates due to the fact that the library uses invalid CPU instructions (the older processor doesn’t know about the newer instructions compiled into the library). The message “attempted to kill init!” is technically correct: init was killed because it tried to do something bad, but init is required to run anything else, so once init immediately crashes out, there’s nothing left for the system to do, and the kernel hangs itself up.
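
A quick way to check whether a C library was built for i686 (and will therefore die on an i586) is to look for i686-only instructions such as cmov in its disassembly. It’s a rough heuristic rather than a guarantee, but it catches exactly the mistake described above:

objdump -d /lib/libc.so.6 | grep -c cmov    # a non-zero count means i686-or-newer code is present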