Tag: development

Finding Duplicates Faster: The story of ‘jdupes’, or how I unexpectedly became a better programmer

The problem of finding and handling duplicate files has been with us for a long time. Since the end of the year 1999, the de facto answer to “how can I find and delete duplicate files?” for Linux and BSD users has been a program called ‘fdupes’ by Adrian Lopez. This venerable staple of system administrators is extremely handy when you’re trying to eliminate redundant data to reclaim some disk space, clean up a code base full of copy-pasted files, or delete photos you’ve accidentally copied from your digital camera to your computer more than once. I’ve been quite grateful to have it around–particularly when dealing with customer data recovery scenarios where every possible copy of a file is recovered and the final set ultimately contains thousands of unnecessary duplicates.

Unfortunately, development on Adrian’s fdupes had, for all practical purposes, ground to a halt. From June 2014 to July 2015, the only significant functional changes to the code have been modification to compile on Mac OS X. The code’s stagnant nature has definitely shown itself in real-world tests; in February 2015, Eliseo Papa published “What is the fastest way to find duplicate pictures?” which contains benchmarks of 15 duplicate file finders (including an early version of my fork which we’ll ignore for the moment) that places the original fdupes dead last in operational speed and shows it to be heavily CPU-bound rather than I/O-bound. In fact, Eliseo’s tests say that fdupes takes a minimum of 11 times longer to run than 13 of the other duplicate file finders in the benchmark!

As a heavy user of the program on fairly large data sets, I had noticed the poor performance of the software and became curious as to why it was so slow for a tool that should simply be comparing pairs of files. After inspecting the code base, I found a number of huge performance killers:

  1. Tons of time was wasted waiting on progress to print to the terminal
  2. Many performance-boosting C features weren’t used (static, inline, etc)
  3. A couple of one-line functions were very “hot,” adding heavy call overhead
  4. Using MD5 for file hashes was slower than other hash functions
  5. Storing MD5 hashes as strings instead of binary data was inefficient
  6. A “secure” hash like MD5 isn’t needed; matches get checked byte-for-byte

I submitted a pull request to the fdupes repository which solved these problems in December 2014. Nothing from the pull request was discussed on Github and none of the fixes were incorporated into fdupes. I emailed Adrian to discuss my changes with him directly and there was some interest in certain changes, but in the end nothing was changed and my emails became one-way.

It seemed that fdupes development was doomed to stagnation.

In the venerable traditions of open source software. I forked it and gave my new development tree a new name to differentiate it from Adrian’s code: jdupes. I solved the six big problems outlined above with these changes:

  1. Rather than printing progress indication for every file examined, I added a delay counter to drastically reduce terminal printing. This was a much bigger deal when using SSH.
  2. I switched the code and compilation process to use C99 and added relevant keywords to improve overall performance.
  3. The “hot” one-line functions were changed to #define functions to chop function call overhead for them in half.
  4. (Also covers 5 and 6) I wrote my own hash function  and replaced all of the MD5 code with it, resulting in a benchmarked speed boost of approximately 17%. The resulting hashes are passed around as a 64-bit unsigned integer, not an ASCII string, which (on 64-bit machines) reduces hash comparisons to a single compare instruction.

 

After forking all of these changes and enjoying the massive performance boost they brought about, I felt motivated to continue looking for potential improvements. I didn’t realize at the time that a simple need to eliminate duplicate files more quickly would morph into spending the next half-year ruthlessly digging through the code for ways to make things better. Between the initial pull request that led to the fork and Eliseo Papa’s article, I managed to get a lot done:

 

At this point, Eliseo published his February 19 article on the fastest way to find duplicates. I did not discover the article until July 8 of the same year (at which time jdupes was at least three versions higher than the one being tested), so I was initially disappointed with where jdupes stood in the benchmarks relative to some of the other tested programs, but even the early jdupes (version 1.51-jody2) code was much faster than the original fdupes code for the same job.

1.5 months into development, jdupes was 19 times faster in a third-party test than the code it was forked from.

Nothing will make your programming efforts feel more validated than seeing something like that from a total stranger.

Between the publishing of the article and finding the article, I had continued to make heavy improvements:

 

When I found Eliseo’s article from February, I sent him an email inviting him to try out jdupes again:

I have benchmarked jdupes 1.51-jody4 from March 27 against jdupes 1.51-jody6, the current code in the Git repo. The target is a post-compilation directory for linux-3.19.5 with 63,490 files and 664 duplicates in 152 sets. A “dry run” was performed first to ensure all files were cached in memory first and remove variances due to disk I/O. The benchmarking was as follows:

$ ./compare_fdupes.sh -nrq /usr/src/linux-3.19.5/
Installed fdupes:
real 0m1.532s
user 0m0.257s
sys 0m1.273s

Built fdupes:
real 0m0.581s
user 0m0.247s
sys 0m0.327s

Five sequential runs were consistently close (about ± 0.020s) to these times.

In half a year of casual spare-time coding, I had made fdupes 32 times faster.

There’s probably not a lot more performance to be squeezed out of jdupes today. Most of my work on the code has settled down into working on new features and improving Windows support. In particular, Windows has supported hard linked files for a long time, and I’ve taken full advantage of Windows hard link support. I’ve also made the progress indicator much more informative to the user. At this point in time, I consider the majority of my efforts complete. jdupes has even gained inclusion as an available program in Arch Linux.

Out of the efforts undertaken in jdupes, I have gained benefits for other projects as well. For example, I can see the potential for using the string_table allocator in other projects that don’t need to free() string memory until the program exits. Most importantly, my overall experience with working on jdupes has improved my overall programming skills tremendously and I have learned a lot more than I could have imagined would come from improving such a seemingly simple file management tool.

If you’d like to use jdupes, feel free to download one of my binary releases for Linux, Windows, and Mac OS X. You can find them here.

Why I’ll never build a home in Chatham County, NC

I have lived in Siler City, NC (in Chatham County, NC) for four years. Having established a solid commercial presence here and finding the area to be generally decent and agreeable to live in, I’ve been seriously looking into the process of establishing a more permanent residence. However, every time I look up more information regarding the process, I see more reasons to avoid Chatham County for establishing any kind of permanent residence. The reasons are many and varied, but I can chalk the biggest one up to one major factor that causes me more concern than any other. What is this major issue that single-handedly doomed my fantasies of building a home on some undeveloped Chatham County land?

Impact fees.

That’s right, impact fees. Something which I’d never once heard of before I came here. I’ve looked at land in Oxford, NC in the past, as well as various other counties north and northwest of Orange County, and not once have I heard of “impact fees.” What’s an “impact fee” supposed to be for, anyway? Apparently, it’s a one-time county government surcharge (read: “TAX”) that’s supposed to raise money for building or maintaining schools. You know, like elementary, middle, and high schools…for the children I shall never ever produce. And boy, these kids I don’t and won’t have would cost me a ton. How much, you ask?

Chatham County’s impact fee is a one-time fee of $3,500.

Needless to say, I’m not keen on buying a $50,000  parcel of empty land to build my future upon if I have to give Chatham County $3,500 for the privilege of building a house there. Despite being a small business owner (or perhaps because of that), I DO NOT make a large amount of money every month–in fact, I’d say I make the equivalent take-home pay of what someone making $8 an hour would make for the 50+ hours a week I work. Fortunately, I have also gone to some trouble to ensure I live reasonably within my means. I’d love to own instead of rent, but let’s put this into perspective: Chatham County tells me that to start building my dream home here, I have to give them about 9 weeks of my pay just for the “impact fee” privilege, ignoring all other fees such as those required for permits and inspections. The purpose of the impact fee being something that I’ll never see any benefit from is merely an added insult. I don’t want to pay for someone else’s children to go to school, and guess what? I’ll look elsewhere because of the hubris of the fools in charge of Chatham County.

I mean, think about this: if I’m buying land for $50,000 and the county demands $3,500 to “allow me” to build a home, that’s 7% of what I’d have paid for the land! That’s not all there is to it, and I could name off other regressive punishment taxation that chases off development such as the “recreation fee,” but my point is clear.

Chatham County: no one wants to move here because you run things like you’re Chapel Hill, Cary, or Raleigh, but you’re none of these. Chatham is rapidly becoming a “bedroom community” and many businesses are shutting down or moving to neighboring counties that don’t have absurdly brain-dead policies like this. I can’t count how many decent-sized corporations have considered Chatham County, NC as a possible location for some kind of sizable facility that would bring hundreds of jobs to the area, only to be denied something they needed. From what I understand, the old Joan Fabrics building in Siler City (which is now occupied by Acme-McCrary, leaving an empty Acme-McCrary building right across the street) was examined for potential as a distribution center for Sheetz, and that deal fell through because someone in some level of local government didn’t want all that tractor-trailer traffic to be there on US 64. Hello, genius, the building has something like 8-10 loading docks on the side! If you want to fill it, is it really reasonable to expect that those docks will be left mostly unused?

In four years, I have witnessed a slow but steady decline in Chatham County’s economy, and while my business is doing well, it’s more because of adaptation and our ability to engineer workflow and customer experience improvements; the county and city governments largely seem to prefer that businesses shut down and get replaced by trees and pastures. On top of that, fees such as impact and recreation fees that charge a premium for the privilege of developing and growing Chatham County end up reducing overall revenues by strongly encouraging people to build their lives in Burlington, Sanford, and Asheboro instead.

Please, for the love of all that’s sane and logical, get rid of these kinds of fees. They hurt everyone in the entire county, and they’re the biggest reason I’ll never build my permanent home here. I don’t want to live in a bedroom community, and when the lease is up on the current location for my business, I’m going to have to justify remaining in Siler City. A forecast of future economic activity will play very heavily into this choice.

How many people are going through the same thought process about this subject every year? How much opportunity for growth has Chatham flipped the bird towards and lost forever? With the constant growth going on in the county, it will only become more difficult to justify over time.