Hard drive platter spinning

Don’t destroy used hard drives! Wipe them and reuse or sell them with confidence!

The advice to not use a hard drive from eBay is not the best advice. You should fully wipe the drive (a zero fill will do) and then install a new OS on it. The old data will be 100% unrecoverable and you won’t unnecessarily destroy a perfectly good piece of equipment. Please don’t advocate for this kind of wasteful drive destruction.

Yes, a zero fill is more than enough. The “DoD secure wipe” was designed for hard drives from the 80s and early 90s that used FM/MFM/RLL data modulation. Today, drives use [E]PRML and other advanced techniques instead.

Yes, Peter Gutmann wrote a paper about recovering data from hard drives that said you could easily do so, but that was in the era of widespread MFM/RLL drives, and Gutmann himself later walked back his recommendations:

“In the time since this paper was published, some people have treated the 35-pass overwrite technique described in it more as a kind of voodoo incantation to banish evil spirits than the result of a technical analysis of drive encoding techniques. As a result, they advocate applying the voodoo to PRML and EPRML drives even though it will have no more effect than a simple scrubbing with random data. In fact performing the full 35-pass overwrite is pointless for any drive since it targets a blend of scenarios involving all types of (normally-used) encoding technology, which covers everything back to 30+-year-old MFM methods (if you don’t understand that statement, re-read the paper). If you’re using a drive which uses encoding technology X, you only need to perform the passes specific to X, and you never need to perform all 35 passes. For any modern PRML/EPRML drive, a few passes of random scrubbing is the best you can do. As the paper says, “A good scrubbing with random data will do about as well as can be expected”. This was true in 1996, and is still true now.”

Hard drive platter and arm
It’s a miracle that these things work at all.

Quoting Donald Kenney:

“PRML uses a different approach to improving storage density. To permit greater data density, recorded data amplitude is reduced and bits are packed together more closely. The digital signal is then recovered using digital signal processing techniques on the analog data stream from the drive. PRML achieves a 30-40% improvement in storage density over RLL modulation without PRML. EPRML modifies the algorithms used in PRML to achieve additional improvements claimed to be 20-70%.”

The extremely low magnetic amplitude in [E]PRML modulation puts the analog data signal on the platter so close to the noise floor that a DSP is required to filter out the noise and recover the data signal. A simple zero fill will push the previous (very weak) signal firmly back into the noise floor. Snatching data from an MFM drive using a magnetic force microscope or scanning tunneling microscope relied on the strong amplitude of the data writes being “messy.” The magnetic domains of previous writes (sometimes multiple layers of them) would still be detectable “around” the current writes: so much of the surface was influenced by the previous writes that unnecessary “leakage” of the magnetic domains would occur, and subsequent writes wouldn’t necessarily be able to “reach” all of the affected areas.

PRML techniques massively boost data density; doing so makes the margins in which you’d locate this “leaked” data so tight that there isn’t much room for it to exist in the first place, but on top of that, the strength of the write is an order of magnitude weaker. It’s frankly a miracle of modern science that the data so close to the noise floor and with such an insanely tiny amount of surface area can be read back at all. One simple overwrite pass will destroy the data beyond even the abilities of any given three-letter agency to get it back.

So, in short, a one-pass zero-fill of the drive is enough to “sanitize” the data therein. Please don’t throw away or destroy hard drives just because someone else used them before, and if you’re selling a computer, just wipe the drive completely and your now-destroyed data is perfectly safe from prying eyes.

Python code mistake

I made youtube-dl faster for archivists…and solved a worst-case programming problem elegantly in the process

Update: there is an addendum at the end of this article; I mention it because yes, in the end, I switched over to Python sets. I don’t want any Python experts cringing too hard, after all. Welcome to the joys of a C programmer adapting to Python.

For those who haven’t heard of it, youtube-dlc is a much more actively maintained fork of the venerable but stagnant youtube-dl project that was announced on the DataHoarder subreddit. I have been archiving YouTube channels for many months now, trying to make sure that the exponential boost in censorship leading up to the 2020 U.S. Presidential election doesn’t cause important discussions to be lost forever.

Unfortunately, this process has led me to have a youtube-dl archive file containing well over 70,000 entries, and an otherwise minor performance flaw in the software had become a catastrophic slowdown for me. (Side note: a youtube-dl archive file contains a list of videos that were already completely downloaded and that lets you prevent re-downloading things you’ve already snagged.) Even on a brand new Ryzen 7 3700X, scanning the archive file for each individual video would sometimes progress at only a few rejections per second, which is very bad when several of the channels I archive have video counts in the multiple thousands. The computer would often spend multiple minutes just deciding not to download all of the videos on a channel, and that got painful to watch. That’s time that could be spent checking another channel or downloading a video.

When youtube-dlc popped up and offered goodwill to the people who were treated less than favorably by the youtube-dl project maintainers, I realized that I had a chance to get this fixed. I opened an issue for it, but it became clear that the budding fork didn’t have resources to dedicate to the necessary improvement. The existing code technically worked and performance wasn’t as bad for people using much smaller archive files. The youtube-dlc maintainer (blackjack4494) and I quickly identified the chunk of code that was behind the problem, so I decided to take it on myself.

Discussion about the code causing the problem
There’s our problem!

The troublesome code is cut off in the discussion screenshot, so here’s a better look at it:

Problematic code from youtube-dl
Good programmers probably know why this is slow just from this image.

The code outlined in the red box above is opening the archive file for reading, consuming it line-by-line, and comparing the line read from the archive to the line associated with the candidate video to be downloaded. If a line exists that matches, the video is in the archive and shouldn’t be downloaded, so in_download_archive() returns True. This works fine and is a very simple solution to implementing the archive file feature, but there’s a problem: in_download_archive() is invoked for every single possible video download, so the file is opened and read repeatedly.
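Since the screenshot is not reproduced here, the following is a minimal sketch of the pattern just described, not the actual youtube-dl source. The function name and its two parameters are hypothetical; the point is only that the file is reopened and scanned from the top on every call.

```python
def in_download_archive_slow(archive_path, video_id):
    """Return True if video_id appears as a line in the archive file.

    Hypothetical stand-in for the original check: the archive is reopened
    and scanned from the first line on every single call.
    """
    try:
        with open(archive_path, "r", encoding="utf-8") as archive_file:
            for line in archive_file:
                # strip() removes the trailing newline and stray white space
                if line.strip() == video_id:
                    return True
    except FileNotFoundError:
        # No archive file yet means nothing has been downloaded.
        pass
    return False
```

Calling this once per download candidate is what turns a small text file into a large amount of repeated work.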

Some programmers may see this and ask “why is this a problem? The archive file is read for every video, so there’s a good chance it will remain in the OS disk cache, so reading the file over and over becomes an in-memory operation after the first time it’s read.” Given that my archive of over 70,000 lines is only 1.6 MiB in size, that seems to make some sense. What is being missed in all of this is the massive overhead of reading and processing a file, especially in a high-level language like Python.

An aside for programmers not so familiar with more “bare metal” programming languages: in C, you can use a lot of low-level trickery to work with raw file data more quickly. If I were implementing this archive file check code in C (some low-level steps will be omitted here), I’d repeatedly scan the buffer for a newline, reverse the scan direction until extra white space was jumped over (doing what strip() does in Python), convert the last byte of newline/white space after the text to a null byte, and do a strcmp(buffer, video_id) to check for a match. This still invokes all of the OS and call overhead of buffered file reads, but it uses a minimal amount of memory and performs extremely fast comparisons directly on the file data.

In a language that does a lot of abstraction and memory management work for us like Python, Java, or PHP, a lot more CPU-wasting activity goes on under the hood to read the file line by line. Sure, Python does it in 4 lines of code and C would take more like 15-20 lines, but unpack what Python is doing for you within those lines of code:

  1. Allocating several variables
  2. Opening the file
  3. Allocating a read buffer
  4. Reading the file into the buffer
  5. Scanning the buffer for newlines
  6. Copying each line into the “line” variable one at a time
  7. Trimming leading and trailing white space on the line, which means
    • Temporarily allocating another buffer to strip() into
    • Copying the string being stripped into that buffer…
    • …while checking for and skipping the white space…
    • …and copying the string back out of that buffer
  8. Finally doing the string comparison
  9. All while maintaining internal reference counts for every allocated item and periodically checking for garbage collection opportunities.

Multiply the above by 2,000 video candidates and run it against an archive file with 70,000 entries and you can easily end up with steps 6, 7, and 8 being executed almost 140,000,000 times if the matching strings are at the end of the archive. Python and other high-level languages make coding this stuff a lot easier than C, but they also make it dead simple to slaughter your program’s performance since a lot of low-level details are hidden from the programmer.

I immediately recognized that the way to go was to read the archive file one time at program startup rather than reading it over and over for every single download candidate. I also recognized that this is the exact problem a binary search tree (BST) is designed to speed up. Thus, my plan was to do the same line-by-line read-and-strip as the current code, but then store each processed line in the BST, then instead of reading the file within in_download_archive(), I’d scan the BST for the string. The neat thing about a BST is that if it were perfectly balanced, 70,000 entries would only be 17 levels deep, meaning each string check would perform at most 17 string comparisons, a much better number than the 70,000-comparison worst-case of scanning a flat text file line-by-line.
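The numbers quoted above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the helper name is made up for illustration, and the figures are the 70,000-entry archive and 2,000-video channel used as examples in this article.

```python
def balanced_bst_levels(entry_count):
    """Levels in a perfectly balanced binary tree holding entry_count nodes."""
    # A tree with k full levels holds 2**k - 1 nodes, so the level count is
    # floor(log2(n)) + 1, which is exactly what int.bit_length() returns.
    return entry_count.bit_length()


ARCHIVE_ENTRIES = 70_000   # lines in the archive file
VIDEO_CANDIDATES = 2_000   # videos on one large channel

linear_worst_case = VIDEO_CANDIDATES * ARCHIVE_ENTRIES
bst_worst_case = VIDEO_CANDIDATES * balanced_bst_levels(ARCHIVE_ENTRIES)

print(f"flat file scan: up to {linear_worst_case:,} comparisons")
print(f"balanced BST:   up to {bst_worst_case:,} comparisons")
```

With these inputs the flat scan tops out at 140,000,000 comparisons while the balanced tree needs at most 34,000.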

So, I set out to make it happen, and my first commit in pursuit of the goal dropped.

Binary search tree Python code
The workhorse code; a classic binary search tree
Archive preload code that feeds the binary search tree
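The commit screenshots are not reproduced here, so what follows is only a rough sketch of a classic recursive binary search tree plus a preload step. The class name, method names, and function name are hypothetical and may not match what was actually committed.

```python
class ArchiveNode:
    """One node of a plain (unbalanced) binary search tree of strings."""

    def __init__(self, line):
        self.line = line
        self.left = None
        self.right = None

    def insert(self, line):
        """Recursively walk down to the first free slot and store line."""
        if line < self.line:
            if self.left is None:
                self.left = ArchiveNode(line)
            else:
                self.left.insert(line)
        elif line > self.line:
            if self.right is None:
                self.right = ArchiveNode(line)
            else:
                self.right.insert(line)
        # Equal strings are already present, so nothing is stored twice.

    def contains(self, line):
        """Return True if line is stored somewhere in this subtree."""
        node = self
        while node is not None:
            if line == node.line:
                return True
            node = node.left if line < node.line else node.right
        return False


def preload_archive(archive_path):
    """Read the archive once and return the tree root (None if empty)."""
    root = None
    with open(archive_path, "r", encoding="utf-8") as archive_file:
        for raw_line in archive_file:
            line = raw_line.strip()
            if not line:
                continue
            if root is None:
                root = ArchiveNode(line)
            else:
                root.insert(line)
    return root
```

Each call to insert() adds one stack frame per tree level it descends, so the depth of the tree directly limits how far the recursion can go.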

This actually worked nicely!…until I tried to use my huge archive instead of a tiny test archive file, and then I was hit with the dreaded “RuntimeError: maximum recursion depth exceeded while calling a Python object.” Okay, so the recursion is going way too deep, right? Let’s remove it…and thus, my second commit dropped (red is removed, green is added).

Python code change from recursion to a loop
Let’s do it in a loop instead!

With the recursion swapped out for a loop, the error was gone…but a new problem surfaced, and unfortunately, it was a problem that comes with no helpful error messages. When fed my archive file, the program seemed to basically just…hang. Performance was so terrible that I thought the program had completely hung. I put some debug print statements in the code to see what was going on, and immediately noticed that every insert operation would make the decision to GO RIGHT when searching the tree for the correct place to add the next string. There was not one single LEFT decision in the entire flood of debug output. That’s when I finally realized the true horror that I had stepped into.

I had sorted my archive…which KILLED the performance.

Binary search trees are great for a lot of data, but there is a rare but very real worst-case scenario where the tree ends up becoming nothing more than a bloated linked list. This doesn’t happen with random or unordered data, but it often happens when the tree is populated in-order with data that is already sorted. Every line in my sorted file was “greater than” the line before it, so when fed my sorted archive, the tree became an overly complex linked list. The good news is that most people will not have a sorted archive file because of the randomness of the video strings, but the bad news is that I had sorted mine because it boosted overall archive checking performance. (Since new video downloads are appended to the archive, the most recent stuff is always at the end, meaning rejecting those newer downloads under the original archive code always required the worst-case amount of time.) It is entirely possible that someone else would sort their archive at some point, so I had accidentally set myself up in the worst-case scenario and I couldn’t just ignore it and assume no one else made the same mistake. I had to fix it.
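The degenerate case is easy to reproduce. The sketch below uses a loop-based insert and feeds it already-sorted strings; the names and the made-up "youtube NNNNN" entries are purely illustrative and are not taken from the real commits or from a real archive.

```python
class Node:
    def __init__(self, line):
        self.line = line
        self.left = None
        self.right = None


def insert_iterative(root, line):
    """Insert line using a loop; return (root, depth at which it sits)."""
    if root is None:
        return Node(line), 1
    node, depth = root, 1
    while True:
        if line == node.line:
            return root, depth
        branch = "left" if line < node.line else "right"
        child = getattr(node, branch)
        if child is None:
            setattr(node, branch, Node(line))
            return root, depth + 1
        node, depth = child, depth + 1


def tree_depth_after(lines):
    """Build a tree from lines in the given order and report its depth."""
    root, deepest = None, 0
    for line in lines:
        root, depth = insert_iterative(root, line)
        deepest = max(deepest, depth)
    return deepest


# Zero-padded so that string order matches numeric order, i.e. sorted input.
sorted_ids = [f"youtube {n:05d}" for n in range(2000)]
print(tree_depth_after(sorted_ids))  # prints 2000: every insert went right
```

A tree whose depth equals its entry count is a linked list in disguise: every insert and every lookup has to walk past all of the earlier entries.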

I got 3/4 through changing over to a weighted BST before realizing that it would not improve the load times and would only help the checking behavior later. That code was scrapped without a commit. I had previously added weighted trees with rebalancing to jdupes, but removed them when all tests over time showed they didn’t improve performance.

How do you optimally feed sorted data into a BST? Ideally, you’d plop the data into a list, add the middle piece of data, then continue taking midpoints to the left and right alternately until you ran out of data to add (see this excellent tutorial for a much better explanation with pictures). Unfortunately, this requires a lot of work; you have to track what you have already added and the number of sections to track increases exponentially. It would probably take some time to do this. I decided to try something a little simpler: split the list into halves, then alternate between consuming each half from both ends until the pointers met. Thus, the third commit dropped.

Python code to attempt to add a sorted list to a binary tree
This didn’t work.
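For comparison, here is one way the “ideal” midpoint-first ordering mentioned above could be sketched. This is not what any of the commits did; the function name is hypothetical, and a queue of index ranges does the bookkeeping of which sections are still waiting to be split.

```python
from collections import deque


def midpoint_order(sorted_lines):
    """Yield items of a sorted list in an order that keeps a plain BST balanced."""
    pending = deque([(0, len(sorted_lines))])  # half-open ranges still to split
    while pending:
        low, high = pending.popleft()
        if low >= high:
            continue
        mid = (low + high) // 2
        yield sorted_lines[mid]
        pending.append((low, mid))        # left half, handled on a later pass
        pending.append((mid + 1, high))   # right half


print(list(midpoint_order(["a", "b", "c", "d", "e", "f", "g"])))
# prints ['d', 'b', 'f', 'a', 'c', 'e', 'g']
```

Inserting the items in the yielded order places the middle element at the root, the quarter points below it, and so on down the levels.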

This commit had two problems: the pointer checks resulted in failure to copy 2 list elements and the improvement in behavior was insufficient. It was faster, but we’re talking about an improvement that can be described as “it seemed to just hang before, but now it completes before I get mad and hit CTRL-C to abort.” Unacceptable, to be sure. I stepped away from the computer for a while and thought about the problem. The data ideally needs to be more random than sorted. Is there an easier way? Then it hit me.

I used ‘sort’ to randomize my archive contents. Problem: gone.

All I needed to do was randomize the list order, then add the randomized list the easy way (in-order). Could it really be that simple?! Yes, yes it can! And so it was that the fourth commit dropped (red is removed, green is added).

Python code with an elegant solution
Surprisingly simple solution, isn’t it?

This worked wonderfully in testing…until I tried to use it to download multiple videos. I made a simple mistake in the code because it was getting late and I was excited to get things finished up. See if you can find the mistake before looking at my slightly embarrassing final commit below.

Python code mistake

As I wrote this article, I realized that there was probably a cleaner way to randomize the list in Python. Sure enough, all of the code seen in that last commit can be replaced with just one thing: random.shuffle(lines), and thus dropped my new final commit.

Python randomization loop replaced with one-liner
One-line built-in is better!
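A minimal sketch of the shuffle-before-insert idea follows. Only random.shuffle() itself comes from the article; the function name and the file handling around it are assumptions made for illustration.

```python
import random


def load_shuffled_archive(archive_path):
    """Read every non-empty archive line, stripped, in random order."""
    with open(archive_path, "r", encoding="utf-8") as archive_file:
        lines = [line.strip() for line in archive_file if line.strip()]
    # A sorted file would turn a plain BST into a linked list, so the
    # ordering is destroyed before anything is inserted into the tree.
    random.shuffle(lines)
    return lines
```

The returned list can then be fed to the tree in order, which is the “easy way” described above.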

I think the code speaks for itself, but if you’d like to see the end result, make sure you watch the video at the top of this article.


I posted this article to Reddit and got some helpful feedback from some lovely people. It’s obvious that I am not first and foremost a Python programmer and I didn’t even think about using Python sets to do the job. (Of course a person who favors C will show up to Python and implement low-level data structures unnecessarily!) It was suggested by multiple people that I replace the binary search tree with Python sets, so…I did, fairly immediately. Here’s what that looked like.

Using Python sets instead of a binary search tree
Code go bye bye
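The set-based approach can be sketched in a few lines. As with the earlier examples, this is not the merged code; both function names are hypothetical.

```python
def preload_download_archive(archive_path):
    """Read the archive once and return its entries as a set of strings."""
    entries = set()
    try:
        with open(archive_path, "r", encoding="utf-8") as archive_file:
            for line in archive_file:
                entry = line.strip()
                if entry:
                    entries.add(entry)
    except FileNotFoundError:
        # No archive file yet: start with an empty set.
        pass
    return entries


def in_download_archive(entries, video_id):
    """Membership test against the preloaded set; no file access needed."""
    return video_id in entries
```

A set is backed by a hash table, so the order of the lines in the archive file no longer matters and no shuffle is needed.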

The Python set-based implementation is definitely easier to write and does seem to work well. If I did this again, I’d probably skip the BST with shuffle and just use sets. At first, the performance looked almost as good as the BST: I ran a test using OBS Studio to capture the output, then stepped through the recording frame by frame to find the beginning and end of the archive checking process. The set version took 9 seconds; the BST version took 8 seconds. While the set version looks prettier, the fact that the BST has already been merged into upstream and is a bit faster means that the BST (despite probably offending some Python lovers) is here to stay. Actually, it turns out that I made a mistake: I tested the set version with ‘python’ but the BST version was compiled into an executable; after compiling the set version into an executable, it turns out that the set version takes about 6-7 seconds instead. Excuse me while I send yet another pull request!

If you have any feedback, feel free to leave a comment. The comment section is moderated, but fair.

Cute cloned dogs

A CHALLENGER APPEARS: “fclones”…fastest duplicate scanner ever? It’s complicated.

While perusing my link referrals on GitHub, I noticed this thread where my duplicate scanner jdupes was mentioned. I then noticed the comment below it:

There is also a much faster modern alternative to fdupes and jdupes: fclones. It searches for files in parallel and uses a much faster hash function than md5.

My response comment pretty much says it all, so I’m making that the entire remainder of this post.

I noticed that fclones does not do the byte-for-byte safety check that jdupes (and fdupes) does. It also relies exclusively on a non-cryptographic hash for comparisons. It is unsafe to rely on a non-cryptographic hash as a substitute for the file data, and comparisons between duplicate finders running in full-file comparison mode vs. running in hash-and-compare mode are not appropriate. The benchmark on the fclones page ran jdupes 1.14 without the -Q option that disables the final byte-for-byte confirmation, so there is a lot of extra work for the purpose of avoiding potential data loss being done by jdupes and being skipped entirely by fclones.

jdupes already uses a faster hash function than MD5 (xxHash64 as of this writing, previously jodyhash), and it is fairly trivial to switch to even faster hash functions if desired…but the fact is that once you switch to any “fast hash” function instead of a cryptographic one the hash function used is rarely a bottleneck, especially compared to the I/O bottleneck represented by most consumer-grade hard drives and low-end SSDs. If everything to be checked is in the buffer cache already then it might be a bottleneck, but the vast majority of duplicate scan use cases will be performed on data that is not cached.

Searching for files in parallel is only an advantage if the disk I/O is not a bottleneck, and you’ll notice that the fclones author performed the dedupe benchmarks on a (presumably very fast since it’s paired to a relatively recent Xeon) 512GB NVMe SSD with an extremely fast multi-core multi-threaded processor. There is a very small access time penalty for random read I/O on a fast NVMe SSD, but there is an extremely large access time penalty for random read I/O on a traditional rotating hard drive or RAID array composed of several hard drives. Any number of multiple threads firing off reads on the same RAID array at the same time will slow even most RAID arrays to a single-digit MB/sec death crawl. I understand that many people will be working with SSDs and some duplicate scanner programs will be a better choice for SSDs, but the majority of computer systems have spinning rust instead of flash-based disks.

It is strongly advisable to (A) run your own benchmarks on your specific workload and hardware, and (B) understand how to use the program within your own personal acceptable level of risk. Both of these are different for every different person’s needs.
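The difference between trusting a hash and confirming byte-for-byte can be sketched as follows. This is an illustration of the general idea only, not the code of jdupes or fclones: zlib.crc32 stands in for a fast non-cryptographic hash (xxHash64 is not in the Python standard library), and the function names and the confirm_bytes parameter are hypothetical.

```python
import filecmp
import zlib


def fast_hash(path, chunk_size=1 << 20):
    """Checksum a file in chunks with CRC32 (stand-in for a fast hash)."""
    value = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            value = zlib.crc32(chunk, value)
    return value


def are_duplicates(path_a, path_b, confirm_bytes=True):
    """Treat a hash match as a hint and optionally confirm byte-for-byte."""
    if fast_hash(path_a) != fast_hash(path_b):
        return False
    if not confirm_bytes:
        # Hash-only mode: less work, but a collision would go unnoticed.
        return True
    # shallow=False forces a comparison of the actual file contents.
    return filecmp.cmp(path_a, path_b, shallow=False)
```

With confirm_bytes left on, every matched pair is read a second time, which is the extra work a benchmark has to account for when comparing the two modes.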

UPDATE: I found another instance of the fclones author claiming jdupes being single-threaded makes it slow; to quote directly:

Unfortunately these older programs are single-threaded, and on modern hardware (particularly on SSDs) they are order of magnitude slower than they could be. If you have many files, a better option is to use fclones (disclaimer: I’m the author), which uses multiple threads to process many files in parallel and offers additional filtering options to restrict the search.

The points I’ve made above still stand. Unless you’re running the author’s enterprise-grade high-end hardware, your disk random access latency is your major limiting factor. I’d love to see what fclones does on something like a 24TB disk array. I’d wager–exactly as stated above–that 8 or 32 simultaneous I/O threads brings the whole process to a death crawl. Perhaps I should bite the bullet and run the tests myself.

UPDATE 2: I was right. Benchmark article and analysis forthcoming.

Featured image Licensed under CC-BY from Steve Jurvetson,

SubscribeStar Logo

jdupes 1.16.0: File Extension Filtering, And I Need Your Support

Please consider supporting me on SubscribeStar so I can continue to bring open source software and tutorial videos to you!

Over the past weekend, I implemented a feature in jdupes that was pretty strongly desired by the user community. Version 1.16.0 has the ability to filter by file extensions, either by excluding files with certain extensions or only scanning files with certain extensions. Now you can do things like scan only the JPEG files you dumped from your phone while ignoring all of the videos, scan a software project folder’s .c and .h files for duplicates while ignoring all of the others, or find all duplicates except for XML files.

In addition, I’ve cleaned up the extended filter (-X/--extfilter) framework and created an entirely separate help text section (jdupes -X help) that explains the extfilter options and behavior in great detail.

The extended filters are also cumulative, so specifying multiple filter options works as expected; for example, “jdupes -X noext=mp3 -X noext=aac -X size+=:1M” will exclude all files from consideration that end in .mp3/.aac as well as all files that are 1MiB or larger in size.

Unfortunately, there is not currently a way to combine filters, i.e. exclude all files with a particular extension over a particular size. That may be a feature in the future, but right now, I’m trying to add some basic filter functionality that satisfies as many basic filtering use cases as possible with as little work as possible. In the case of the extension filter, it took me about 3-4 hours to code, test, and fix issues with the feature, then issue a new release. It was relatively easy to implement, and even includes the ability to scan a comma-separated list of extensions rather than requiring a separate option for every single extension you want to filter.
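The cumulative behavior can be modeled as a list of independent exclusion checks where any single match rejects the file. The sketch below is a Python model of that logic only, not jdupes code (jdupes is written in C); every name in it is hypothetical, and the exact, case-sensitive extension match is an assumption of the sketch.

```python
import os

MIB = 1024 * 1024


def excluded_by_extension(ext):
    """Build a check that rejects files ending in .ext (exact match assumed)."""
    def check(path, size_bytes):
        return os.path.splitext(path)[1] == "." + ext
    return check


def excluded_at_or_above(limit_bytes):
    """Build a check that rejects files of limit_bytes or larger."""
    def check(path, size_bytes):
        return size_bytes >= limit_bytes
    return check


# Models: -X noext=mp3 -X noext=aac -X size+=:1M
FILTERS = [
    excluded_by_extension("mp3"),
    excluded_by_extension("aac"),
    excluded_at_or_above(MIB),
]


def is_excluded(path, size_bytes, filters=FILTERS):
    """A file is dropped as soon as ANY one filter matches it."""
    return any(check(path, size_bytes) for check in filters)
```

Because the checks are only ever OR-ed together, a rule such as “exclude .mp3 files, but only above 1MiB” cannot be expressed in this model, which mirrors the limitation described above.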

Other features that filter on file paths will be more difficult to implement, but are highly desired by the community, so I have plans to implement those soon. The next fix, however, will be for the problematic BTRFS/XFS dedupe support (ioctl_fideduperange) that can’t dedupe large file sets properly.

Thanks as always for your continued support. More support means more time to work on my projects! I appreciate all of your help.

Manny the Martyr – Be That Way MP3 (public domain, CC0, royalty free music)

When the wonderful public domain music website finished its redesign, this song went missing on “Page 2” and it’s one of my favorite public domain songs. Links to it are getting hard to find, so I’ve uploaded it here. Right-click the link to download the song, or click the link to listen in your browser.

Manny the Martyr – Be That Way.mp3

Sage Software logo with "oof!" overlaid

Shell script that converts Sage PRO exported text (.out files) to CSV text format

I have had this tool lying around since 2014. I wrote it once for a business that needed to convert the plain-text .OUT files from Sage PRO into CSV format. It isn’t a super smart script; it only converts the formatting so that the file can be opened in a program like LibreOffice Calc or Microsoft Excel. One .OUT file can have information for lots of accounts, so it doesn’t even bother trying to split up the accounts, though it’s easy to do by hand if desired. I don’t know if this will work with newer versions of PRO or with reports different from the kind I wrote it against. It is offered as-is, no warranty, use at your own risk, don’t blame me if the output gets you a call from the IRS.

If this is useful to you, please leave a comment and let me know! The company I did this for ended up not even using the product of my hard work, so just knowing that anyone at all found this useful will make me very happy.

To use this, you’ll need to give it the name of the .out file you want it to process. Also, this was written when my shell scripting was still a little unrefined…please don’t judge too harshly 🙂

Click here to download the Sage PRO to CSV shell script.


#!/bin/bash
# (Interpreter line assumed: the script relies on bash's echo -e/-E/-n.)

# Convert Sage PRO exported text to CSV text format
# Copyright (C) 2014-2020 by Jody Bruchon <>
# Distributed under The MIT License
# Distributed AS-IS with ABSOLUTELY NO WARRANTY. Use at your own risk!

# Program synopsis:
# Converts a Sage PRO ".out" text file to CSV for use as a spreadsheet

# OUT files are generally fixed-width plain text with a variety of
# header and footer information.

# The general process of converting them to CSV text is as follows:

# - Read each line in the file
# - Skip lines that aren't part of the financial data
# - Skip irrelevant page/column headers and any empty lines
# - Read the account number/name information header
# - Consume columns of transaction data in order; convert to CSV data
# - Ignore account/grand totals and beginning balance fields
# - Loop through all the lines until input data is exhausted

# This script has only been tested on a specific version of Sage PRO
# and with one year of financial data output from one company. It may
# not work properly on your exported data, in which case you'll need
# to fix it yourself.

# The script will throw an error if it encounters unexpected data; however, this
# does not always happen if the data appears to conform to expected
# input data ordering and formatting. For example, financial data is
# assumed to be fixed-width columns and the data is not checked for
# correct type i.e. a valid float, integer, or string.

echo "A tool to convert Sage PRO exported text to CSV text format"
echo "Copyright (C) 2014-2020 by Jody Bruchon <>"
echo "Distributed under The MIT License"
echo -e "Distributed AS-IS with ABSOLUTELY NO WARRANTY. Use at your own risk.\n"

if [ ! -e "$1" ]
    then echo "Specify a file to convert."
    echo -e "\nUsage: $0 01-2014.out > 01-2014.csv\n\n"
    exit 1
fi

SKIP=0    # Number of lines to skip
LN=0    # Current processing line number
TM=0    # Transaction output mode

HEADERS='"Tran Date","Source","Session","Transaction Description","Batch","Tran No","Debit Amt.","Credit Amt.","Ending Bal."'

# Column widths
C1=8    # Tran Date
C2=2    # Source (initials)
C3=9    # Session
C4=23    # Transaction Description
C5=9    # Batch
C6=6    # Tran No
C7=26    # Debit Amt.
C8=20    # Credit Amt.
C9=18    # Ending Bal.

CMAX=9    # Number of columns

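pad_col () {
    # Emit one comma for each remaining column
    X=$(expr $CMAX - $1)
    while [ $X -gt 0 ]
        do echo -n ","
        X=$((X - 1))
    done
    # End the CSV row (assumed: callers rely on pad_col to finish the line)
    echo
}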

consume_col () {
    # Read next item in line
    CNT=$(eval echo \$C$Z)
    #echo CNT $CNT
    I="$(echo -E "$T" | sed "s/\\(.\{$CNT\}\\).*/\"\1\",/")"
    T="$(echo -E "$T" | sed "s/^.\{$CNT\}    //")"
    # Strip extraneous spaces in fields
    if [ $Z != 4 ]
        then I="$(echo -E $I | sed 's/^  *//;s/  *$//')"
    fi
    echo -n "$I"
}

while read -r LINE
    do
    # Count line numbers in case we need to report an error
    LN=$((LN + 1))

    # Handle line skips as needed
    if [ $SKIP -gt 0 ]
        then SKIP=$((SKIP - 1))
        continue
    fi

    # Strip common page headers (depaginate)
    if echo "$LINE" | grep -q "^Page:"
        then SKIP=7
        continue
    fi

    # Strip standard column headers
    if echo "$LINE" | grep -q "^Tran Date"; then continue; fi
    if echo "$LINE" | grep -q "^Account Number"; then continue; fi

    # Don't process totally empty lines
    if [ -z "$LINE" ]; then continue; fi

    # Pull account number and name
    if echo "$LINE" | grep -q '^[0-9]\{5\}'
        then
        ACCT="$(echo -E "$LINE" | cut -d\  -f1)"
        ACCTNAME="$(echo -E "$LINE" | sed 's/   */ /g;s/^  *//' | cut -d\  -f2-)"
        pad_col 0
        echo -n "$ACCT,\"$ACCTNAME\""; pad_col 2
        continue
    fi

    # Sometimes totals end up on the previous line
    if echo -E "$LINE" | grep -q '^[0-9][0-9][^/]'
        then LL="$LINE"
        continue
    fi
    if echo -E "$LINE" | grep -q '^\$'
        then LL="$LINE"
        continue
    fi
    if [ ! -z "$LL" ]
        then LINE="$LINE $LL"
        unset LL
    fi

    if echo "$LINE" | grep -q "Beginning Balance"
        then
#        then BB="$(echo -E "$LINE" | awk '{print $3}')"
#        echo -n "\"Begin Bal:\",$BB"; pad_col 2
#        pad_col 0
        TM=1; AT=0
        echo "$HEADERS"
        continue
    fi

    # Transaction lines start with a date; consume each fixed-width column
    if echo "$LINE" | grep -q '^[0-9][0-9]/[0-9][0-9]/[0-9][0-9]'
        then if [ $TM -eq 1 ]
            then
            # T holds the unconsumed remainder of the line, Z the column
            # index (both initial values are assumed; consume_col needs them)
            T="$LINE"
            Z=0
            while [ $Z -lt $CMAX ]
                do
                Z=$((Z + 1))
                consume_col
            done
            # Finish the CSV row
            echo
            else echo "error: unexpected transaction" >&2
            exit 1
        fi
        continue
    fi

    # Handle account totals line
    if echo "$LINE" | grep -q "^Account Total:"
        then TM=0; AT=1
        continue
    fi

    if echo "$LINE" | grep -q "^Begin. Bal."
        then if [ $AT -eq 1 ]
            then
            echo -n '"Begin Bal",'
            T="$(echo -E "$LINE" | sed 's/Begin[^$]*//;s/\$  */$/g;s/\$/"$/g;s/ Net Change:  */","Net Change/g;s/\$/,"$/g;s/$/"/;s/   *//g;s/^",//')"
            T2="$(echo -E "$T" | cut -d\" -f1-7)"
            T3="$(echo -E "$T" | cut -d\" -f7-)"
            echo $T2,$T3
            else
            echo "error: unexpected totals line" >&2
        fi
        continue
    fi
    if echo "$LINE" | grep -q "^Grand Total:"
        then
        pad_col 0; pad_col 0
        echo '"Grand Total"'; pad_col 1
        continue
    fi

    # Output error (unknown line)
    echo "ERROR: Unknown data while processing line $LN" >&2
    echo -E "$LINE" >&2
    exit 1
#    echo -E "$LINE"

done < "$1"

jdupes Screenshot

What WON’T speed up the jdupes duplicate file finder

Some of you are probably aware that I’m the person behind the jdupes duplicate file finder. It’s amazing how far it has spread over the past few years, especially considering it was originally just me working on speeding up fdupes because it was too slow and I didn’t really have any plans to release my changes. Over the years, I’ve pretty much done everything possible that had a chance of significantly speeding up the program, but there are always novel ideas out there about what could be done to make things even better. I have received a lot of suggestions by both email and the jdupes issue tracker on GitHub, and while some of them have merit, there are quite a few that come up time and time again. It’s time to swat some of these down more publicly, so here is a list of things that people suggest to speed up jdupes, but won’t really do that.

Switching hash algorithms

One way I sped up jdupes after forking the original fdupes code was to swap out the MD5 secure hash algorithm for my own custom “jodyhash” fast hash algorithm. This made a huge difference in program performance. MD5 is a CPU-intensive thing to calculate, but jodyhash was explicitly written to use primitive CPU operations that translate directly to simple, fast, and compact machine language instructions. Since discovering that there were some potentially undesirable properties to jodyhash (though those properties had zero effect in practical testing on real-world data), the slightly faster xxHash64 fast hash algorithm has been used. Still, there are those who suggest changing the hash algorithm yet again to improve performance further. Candidates such as t1ha are certainly a little faster than xxHash64, but switching to them has no real value. I chose xxHash64 in part due to its containment within a single .c/.h file pair, making it particularly easy to include with the program, but some replacement hash code bases are not so easily included. Even if they were, the hash algorithm won’t make enough of a difference to change anything in any real-world workloads. The problem is that the vast majority of the slowness in jdupes stems from waiting on I/O operations to complete, not from CPU usage. This isn’t true in fdupes, where MD5 is still stubbornly used as the hash algorithm, but jdupes spends a ridiculous amount of time waiting on the operating system to complete disk reads and a very tiny amount of time waiting on hash calculations to complete.

Tree balancing

At one point, I wrote a spiffy bit of tree rebalancing code that would go down the file tree and change the parent-child relationships to more fairly balance out the tree depth for any given branch. The use of a hash algorithm with minimally decent randomization would mostly balance things out from the start, though, so my concerns about excessive tree depth turned out to be unfounded. The tree rebalancing code did nothing to improve overall performance, so it was ultimately scrapped. fdupes tried to use red-black trees at one point, but discarded the implementation for similar reasons of insufficient gains. The file tree built in jdupes tends to balance out reasonably well on its own.
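To see why well-mixed hash values keep a plain binary tree shallow without any rebalancing, here is a minimal Python sketch. It is not the jdupes tree code: the Node class, the insert and max_depth helpers, and the use of random 64-bit integers as stand-ins for file hashes are all assumptions made for illustration. It compares the deepest branch for keys inserted in random order against the worst case of keys inserted in sorted order.

```python
import random


class Node:
    """One node of a plain binary search tree that is never rebalanced."""
    __slots__ = ("key", "left", "right")

    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(root, key):
    """Insert key below root; return (root, depth at which the key landed)."""
    if root is None:
        return Node(key), 1
    node, depth = root, 1
    while True:
        depth += 1
        if key < node.key:
            if node.left is None:
                node.left = Node(key)
                return root, depth
            node = node.left
        else:
            if node.right is None:
                node.right = Node(key)
                return root, depth
            node = node.right


def max_depth(keys):
    """Build a tree from keys in the given order and report its deepest level."""
    root, deepest = None, 0
    for key in keys:
        root, depth = insert(root, key)
        deepest = max(deepest, depth)
    return deepest


# Hypothetical stand-ins for file hashes: random 64-bit values.
rng = random.Random(1)
n = 2000
hashed = [rng.getrandbits(64) for _ in range(n)]

print("random-looking keys:", max_depth(hashed))
print("sorted keys (worst case):", max_depth(sorted(hashed)))
```

Sorted input degenerates into a linked list with one level per key, while hash-like input stays within a small multiple of the ideal depth, which is the reason explicit rebalancing has so little left to fix.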

Delete during scanning

This seems like a good idea on paper (and indeed, fdupes has implemented this as an option), but it’s not a good idea in practice for most cases. It doesn’t necessarily speed things up very much and it guarantees that options which work on full file sets (such as the file ordering/sorting options) are not usable. The straightforward “delete any duplicates as fast as possible” case is improved, but anything much more complex is impossible. The performance boost is usually not worth it, because at best, a few extra file comparisons may not happen. It’s a tempting feature, but the risks and the added complexity for corner cases outweigh the benefits, so I’m never planning to do this.

Comparing final file blocks after first blocks

There are two reasons not to do this. The biggest is that I’ve run tests on large data sets and found that the last blocks of a pair of files tend to match if the first blocks match, so it won’t fast-exclude the vast majority of file pairs seen in the wild. The secondary reason is that moving from the first block to the last block of a file (particularly large files) when using a mechanical disk or disk array will cause a big penalty in the form of three extra (and possibly very long) disk head seeks for every file pair being checked. This is less of an issue on a solid-state drive, but remember that bit about most files having identical end blocks if they have identical start blocks? It’s a good idea that only slows things down in practical application on real-world data. Just for an added sting, jdupes uses an optimization where the first block’s hash is not redone when hashing the full file, but the hash of a final block is not reusable in the same way, so the labor would have to be doubled for the full-file match check.
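The reuse of the first block’s hash can be sketched in Python like this. It is only an illustration under stated assumptions, not the jdupes implementation: BLOCK_SIZE is an arbitrary value, the function names are hypothetical, and hashlib.blake2b stands in for xxHash64 because it ships with Python. The point is that a streaming hash keeps its state after the first block, so the full-file hash can continue from there, while a hash of the final block has no state that a front-to-back pass could resume.

```python
import hashlib
import os

BLOCK_SIZE = 4096  # assumed first-block size, chosen only for this example


def first_block_hash(path):
    """Hash only the first block; return (digest, live hash state)."""
    state = hashlib.blake2b(digest_size=8)
    with open(path, "rb") as f:
        state.update(f.read(BLOCK_SIZE))
    # digest() does not finalize the object, so the state can be resumed later.
    return state.digest(), state


def full_hash(path, state):
    """Finish the full-file hash by continuing from the first-block state."""
    resumed = state.copy()
    with open(path, "rb") as f:
        f.seek(BLOCK_SIZE)  # the first block is already in the hash state
        while chunk := f.read(1 << 16):
            resumed.update(chunk)
    return resumed.digest()


def probably_duplicates(a, b):
    """Staged rejection: size, then first-block hash, then full-file hash."""
    if os.path.getsize(a) != os.path.getsize(b):
        return False
    digest_a, state_a = first_block_hash(a)
    digest_b, state_b = first_block_hash(b)
    if digest_a != digest_b:
        return False
    return full_hash(a, state_a) == full_hash(b, state_b)
```

Every read in this sketch moves forward through the file, so no extra seeks are introduced. Adding a last-block check would mean a jump to the end of each file plus a second hash whose work cannot be folded into the full-file pass.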

Comparing median blocks

The rationale is similar to comparing final blocks, but slightly different. The seeks are often shorter and the chances of rejection higher with median blocks, but all of the problems outlined for final blocks are still present. The other issue is that median blocks require a tiny bit of extra work to calculate which block is the median block for a given file. It’s added complexity with no real reward for the effort, just like final blocks.

Cache hashes across runs

This is actually a planned feature, but there are massive pitfalls with caching file hashes. Every loaded hash would have to be checked against a list of files that actually exist, requiring considerable computational effort. There is a risk that a file’s contents were modified without the cache being updated. File path relativity is an issue that can get ugly. Where do you store the database, and in what format? How do you decide to invalidate cache entries? The xxHash64 fast hash algorithm might also not be suitable if such persistent hashes are to be relatively safe to use, implying a return to the slowness of secure hash algorithms and the loss of performance that such a change implies. It’s a low-hanging and extremely tempting way to speed things up, but the devil is in the details, and it’s a really mean devil. For now, it’s better to simply not have this around.

Those are just a few ways that things can’t be so easily sped up. Do you have any more “bad” ideas that come to mind?

Google Stadia was guaranteed to fail, according to basic freaking math

If you don’t know what Google Stadia is, it’s basically a networked gaming console. It renders everything on big servers at Google so your tiny Chromecast or other wimpy smart TV or computer or phone or internet-enabled potato or carrier pigeon or whatever doesn’t have to do any of the rendering work, and it takes inputs and sends fully rendered video frames over your network connection in a similar manner to a video streaming service. The idea is that you plug in a control pad, download the Stadia app, and you can play games without buying any special hardware. It’s a revolution in video gaming! It’s the end of home consoles!

…and it was guaranteed to be dead on arrival…and anyone with the most basic knowledge could have figured this out, but Google somehow green-lit it.

Anyone who looks at a typical ping time on a home internet connection, even a good one, can easily figure out why Stadia was doomed to be trash from the outset. A game running at any remotely usable frame rate (I’d say 20fps is a minimum for pretty much anything at this point) needs to receive inputs, process inputs, do all the game logic calculations for the next frame, render the next frame, and blit the frame, and for a 20fps frame rate, a game on your normal system has 50ms total to completely turn that around. If you are playing a faster action game that requires real-time control, you need higher frame rates than that, meaning even lower total latencies than 50ms.

Now let’s look at ping times from my house on my otherwise completely unused connection to

Pinging [] with 32 bytes of data:
Reply from bytes=32 time=33ms TTL=42
Reply from bytes=32 time=33ms TTL=42
Reply from bytes=32 time=32ms TTL=42
Reply from bytes=32 time=32ms TTL=42

OK, so a ping round-trip takes 32ms, leaving 18ms to do everything included above. BUT WAIT, THERE’S MORE: Stadia can’t send uncompressed frames, because that will take too long to arrive, so there’s compression overhead as well, meaning there’s also going to be added decompression overhead on the client side. Even with a hardware H.264 encoder/decoder combo, a finite amount of time is still required to do this. Let’s be INSANELY GENEROUS and say that the encode/decode takes 2ms on each side. Now, even before ALL THE STUFF I ALREADY MENTIONED is accounted for, we’re down to 14ms of time left to hit that 20fps frame rate goal. Remember, in the 14ms remaining, we must still process inputs, run game logic, and render out the frame to be compressed…and this is also an ideal situation assuming an otherwise completely unused connection with no or very minimal network congestion going on. This also assumes that input comes in as early as possible, which is basically never the case. There will almost always be at least one frame of input lag just because of this.

Even if you reduce the goal frame rate to 15fps, the total time available between frames only rises from 50ms to about 67ms. While that does roughly double the time available to run game logic and render a frame (from 14ms to about 31ms), it’s still a really short time frame, and any network usage by any device on the same connection or other households on the same shared network node will essentially render this work pipeline unusably slow. Multiplayer gaming with client-side rendering has the advantage of only sending extremely small packets of data that transmit quickly and act as “commands” for the client software, meaning all existing multiplayer network gaming is sort of like a specialized computing cluster for that game, with the heavy lifting done where the latencies have to be the lowest. Stadia combines all of the horrible problems of live video streaming with the problems of multiplayer latencies. It was dead on arrival. It is destined to fail.
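If you want to re-run this arithmetic with your own ping times, here is a small Python sketch. The function names are made up for illustration, and the defaults are simply the figures used in this post: a 32ms round trip and a generous 2ms each for encode and decode. It subtracts the fixed network and codec costs from the per-frame time budget and reports what is left for input handling, game logic, and rendering.

```python
def frame_budget_ms(fps):
    """Total time available per frame at a given frame rate."""
    return 1000.0 / fps


def remaining_ms(fps, ping_ms=32.0, encode_ms=2.0, decode_ms=2.0):
    """Time left for input, game logic, and rendering after fixed overhead.

    The defaults are the illustrative figures from this post, not measurements
    of any real streaming service.
    """
    return frame_budget_ms(fps) - ping_ms - encode_ms - decode_ms


for fps in (15, 20, 30, 60):
    print(f"{fps:>2} fps: budget {frame_budget_ms(fps):5.1f} ms, "
          f"left for the game {remaining_ms(fps):6.1f} ms")
```

A negative result means the frame budget is already spent on the network round trip and the codec before the game has done any work at all.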

Anyone with simple networking and gaming knowledge can figure this out.

But a multi-billion dollar international corporation that snarfs up the best and brightest minds somehow missed it.

Let that sink in.

Camcorder and microphone on rock above waterfall

Should beginner videographers learn photography first? Yes and no.

(This is my response to the question in the title, posed somewhere on Reddit.)

Filmmaking is a combination of creative writing, audio recording, photography, and motion handling. There are so many things that go into even the simplest decent-seeming video production work that it’d be difficult to say “learn this first” to any one of them. You need all of them or you’ll have glaring deficiencies in your skill set. Even “just a guy who points cameras” benefits from understanding the editing process, how audio works, etc.

That being said, I got into photography as a hobby in 2010 when I purchased my first DSLR, and it was definitely a huge benefit by the time I got the filmmaking itch around 2015. Understanding composition, lighting, and manual controls is absolutely critical to good filmmaking, and you can experiment with all of that in photography. Things like audio can be learned with education and a little bit of experimentation, but composition is difficult to teach since it’s an artistic thing more than a technical one. You can learn about handy shortcuts like the rule of thirds and still take a very poorly composed photo.

When I started offering my video services professionally instead of just making short films in my backyard and office for fun, I had been doing photography for 7+ years and filmmaking as an occasional hobby for about 2 years. The biggest problems I ran into once I started professional work were as follows:

  • Audio can require a lot of experimentation to get right, and having good audio gear is extremely important. My Zoom H4n has been the best tool in my toolkit. It was hard dropping $200 on a recorder, but I challenge anyone to get better audio on a budget than my H4n attached to the podium with a SmallRig double-ball arm clamp. Shotgun mics and booms look cool, but are not appropriate for everything.
  • Poor gear choices from photography plagued me. I have a Targus (read: real cheap) tripod and a Manfrotto Compact Advanced ($90, pretty nice for photography, not a great choice for any kind of pan/tilt video work) and I had two video cameras. I bought a Magnus VT-350 7ft fluid-head tripod because the pan/tilt motion was so sticky on the other two and I had a severe problem with people walking in front of the camera during a packed event. At another event, I put the wide camera on the Magnus to avoid the people problem and was stuck with my manually operated camera and telephoto lens on a sticky tripod, ruining 70-80% of my close-ups due to the painful jerks when I’d move anything. I ended up buying another VT-350 that night and had it before the other two shows they were doing. Know what gear you need to have and spend the money on good support hardware. The VT-350 is still a cheap tripod and suffers from some issues like low weight and a little flex in the plastic QR plate, but in practice these are not major issues. GET GOOD GEAR.
  • I didn’t want to spend $25 on gaffer’s tape. It seemed stupid to pay that much for tape. BUY GAFFER’S TAPE. Pro tip: also buy a small roll of glow-in-the-dark gaffer’s tape and tape it to stuff like your tripod and wires so they’re very visible at events.
  • Every hour you spend in pre-production work will save you two or more hours in production and post-production. Anything you can plan ahead will spare you tons of pain. Arrive 90-120 minutes before an event begins to set up so you can test your stuff way before the people show up. Write and revise a script a couple of times before you shoot interviews or a wedding or anything else that requires storytelling; don’t “do it live” because you’ll burn tons of time planning on-the-spot and produce an inferior work product while doing so. Make sure your equipment is good to go the day before a shoot, with charged batteries and empty memory cards and bags all packed and all required wires and adapters accounted for.
  • Clients generally don’t know jack about video, and nothing prepares you for dealing with them and their grand dreams or demands. Think of yourself as the guy with cameras and lenses and light kits, and then think of the client as the guy with an overpriced iPhone that loves shooting in that fake bokeh wannabe “portrait mode.” These people might understand videography, but more likely they’ll think that you can do anything they’ve ever seen done on YouTube or cable TV. You’re going to have to explain to them exactly what can and can’t be done, and temper their expectations. No, you don’t have a camera boom like they used at that concert on TV, so those cool sweeping shots aren’t going to happen. Be polite but firm on what you can and can’t do. If they want something more than you have, they’re gonna pay for the required rentals.
  • Video is photography with motion. This seems like a silly and obvious point, but it’s a major problem when moving out of photography to video work, especially for someone else. If you do event coverage or sports especially, you’re going to have to track subjects that move in ways you can’t easily predict. You’ll have to learn how to do this one way or another, and it’s really hard at first. The best thing to do is to leave enough room around the subject to allow for your reaction delay without losing them when they move around. A field monitor can be especially handy for sports video. Don’t let your shoots get compromised by a sudden movement. If you need practice, go outside to a place with birds or dragonflies or other fast-moving natural things, and take something telephoto (a camcorder with a nice optical zoom will do), and try to anticipate their movements and keep them in frame as much as possible. It will get easier as you practice it more.

One thing to note is that the lines between photography and videography are blurring. I recently helped a local mayoral candidate with video and photo work, but the only traditional photography involved was the portraiture. All of the photos on the site are really just 4K frame grabs. I shot the 4K footage with the intent of frame-grabbing any needed photos later, so I used a 1/100-1/125 shutter instead of 1/60 to significantly reduce motion blur. It makes the video portions a little less smooth-looking, but it’s worth it for the ability to pull clean 8MP photos out all day long.

Comparison of brown color reproduction between film and digital camera

PROOF: Follow-up to my “Vox Media says light is racist and that’s stupid” video

Vox published a video a few years ago about how “color film was made for white people.” There were two major claims that dictated the entire framework of the video:

  1. Color film made dark-skinned people look really bad, especially when white people were in the same frame, and
  2. Manufacturers of color film left out chemicals that “would bring out certain red and brown tones.”

Anyone who understands a decent amount about photography and especially about beloved ancient color films such as Kodachrome 25 can easily debunk point 1, because it’s a simple and visually very obvious matter of poor dynamic range and decisions about exposure. Old film has 6-7 stops (a “stop” is an exponential change, where a stop of difference is a doubling or halving of light) of dynamic range, but new film and most digital cameras have roughly double that many stops. When you expose for correct midtones on old film stocks, you would inevitably lose all your darker and lighter areas, meaning an outdoor photo would have little to no texture on clouds and very bright surfaces and anything about 3 stops lower than the value exposed for would be very dark and featureless. New film has no such problem, and digital cameras and camcorders don’t sense images the same way as film.
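Because stops are exponential, the gap between 6-7 stops and twice as many stops is far bigger than it sounds. Here is a quick Python sketch of the arithmetic. The function name is just an illustration, and the 12 and 14 stop figures are simply double the 6-7 stops quoted above, not measurements of any particular film or sensor.

```python
def contrast_ratio(stops):
    """Each stop doubles the light, so N stops span a 2**N to 1 range."""
    return 2 ** stops


for label, stops in (
    ("old film, low end", 6),
    ("old film, high end", 7),
    ("twice the stops, low end", 12),
    ("twice the stops, high end", 14),
):
    print(f"{label}: {stops} stops = {contrast_ratio(stops)}:1")

# Something 3 stops below the exposed value receives this fraction of the light.
print("3 stops under:", 1 / contrast_ratio(3))
```

Doubling the number of stops squares the contrast ratio rather than doubling it, which is why a subject a few stops under the exposed midtone falls into featureless shadow on an old stock but still shows detail on a newer one.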

Debunking “racist chemistry” is hard

But what of point 2? The color film formulation claim is not so easily dismissed with simple physics and easily researched facts about film stocks. I’m not personally willing to do the level of digging required to find out what film chemistry was like in the 1940s, assuming that such documentation still exists and is somehow still accessible. However, I was poking through my scanned color film negatives one day, and I was surprised to discover that I had taken a photograph that might illustrate that Vox’s “no brown tones in the chemistry” claim was a lie. It was a dark brown wooden coffee table with some very slight reddish tones in the finish. Shuffling through other pictures I had, I found a cell phone picture from a year or two prior that contained the same table, taken with different lighting but the differences were way too visually obvious to resist. In my film photo, it looks almost like a cinnamon finish even in the shadows and even though the rest of the photo has good white balance. In my cell phone photo, the darkness of the finish is obvious, and the two brown tones look almost like different tables entirely.

The film photo was taken on good old standard-issue FujiFilm ISO 400 from Wal-Mart that expired roughly around 2006 and uses the C-41 color film process for development. The phone photo was taken on a relatively cheap ($150 or so?) Android cell phone bought around 2014-2015.

How is this still a problem, Vox?

The problem with Vox’s claims becomes immediately apparent. Sure, Kodak is an American company that makes film with the (majority white) American market in mind, so it’s at least plausible that Kodak might not have added the necessary chemistry for various reasons (cost, complexity, or perhaps even the Vox video’s implications of racism), but FujiFilm has always been a Japanese company that would operate primarily with the Japanese market in mind. Japanese people have a wide variety of skin tones with plenty of variations of olive, pink, and yes, the notoriously “left out of the chemistry” brown. The film stock I used is also from the 2000s, well after the racist film problem was supposedly solved to appease wood furniture sellers and chocolate makers. Here’s the relevant screenshot from the video, in case you haven’t watched yet:

Comparison of brown color reproduction between film and digital camera
Oh no! Vox Media’s political narrative is crumbling! The shock! The horror!

In case it’s not clear, the Android phone photo’s color is pretty close to the actual color of the table, but the film photo is way off, even if you only look at the shadows and ignore where the sunlight is hitting it.

So, if Vox Media’s video about racist color film is correct about their film chemistry claims, why would a Japanese company with a target market full of colorful brown people put out a film stock many years beyond the “fixed brown tones” mark that doesn’t reproduce brown tones accurately? Are we to believe that Fuji is racist against their own people? No, that’s ridiculous, just like Vox’s claims of racist film chemistry are ridiculous. Fortunately, I have come up with a much simpler explanation that makes a lot more sense.

Brown makes brown look bad

FujiFilm ISO 400 C-41 film negative

The picture above is not just any film negative; it’s the first C-41 color film negative I ever developed on my own, and it’s one of about six rolls of Fuji ISO 400 that I got with my Canon T50 film SLR when I bought it and picked it up in a literal hurricane a few years ago. What color is the negative material outside of any photos? If you said “brown” then congratulations, you have fully functional eyesight. A negative must be converted to a positive before it can be used as a normal photo, so the colors must be inverted. Here’s how that would look when done on a computer:

Inverted image of color film negative

Ouch. Instead of the brown stuff, we now see the lovely color cyan. Cyan is the inverse of orange, so brown (dark orange) will invert to dark cyan. To fix this, we’ll have to remove a lot of cyan from the image…

Color film negative, inverted and color-shifted to restore normal color balance

That’s not perfect, but close enough for this demonstration. Basically, you have to artificially boost red and lower green and blue to get the original image from the inverted negative. (No, I didn’t try very hard for this demonstration, so don’t complain.)

IrfanView color correction for film
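The invert-then-correct step can be sketched in a few lines of Python. This is not what IrfanView does internally. The MASK value is a hypothetical RGB reading for an orange-brown film base (I did not measure my negative for this), and the function names are made up for the example. The sketch inverts that base color and then works out the per-channel gains needed to pull the resulting cast back to neutral.

```python
def invert(rgb):
    """Invert an 8-bit RGB color, as a plain negative-to-positive flip does."""
    return tuple(255 - c for c in rgb)


def neutralizing_gains(cast):
    """Per-channel multipliers that turn a color cast into neutral gray."""
    target = sum(cast) / 3
    return tuple(target / c for c in cast)


def apply_gains(rgb, gains):
    """Scale each channel and clamp to the 8-bit range."""
    return tuple(min(255, round(c * g)) for c, g in zip(rgb, gains))


# Hypothetical film base color, chosen only for illustration.
MASK = (214, 132, 84)

cast = invert(MASK)
gains = neutralizing_gains(cast)

print("inverted base:", cast)
print("gains (R, G, B):", tuple(round(g, 2) for g in gains))
print("neutralized base:", apply_gains(cast, gains))
```

With a base color like this one, the red gain comes out well above 1 while the green and blue gains come out below 1, which is the “boost red, lower green and blue” adjustment in numeric form. Every tone in the photo gets pushed through the same gains, including the browns that were weakly recorded to begin with.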

The positive image being heavily skewed towards the inverse color of brown means that reproducing brown with color negative film is only possible with a reduced level of accuracy. Brown in particular will be reproduced less accurately than its brighter relative (orange) because brown already has a weaker effect on the film due to being a darker color, plus it’s fighting a heavy color shift towards its inverse. This also affects the reproduction of cyan (obviously), but unless you’re spending your whole day photographing the lichen Xanthoparmelia with color film for some reason, you won’t see enough cyan in nature (or even outside of nature) to notice the reduced color quality. Anytime you shift the tint of an image, you necessarily artificially reduce or increase the amount of a color that can be accurately reproduced. While film is an analog medium, it has its limits just like any digital image, and the more you “push” or “pull” that image, the more observable those limitations become.

Imperfect proof, but quite sufficient

This doesn’t offer definitive proof that Vox’s claims of racist color film chemistry are false, but it heavily strains credulity that the cause of poor reproduction of “certain brown and red tones” was racist film chemistry formulations when all of that was supposed to be a problem before the 1990s (at the latest!) and a film stock from the 2000s, made for a market full of brown people and sold in stores all across the globe, still exhibits the same exact issues. The brown backing of C-41 color film and the tricks required to neutralize the effect of that brown tint only further erode support for the notion that it’s a problem of “oops, we left out the brown people ingredients” film chemistry.

If I have to choose between “racist conspiracy of white America that somehow still applies to film made for countries full of brown people” and “the brown backing makes it harder to reproduce brown because you have to remove the brown to make it look normal,” I am definitely going to pick the latter. It’s a simple explanation that can be easily observed and tested in an imaging program, unlike an elaborate conspiracy theory that is presented by notorious social justice race-baiters and doesn’t fit easily observed facts.

(I’m also never going to let them live down that trick in the original video where they used Kodachrome with very bad dynamic range as the “black people looked bad” example and much newer Kodachrome with good dynamic range as the “white people looked good” example. You dirty lying bastards knew exactly what you were doing when you chose those two photos.)