A pigeon in Washington, D.C.

Politeness and the Pigeon

Politeness is briefly defined as “the practical application of good manners or etiquette so as not to offend others.” It is an unspoken set of restrictions on how each person can speak and what they’re allowed to speak about. The purpose of politeness is to avoid behaviors and topics that could provoke hostility in others. Such rules are important in a group of people that want to get along…and then the pigeon shows up.

A pigeon in Washington, D.C.

Some say life is like a game of chess. If you’re having a party where people are enjoying games of chess, your door is open, a pigeon flies into your house, lands beside a chess board, and moves a piece…well, isn’t that neat! This pigeon that flew in here is playing chess! You might be kind to the pigeon, giving it a little cup of water and trying to play a mock chess game in return. You and everyone else at the party are being entertained by the chess-playing pigeon even if it doesn’t play chess very well. There’s no harm in letting the pigeon play chess. It’s just doing what everyone else is doing, albeit in a strange way that doesn’t quite fit in with the rest of the group. While some say life is like a game of chess, others say something like this:

“Never play chess with a pigeon. The pigeon just knocks all the pieces over. Then poops all over the board. Then struts around like it won.”

Politeness is self-censorship to avoid conflict. Nature assigns conflict between creatures a very high price; even if you “win” in a fistfight, you’re unlikely to take the victory walk without a limp and a black eye for your trouble. Most people are desperate to avoid conflict because it can hurt them for a long time, be it physically, mentally, or socially. Some people you know may never speak to you again. Your new scar may never stop hurting when something grazes it for the rest of your life. You may fall into a nasty depression from the subsequent fallout and mental flashbacks to the trauma, or from guilt over harm that came to someone that had nothing to do with it. Politeness is a virtue because conflict can have such a high price, but politeness is a nasty vice when conflict is necessary and is prevented.

There is a phenomenon known as the “bystander effect” where individuals are less likely to offer help to a victim in the presence of other people. The two major factors behind it are diffusion of responsibility (in a group of 10, you’re only 10% responsible, so why should you step up?) and compulsion to behave in a socially acceptable manner–otherwise known as “politeness.” If the rest of the group doesn’t step in to stop a bad actor then it’s very likely you won’t either. The “stupidity” of people when they’re in groups is well-known both in psychology and by anyone who observes the behavior of people around them. It’s polite to go along with the group, but it’s not necessarily morally or ethically correct. The first person in a group to respond tends to control the group by doing so, and that person will always be the person who ignores group dynamics, be it by choice (often called “leadership”) or by nature (sometimes socially ignorant, sometimes outright malicious).

Rules governing interpersonal conduct only have value when they represent a net positive; that is, they prevent unnecessary conflict without blocking necessary conflict. Consider someone breaking into your home, jacking your car, or raping your daughter. Most people aren’t willing to extend politeness to this person; in fact, most would happily extend a bullet, a golf club, a baseball bat, or a fist to them instead. Self-defense against a grave threat to your life, liberty, or property is an example of necessary conflict. Demanding that people “be polite” and “play nice” with violent criminals forces them to either become victimized as much as possible or to clearly violate the rules for the sake of self-preservation. This example is presented because it’s painfully obvious, but the same logic applies for lesser “crimes” such as that of the pigeon at the chess party.

Gangster pigeons
Gangster Pigeons by Destiny3000

Witnessing the poop-laden chess board, the scattered chess pieces and feathers, and the pigeon strutting around without care for what anyone in the room thinks, it becomes obvious to all that the pigeon should have been caught and thrown out right away rather than catered to out of politeness and curiosity. The front door was left open and the host politely allowed the pigeon to set up shop at a chess board without resistance; now there’s a mess and everyone at the party is suffering. Someone could step up, take down the pigeon, and remove it from the party, but no one wants to be the person to break the group dynamic and manhandle the bird, plus at least a couple of “sidewalk superintendents” that like to police group compliance are guaranteed to scold them loudly for being “brutal” to the “poor bird” despite what it has just done.

Eventually, one person decides that enjoying the party is more important than being polite to the creature that is trashing it and that the bite of the politeness harpies can be endured. Grabbing a nearby jacket, this person chases the pigeon throughout the house. The pigeon chaotically runs and flies around to escape, knocking over more things and pooping another time or two in the process, sometimes flying towards people who yell and jump out of the way. The hecklers screech about how the hunter’s interference is only causing the party to become even more wrecked. Others in the group also start demanding that the cruel and fruitless attempts to capture the antagonistic pigeon be halted. All of this takes only a minute or two to unfold and escalate.

Where do we go from here? There are three choices: the pigeon can be caught and removed and the party can resume, the pigeon can be left alone and the party can end prematurely on a sour note as a professional is called to remove the pigeon, or the most polite path can be taken: the party can resume despite the pigeon’s presence. In the polite case, people pretend the stress of the pigeon’s presence doesn’t bother them and the party slowly dies with the pigeon rewarded for its troubles by doing as it pleases. In the early ending case, everyone is unhappy and the pigeon controls the entire house for hours, rewarded for its troubles by doing as it pleases.

In the first and most ideal scenario, the pigeon is removed in a few minutes, the hecklers and those who supported them quiet down so people hopefully forget that their demands were demonstrably the wrong ones, the mess is cleaned up in a few minutes, the party resumes, and everyone walks away with an interesting story to tell. The person that removed the pigeon wasn’t polite to the pigeon, the hecklers that followed, or the heckler add-ons that followed again, but in the end everyone was better off because the pigeon was kicked out.

The moral of the story is that sacrificing conflict at the altar of politeness without regard for necessity is a race to the bottom. Another is that people who make and enforce the rules should enforce the rules from the start, not later on after things get messy. None of this would have been necessary if someone had stopped the pigeon from coming into the house in the first place.

Hope to see you never the fuck again
If you don’t buy a shirt from Effin’ Birds then we can’t be friends.

The Predatory Lie of YouTube Success

Whenever you watch a video about how to be successful on YouTube and why your videos aren’t performing, you’ll always hear that “YouTube’s algorithm just shows people what they want to see.” You’ll be told all about how YouTube does a fantastic job of figuring out what kind of content people want and funneling the vast majority of user traffic towards that content. If your channel is failing then it’s because you’re doing something wrong. It’s always your fault and never YouTube’s fault. If you just put more work into getting it right and giving your viewers what they want then you’ll definitely succeed at showing the algorithm that your content is worthy of becoming the next big thing on YouTube.

This is a lie. I know from experience.

My YouTube channel spent many years “at the bottom.” I made a video in 2021 about a hot topic at the time: Windows 11 and how the sinister requirements Microsoft published for running it fit into a greater threat to general-purpose computing (TCPA, Palladium, Trusted Computing, you name it) that has been brewing since at least 2003. There was no shortage of Windows 11 coverage in the news when I published that video one week before it went release-to-manufacturing (RTM). The hardware cutoff for upgrading was absurdly strict, only allowing PCs built in the past 3-4 years to get it. The noise around Windows 11’s release should have shoveled heaps of traffic at my video. Instead, the first 300 days of publication on YouTube only caught about 20,000 views.

Then something crazy happened. Near the end of April 2022, two months shy of a year after publication, the video began taking in THOUSANDS of new views daily, then TENS OF THOUSANDS of new views. The view count had doubled by the end of April. Eighteen days later, the view count had multiplied tenfold. Two months later, the meteoric rise in view counts slowed at 1.66 million views, finally settling down in the 1.8 million view range.

If YouTube’s algorithm really shows the viewer what they want…why did a video worth 1.8 million views stagnate for 10 months before taking off? The reason is deceptively simple: while “The Algorithm™” might generally work the way that the YouTube fanatic channels claim it does, the truth is that YouTube really is a lottery. New and stagnant videos are shown to an extremely small number of people at a time. If you’re lucky enough for that absurdly tiny group to have a few click-throughs and long watches of your video, YouTube runs the experiment again by showing it to more people, and continuously tests your content this way to decide if it should be allowed to succeed or fail. Bigger channels get a bigger group of people to enable their content to “take off” since they’ve “proven themselves” to the algorithm as reliable drivers of views. This sounds like a brilliant way to come up with good video suggestions at first glance–but it’s actually pretty terrible.

Your video’s success relies on a few watches by a very small group of people. What happens if it’s the wrong group of people? Your video fails. The planet has 8 billion people, and the vast majority of them won’t care about your content in the slightest. If you’re a big channel then this doesn’t really affect you due to the size of your subscriber base and the favoritism shown towards larger channels, but if you’re a smaller channel like mine was (and is, if we’re being honest) then it really is nothing more than a lottery with the odds leaning very strongly against you. This ignores all the rug-pulls YouTube does to force you to make more content at no cost to them, constantly, forever.

That’s the truth they don’t want exposed. Don’t waste your time trying to succeed on YouTube. The pay you’ll get will never fully justify the amount of work required, and YouTube will gladly drive you into the ground to milk you for every last ad dollar.

Web Environment Integrity Must Be Stopped title card

Web Environment Integrity Must Be Stopped: Enslavement By “Remote Attestation”

“Web Environment Integrity” is a specification being worked on by Google employees and prototyped in Chromium, the open-source core of Google Chrome on which most modern browsers are based (Microsoft Edge, Brave, Opera, Vivaldi, and many others). The short version of how it works: your browser gets a permission slip from a big tech company like Google, Microsoft, or Cloudflare that certifies whether or not you’re “trustworthy,” and websites can use that certification to gatekeep access. It is supposed to help with spam, bots, and fake ad clicks, but the goals laid out by the project directly conflict with one another and therefore cannot all be achieved. One such goal is “for the web to be usable without this framework,” yet the entire point of the framework is to lock down the web, so which half will Google choose not to achieve?

Let there be no doubt: if Google manages to get this “remote attestation” framework in use on most websites, they’ll have the power to unilaterally lock people out of most of the internet. Worse, website owners will have no way to know when a “trusted third party” maliciously “attests” that a legitimate user is actually a spam bot, tricking the site into locking that user out.

Fortunately, the internet has decided to call them out on their nonsense, and in typical central authoritarian fashion, they’ve locked the thread so no one can keep burning them publicly on their own absurd proposal.

Monkey wrench

Right to Repair: Why Freedom and Access Are So Important

The one huge, important thing about repairing things yourself is the same huge, important thing about maximally empowering individuals to conduct their own lives as they see fit: no one but YOURSELF can respond to your needs anywhere near as quickly as you can, and the consequences of failing to take care of a problem immediately can sometimes be catastrophic.

“When seconds count, police are only minutes away” comes to mind. A huge farm machine breaking down right before a harvest is another. In these cases, if you can’t handle it yourself or hire someone nearby who knows how to handle it and you have to wait on the lumbering bureaucracy, you end up with an irreversible catastrophe (battery/assault/death or a ruined million-dollar harvest) while waiting on the “authoritative responder” to show up.

Obviously, these are some of the worst cases, but this applies even to less severe things. If none of the plumbing in your house will drain properly and you’re willing to look up how to unclog a drain, buy a plumbing snake, and get your hands a little dirty, you can clear the pipes yourself in a couple of hours and take a satisfying poop afterwards to celebrate being able to flush again. Your other options are to wait a few days or more for a plumber to show up and do it for you OR to desperately look for an emergency plumber, pay a huge premium for immediate service, and still wait for them to show up and fix it. Some people might opt for the plumber, some might not.

Now imagine that all the drains in the house have locks on them and you aren’t allowed to snake your own drain even if you know how to and want to do it…and then, just to make it ten times worse, you can’t hire the fastest plumber or the best plumber that can respond quickly–you have no choice but to call the drain manufacturer and wait for a Massive Dump Tunnel Inc. authorized poop tube technician to come out a month later at a premium monopoly expense.

I hope you have great income and some mad poop retention skills, because you’re gonna be a sad, constipated panda.

Transformative use of Serin Jameson's Star Trek Shitposting 6th Anniversary artwork

Copyright Troll & DMCA Abuser Serin Jameson Learns About Fair Use

Updated 2021-11-26 to change Serin’s pronouns to be ambiguous (and thus more degrading) based on a comment whining about them. Be sure to read the comment for a great example of “tolerance.”

Serin Jameson is the founder and artist for a Facebook group called “Star Trek Shitposting.” I discussed the group and its toxicity in an unedited, rambling video six weeks prior to this post. It took that long for someone in the group to notice the video. What followed was a 109,000-person Facebook group descending upon my video to attempt to make me feel bad for mocking them, with literally thousands of comments being posted, many of which proved that the loudest and most accepted people in that group are indeed horribly toxic and evil people.

Sadly for them, I don’t care what they think about me, and their insults either resulted in pity or amusement on my end.

However, Serin Jameson is the lone figure who stands above all others for taking a good old-fashioned pile of internet banter way too far, jumping straight into the realm of brazenly breaking the law.

Serin Jameson's copyright abuse
Serin Jameson’s copyright abuse

To understand the problem, one must first understand the concept of “fair use” within United States copyright law. In general, copyright law grants the rightsholders of a work the exclusive right to distribute their work for a limited period of time to encourage the ongoing creation of new works. (There are several issues with existing copyright law, including the “limited” term being insanely long, but that’s a conversation for another post.) U.S. copyright law explicitly carves out exceptions to this exclusive right, and the only one that’s important for most people is what is known as the Fair Use Doctrine. This portion of the law lays out four factors by which a fair use exception is judged. A fairly comprehensive explanation of the “four factor test” is available from the University of Texas. I won’t go over too many of the details here, but suffice it to say that my use of Serin Jameson’s artwork for the purpose of criticism and commentary combined with my heavy transformative use of the work place my use squarely within the bounds of the Fair Use Doctrine.

Transformative use of Serin Jameson's Star Trek Shitposting 6th Anniversary artwork
Transformative use of Serin Jameson’s Star Trek Shitposting 6th Anniversary artwork; from top left: original, transparency added, upper/lower layers added, completed image

Serin Jameson used (via a YouTube form) a provision of the Digital Millennium Copyright Act (DMCA) that allows rightsholders to send a DMCA takedown notice to an online service provider to notify them of posts that infringe on their copyright and have them taken down. My work constitutes plainly obvious fair use, so this action was inappropriate, and Serin Jameson knew this to be the case, and has thus stomped into Section 512(f) of the DMCA, which states (irrelevant portions excluded):


Any person who knowingly materially misrepresents under this section…that material or activity is infringing…shall be liable for any damages, including costs and attorneys’ fees, incurred by the alleged infringer.

Because Serin Jameson is invoking U.S. law regarding material stored on U.S. company servers by a U.S. citizen, jurisdiction here is clearly and exclusively within the United States, despite Serin Jameson apparently being in Australia. United States case law requires that Serin Jameson perform a fair use analysis of the work in question prior to filing a DMCA takedown request (Lenz v. Universal Music Corp., 801 F.3d 1126 (9th Cir. 2015)). Failure to do so violates the declaration of good faith required by the DMCA in any DMCA takedown request.

Based on the publicly available communications of Serin Jameson, my email notifying it of its failure to conduct a fair use analysis, and its public confirmation of receipt of that message, Serin Jameson fits all of the requirements under the DMCA to be held legally liable for filing a false DMCA takedown against my content. I have already sent a counter-notification to YouTube.

I am posting this as both a lesson and a warning: you should not abuse the law to silence people you don’t like.

Also, for the curious: yes, United States judgments are enforceable in Australia. Serin Jameson could find itself on the receiving end of some Hague secret sauce.

Finally, as they say on the internet, I’ve brought receipts.

Serin Jameson Facebook post 1 - planning a false DMCA takedown
Serin Jameson Facebook post 1 – planning a false DMCA takedown
My email to Serin Jameson to resolve the conflict
My email to Serin Jameson to resolve the conflict
Serin Jameson Facebook post 2 - confirming receipt of my email
Serin Jameson Facebook post 2 – confirming receipt of my email
Hard drive platter spinning

Don’t destroy used hard drives! Wipe them and reuse or sell them with confidence!

The advice to never use a hard drive from eBay is bad advice. You should fully wipe the drive (a zero fill will do) and then install a new OS on it. The old data will be 100% unrecoverable and you won’t unnecessarily destroy a perfectly good piece of equipment. Please don’t advocate for this kind of wasteful drive destruction.

Yes, a zero fill is more than enough. The “DoD secure wipe” was designed for hard drives from the 80s and early 90s that used FM/MFM/RLL data modulation. Today, drives use [E]PRML and other advanced techniques instead.

Yes, Peter Gutmann wrote a paper about recovering data from hard drives that suggested you could easily do so, but that was in the era of widespread MFM/RLL drives, and Gutmann himself later walked back his recommendations:

“In the time since this paper was published, some people have treated the 35-pass overwrite technique described in it more as a kind of voodoo incantation to banish evil spirits than the result of a technical analysis of drive encoding techniques. As a result, they advocate applying the voodoo to PRML and EPRML drives even though it will have no more effect than a simple scrubbing with random data. In fact performing the full 35-pass overwrite is pointless for any drive since it targets a blend of scenarios involving all types of (normally-used) encoding technology, which covers everything back to 30+-year-old MFM methods (if you don’t understand that statement, re-read the paper). If you’re using a drive which uses encoding technology X, you only need to perform the passes specific to X, and you never need to perform all 35 passes. For any modern PRML/EPRML drive, a few passes of random scrubbing is the best you can do. As the paper says, “A good scrubbing with random data will do about as well as can be expected”. This was true in 1996, and is still true now.”

Hard drive platter and arm
It’s a miracle that these things work at all.

Quoting Donald Kenney:

“PRML uses a different approach to improving storage density. To permit greater data density, recorded data amplitude is reduced and bits are packed together more closely. The digital signal is then recovered using digital signal processing techniques on the analog data stream from the drive. PRML achieves a 30-40% improvement in storage density over RLL modulation without PRML. EPRML modifies the algorithms used in PRML to achieve additional improvements claimed to be 20-70%.”

The extremely low magnetic amplitude of [E]PRML modulation puts the analog data signal on the platter so close to the noise floor that a DSP must filter the noise just to recover the signal. A simple zero fill pushes the previous (already very weak) signal firmly back into the noise floor. Snatching data from an MFM drive with a scanning tunneling microscope relied on the strong amplitude of the data writes being “messy”: the magnetic domains of previous writes (sometimes multiple layers of them) remained detectable “around” the current write, because the older writes influenced so much of the surface that their magnetic domains “leaked” beyond it and subsequent writes couldn’t necessarily “reach” all of the affected areas.

PRML techniques massively boost data density; doing so makes the margins in which you’d locate this “leaked” data so tight that there isn’t much room for it to exist in the first place, and on top of that, the strength of the write is an order of magnitude weaker. It’s frankly a miracle of modern science that data sitting so close to the noise floor, packed into such an insanely tiny amount of surface area, can be read back at all. One simple overwrite pass will destroy the data beyond the abilities of any given three-letter agency to get it back.

So, in short, a one-pass zero-fill of the drive is enough to “sanitize” the data therein. Please don’t throw away or destroy hard drives just because someone else used them before, and if you’re selling a computer, just wipe the drive completely and your now-destroyed data is perfectly safe from prying eyes.

Python code mistake

I made youtube-dl faster for archivists…and solved a worst-case programming problem elegantly in the process

Update: there is an addendum at the end of this article; I mention it because yes, in the end, I switched over to Python sets. I don’t want any Python experts cringing too hard, after all. Welcome to the joys of a C programmer adapting to Python.

For those who haven’t heard of it, youtube-dlc is a much more actively maintained fork of the venerable but stagnant youtube-dl project that was announced on the DataHoarder subreddit. I have been archiving YouTube channels for many months now, trying to make sure that the exponential boost in censorship leading up to the 2020 U.S. Presidential election doesn’t cause important discussions to be lost forever.

Unfortunately, this process has led me to have a youtube-dl archive file containing well over 70,000 entries, and an otherwise minor performance flaw in the software had become a catastrophic slowdown for me. (Side note: a youtube-dl archive file contains a list of videos that were already completely downloaded and that lets you prevent re-downloading things you’ve already snagged.) Even on a brand new Ryzen 7 3700X, scanning the archive file for each individual video would sometimes progress at only a few rejections per second, which is very bad when several of the channels I archive have video counts in the multiple thousands. The computer would often spend multiple minutes just deciding not to download all of the videos on a channel, and that got painful to watch. That’s time that could be spent checking another channel or downloading a video.

When youtube-dlc popped up and offered goodwill to the people who were treated less than favorably by the youtube-dl project maintainers, I realized that I had a chance to get this fixed. I opened an issue for it, but it became clear that the budding fork didn’t have the resources to dedicate to the necessary improvement. The existing code technically worked, and performance wasn’t as bad for people using much smaller archive files. The youtube-dlc maintainer (blackjack4494) and I quickly identified the chunk of code behind the problem, so I decided to take it on myself.

Discussion about the code causing the problem
There’s our problem!

The troublesome code is cut off in the discussion screenshot, so here’s a better look at it:

Problematic code from youtube-dl
Good programmers probably know why this is slow just from this image.

The code outlined in the red box above is opening the archive file for reading, consuming it line-by-line, and comparing the line read from the archive to the line associated with the candidate video to be downloaded. If a line exists that matches, the video is in the archive and shouldn’t be downloaded, so in_download_archive() returns True. This works fine and is a very simple solution to implementing the archive file feature, but there’s a problem: in_download_archive() is invoked for every single possible video download, so the file is opened and read repeatedly.
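In rough Python terms, the pattern looks like this (a simplified sketch with illustrative names, not the exact youtube-dl code, which lives inside a much larger class):

```python
def in_download_archive(archive_path, video_id):
    """Sketch of the original approach: re-open and rescan the
    archive file on EVERY candidate video."""
    try:
        with open(archive_path, encoding="utf-8") as archive:
            for line in archive:
                if line.strip() == video_id:
                    return True  # already downloaded; reject the candidate
    except FileNotFoundError:
        pass  # no archive yet, so nothing has been downloaded
    return False
```

Each call pays the full cost of opening and scanning the file, which is exactly the overhead described below.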

Some programmers may see this and ask “why is this a problem? The archive file is read for every video, so there’s a good chance it will remain in the OS disk cache, so reading the file over and over becomes an in-memory operation after the first time it’s read.” Given that my archive of over 70,000 lines is only 1.6 MiB in size, that seems to make some sense. What is being missed in all of this is the massive overhead of reading and processing a file, especially in a high-level language like Python.

An aside for programmers not so familiar with more “bare metal” programming languages: in C, you can use a lot of low-level trickery to work with raw file data more quickly. If I were implementing this archive file check code in C (some low-level steps will be omitted here), I’d repeatedly scan the buffer for a newline, reverse the scan direction until extra white space was jumped over (doing what strip() does in Python), convert the last byte of newline/white space after the text to a null byte, and do a strcmp(buffer, video_id) to check for a match. This still invokes all of the OS and call overhead of buffered file reads, but it uses a minimal amount of memory and performs extremely fast comparisons directly on the file data.

In a language that does a lot of abstraction and memory management work for us like Python, Java, or PHP, a lot more CPU-wasting activity goes on under the hood to read the file line by line. Sure, Python does it in 4 lines of code and C would take more like 15-20 lines, but unpack what Python is doing for you within those lines of code:

  1. Allocating several variables
  2. Opening the file
  3. Allocating a read buffer
  4. Reading the file into the buffer
  5. Scanning the buffer for newlines
  6. Copying each line into the “line” variable one at a time
  7. Trimming leading and trailing white space on the line which means
    • Temporarily allocating another buffer to strip() into
    • Copying the string being stripped into…
    • …while checking for and skipping the white space…
    • …and copy the string back out of that buffer
  8. Finally doing the string comparison
  9. All while maintaining internal reference counts for every allocated item and periodically checking for garbage collection opportunities.

Multiply the above by 2,000 video candidates and run it against an archive file with 70,000 entries and you can easily end up with steps 6, 7, and 8 being executed almost 140,000,000 times if the matching strings are at the end of the archive. Python and other high-level languages make coding this stuff a lot easier than C, but they also make it dead simple to slaughter your program’s performance, since a lot of low-level details are hidden from the programmer.

I immediately recognized that the way to go was to read the archive file one time at program startup rather than reading it over and over for every single download candidate. I also recognized that this is the exact problem a binary search tree (BST) is designed to speed up. Thus, my plan was to do the same line-by-line read-and-strip as the current code, but then store each processed line in the BST, then instead of reading the file within in_download_archive(), I’d scan the BST for the string. The neat thing about a BST is that if it were perfectly balanced, 70,000 entries would only be 17 levels deep, meaning each string check would perform at most 17 string comparisons, a much better number than the 70,000-comparison worst-case of scanning a flat text file line-by-line.

So, I set out to make it happen, and my first commit in pursuit of the goal was dropped.

Binary search tree Python code
The workhorse code; a classic binary search tree
Archive preload code that feeds the binary search tree
Archive preload code that feeds the binary search tree

This actually worked nicely!…until I tried to use my huge archive instead of a tiny test archive file, and then I was hit with the dreaded “RuntimeError: maximum recursion depth exceeded while calling a Python object.” Okay, so the recursion is going way too deep, right? Let’s remove it…and thus, my second commit dropped (red is removed, green is added).

Python code change from recursion to a loop
Let’s do it in a loop instead!
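Since the commits only survive here as screenshots, here is roughly what the loop-based tree boils down to (an illustrative sketch with made-up names, not the literal committed code):

```python
class _Node:
    __slots__ = ("key", "left", "right")

    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def bst_insert(root, key):
    """Iteratively insert key into the tree; returns the root."""
    if root is None:
        return _Node(key)
    node = root
    while True:
        if key < node.key:
            if node.left is None:
                node.left = _Node(key)
                return root
            node = node.left
        elif key > node.key:
            if node.right is None:
                node.right = _Node(key)
                return root
            node = node.right
        else:
            return root  # duplicate entry; nothing to do

def bst_contains(root, key):
    """Walk down the tree; at most one comparison per level."""
    node = root
    while node is not None:
        if key < node.key:
            node = node.left
        elif key > node.key:
            node = node.right
        else:
            return True
    return False
```

Because the loop only ever holds one node reference, no amount of tree depth can trip Python’s recursion limit.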

With the recursion swapped out for a loop, the error was gone…but a new problem surfaced, and unfortunately, it was a problem that comes with no helpful error messages. When fed my archive file, the program seemed to basically just…hang. Performance was so terrible that I thought the program had completely hung. I put some debug print statements in the code to see what was going on, and immediately noticed that every insert operation would make the decision to GO RIGHT when searching the tree for the correct place to add the next string. There was not one single LEFT decision in the entire flood of debug output. That’s when I finally realized the true horror that I had stepped into.

I had sorted my archive…which KILLED the performance.

Binary search trees are great for a lot of data, but there is a rare but very real worst-case scenario where the tree ends up becoming nothing more than a bloated linked list. This doesn’t happen with random or unordered data, but it often happens when the tree is populated in-order with data that is already sorted. Every line in my sorted file was “greater than” the line before it, so when fed my sorted archive, the tree became an overly complex linked list. The good news is that most people will not have a sorted archive file because of the randomness of the video strings, but the bad news is that I had sorted mine because it boosted overall archive checking performance. (Since new video downloads are appended to the archive, the most recent stuff is always at the end, meaning rejecting those newer downloads under the original archive code always required the worst-case amount of time.) It is entirely possible that someone else would sort their archive at some point, so I had accidentally set myself up in the worst-case scenario and I couldn’t just ignore it and assume no one else made the same mistake. I had to fix it.
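The degeneration is easy to reproduce with a small experiment (illustrative code, not the project's): feed the same keys to a BST in sorted order and again in shuffled order, and measure how deep the tree gets.

```python
import random

class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    """Iterative BST insert, returning the tree root."""
    if root is None:
        return Node(key)
    node = root
    while True:
        side = "left" if key < node.key else "right"
        child = getattr(node, side)
        if child is None:
            setattr(node, side, Node(key))
            return root
        node = child

def max_depth(root):
    """Measure tree depth iteratively (a recursive version would hit the
    same recursion limit on a degenerate tree)."""
    deepest, stack = 0, [(root, 1)]
    while stack:
        node, d = stack.pop()
        if node is None:
            continue
        deepest = max(deepest, d)
        stack.append((node.left, d + 1))
        stack.append((node.right, d + 1))
    return deepest

keys = [f"video{i:04d}" for i in range(1000)]

root = None
for k in keys:            # sorted order: every insert goes RIGHT
    root = insert(root, k)
print(max_depth(root))    # 1000 -- a linked list wearing a tree costume

random.shuffle(keys)
root = None
for k in keys:            # shuffled order: roughly balanced
    root = insert(root, k)
print(max_depth(root))    # typically a few dozen levels
```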

I got three-quarters of the way through changing over to a weighted BST before realizing that it would not improve the load times; it would only help the checking behavior later. That code was scrapped without a commit. I had previously added weighted trees with rebalancing to jdupes, but removed them when repeated testing showed no performance improvement.

How do you optimally feed sorted data into a BST? Ideally, you’d put the data into a list, add the middle element, then keep taking the midpoints of the left and right halves alternately until you ran out of data to add (see this excellent tutorial for a much better explanation with pictures). Unfortunately, this requires a lot of bookkeeping: you have to track which ranges you’ve already added, and the number of ranges to track doubles at every level. I decided to try something a little simpler: split the list into halves, then alternate between consuming each half from both ends until the pointers met. Thus, the third commit dropped.
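For reference, the midpoint method is trivial to express recursively; the painful part is doing it iteratively while tracking every sub-range. A sketch of the recursive form (which is not what I tried):

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def build_balanced(sorted_keys):
    """Build a balanced BST from sorted data: the midpoint becomes the
    root, and the halves on either side become the subtrees. Recursion
    depth is only about log2(n), so this is safe even for huge lists.
    (The slicing copies data; fine for a sketch, wasteful in production.)"""
    if not sorted_keys:
        return None
    mid = len(sorted_keys) // 2
    node = Node(sorted_keys[mid])
    node.left = build_balanced(sorted_keys[:mid])
    node.right = build_balanced(sorted_keys[mid + 1:])
    return node
```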

Python code to attempt to add a sorted list to a binary tree
This didn’t work.

This commit had two problems: the pointer checks caused two list elements to never be copied, and the improvement in behavior was insufficient. It was faster, but we’re talking about an improvement best described as “it seemed to just hang before, but now it completes before I get mad and hit CTRL-C to abort.” Unacceptable, to be sure. I stepped away from the computer for a while and thought about the problem. The data ideally needs to be more random than sorted. Is there an easier way? Then it hit me.

I used ‘sort’ to randomize my archive contents. Problem: gone.

All I needed to do was randomize the list order, then add the randomized list the easy way (in-order). Could it really be that simple?! Yes, yes it can! And so it was that the fourth commit dropped (red is removed, green is added).

Python code with an elegant solution
Surprisingly simple solution, isn’t it?

This worked wonderfully in testing…until I tried to use it to download multiple videos. I made a simple mistake in the code because it was getting late and I was excited to get things finished up. See if you can find the mistake before looking at my slightly embarrassing final commit below.

Python code mistake

As I wrote this article, I realized that there was probably a cleaner way to randomize the list in Python. Sure enough, all of the code seen in that last commit can be replaced with a single call: random.shuffle(lines). And thus dropped my new final commit.

Python randomization loop replaced with one-liner
One-line built-in is better!
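Putting the final shape of the preload together, here is a hedged sketch (function and variable names are mine, not the patch's) of the read-once, shuffle, insert-in-order approach:

```python
import random

class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    """Iterative BST insert; returns the tree root."""
    if root is None:
        return Node(key)
    node = root
    while True:
        if key == node.key:
            return root
        side = "left" if key < node.key else "right"
        child = getattr(node, side)
        if child is None:
            setattr(node, side, Node(key))
            return root
        node = child

def load_archive_tree(path):
    """Read the archive once, strip each line, shuffle, then insert in order.
    Shuffling makes the pathological pre-sorted input vanishingly unlikely."""
    with open(path, encoding="utf-8") as fh:
        lines = [line.strip() for line in fh if line.strip()]
    random.shuffle(lines)  # the one-liner that replaced my manual loop
    root = None
    for line in lines:
        root = insert(root, line)
    return root
```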

I think the code speaks for itself, but if you’d like to see the end result, make sure you watch the video at the top of this article.


I posted this article to Reddit and got some helpful feedback from some lovely people. It’s obvious that I am not first and foremost a Python programmer and I didn’t even think about using Python sets to do the job. (Of course a person who favors C will show up to Python and implement low-level data structures unnecessarily!) It was suggested by multiple people that I replace the binary search tree with Python sets, so…I did, fairly immediately. Here’s what that looked like.

Using Python sets instead of a binary search tree
Code go bye bye
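A sketch of what the set-based approach boils down to (names are mine, not the actual patch): the entire tree, insert routine, and shuffle disappear into a set comprehension and an `in` test.

```python
def load_archive_set(path):
    """Set-based preload: read the archive once into a set of stripped lines."""
    with open(path, encoding="utf-8") as fh:
        return {line.strip() for line in fh if line.strip()}

def already_downloaded(archive, vid):
    """Hashed membership test: average-case O(1), no tree required."""
    return vid in archive
```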

The Python set-based implementation is definitely easier to write and seems to work well. If I did this again, I’d probably skip the BST-with-shuffle approach and just use sets. To compare performance, I ran a test using OBS Studio to capture output, then stepped frame by frame to find the beginning and end of the archive checking process: the set version took 9 seconds, while the BST version took 8 seconds. While the set version looks prettier, the BST version had already been merged upstream and appeared a bit faster, so the BST (despite probably offending some Python lovers) seemed here to stay. Then it turned out that I had made a testing mistake: I ran the set version with ‘python’ while the BST version was compiled into an executable. After compiling the set version into an executable as well, it takes about 6-7 seconds instead. Excuse me while I send yet another pull request!

If you have any feedback, feel free to leave a comment. The comment section is moderated, but fair.

Cute cloned dogs

A CHALLENGER APPEARS: “fclones”…fastest duplicate scanner ever? It’s complicated.

While perusing my link referrals on GitHub, I noticed this thread where my duplicate scanner jdupes was mentioned. I then noticed the comment below it:

There is also a much faster modern alternative to fdupes and jdupes: fclones. It searches for files in parallel and uses a much faster hash function than md5.

My response comment pretty much says it all, so I’m making that the entire remainder of this post.

I noticed that fclones does not do the byte-for-byte safety check that jdupes (and fdupes) does. It also relies exclusively on a non-cryptographic hash for comparisons. It is unsafe to rely on a non-cryptographic hash as a substitute for the file data, and comparisons between duplicate finders running in full-file comparison mode and those running in hash-and-compare mode are not appropriate. The benchmark on the fclones page ran jdupes 1.14 without the -Q option that disables the final byte-for-byte confirmation, so jdupes was doing a lot of extra work to avoid potential data loss that fclones skipped entirely.
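jdupes itself is written in C, but the safety step in question is easy to illustrate in a few lines of Python: even after hashes match, the candidate files are read and compared byte for byte before being declared duplicates.

```python
def files_identical(path_a, path_b, chunk_size=65536):
    """Byte-for-byte comparison, reading in chunks. This is the final
    confirmation that a hash-only comparison skips: two different files
    can (rarely) share a hash, but they cannot share every byte."""
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            block_a = a.read(chunk_size)
            block_b = b.read(chunk_size)
            if block_a != block_b:
                return False
            if not block_a:  # both files ended at the same point
                return True
```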

jdupes already uses a faster hash function than MD5 (xxHash64 as of this writing, previously jodyhash), and it is fairly trivial to switch to even faster hash functions if desired…but the fact is that once you switch from a cryptographic hash to any “fast hash” function, the hash function is rarely the bottleneck, especially compared to the I/O bottleneck of most consumer-grade hard drives and low-end SSDs. If everything to be checked is already in the buffer cache, the hash might become the bottleneck, but the vast majority of duplicate scans are performed on data that is not cached.

Searching for files in parallel is only an advantage if disk I/O is not a bottleneck, and you’ll notice that the fclones author performed the dedupe benchmarks on a 512GB NVMe SSD (presumably a very fast one, since it’s paired with a relatively recent Xeon) and an extremely fast multi-core, multi-threaded processor. Random read I/O carries a very small access-time penalty on a fast NVMe SSD, but an extremely large one on a traditional rotating hard drive or a RAID array composed of several hard drives. Multiple threads firing off reads against the same RAID array at the same time will slow even most RAID arrays to a single-digit-MB/sec death crawl. I understand that many people will be working with SSDs, and some duplicate scanner programs will be a better choice for SSDs, but the majority of computer systems still have spinning rust rather than flash-based disks.

It is strongly advisable to (A) run your own benchmarks on your specific workload and hardware, and (B) understand how to use the program within your own personal acceptable level of risk. Both of these are different for every different person’s needs.

UPDATE: I found another instance of the fclones author claiming jdupes being single-threaded makes it slow; to quote directly:

Unfortunately these older programs are single-threaded, and on modern hardware (particularly on SSDs) they are order of magnitude slower than they could be. If you have many files, a better option is to use fclones (disclaimer: I’m the author), which uses multiple threads to process many files in parallel and offers additional filtering options to restrict the search.

The points I’ve made above still stand. Unless you’re running the author’s enterprise-grade high-end hardware, your disk’s random access latency is the major limiting factor. I’d love to see what fclones does on something like a 24TB disk array. I’d wager, exactly as stated above, that 8 or 32 simultaneous I/O threads would bring the whole process to a death crawl. Perhaps I should bite the bullet and run the tests myself.

UPDATE 2: I was right. Benchmark article and analysis forthcoming.

Featured image licensed under CC-BY from Steve Jurvetson.

SubscribeStar Logo

jdupes 1.16.0: File Extension Filtering, And I Need Your Support

Please consider supporting me on SubscribeStar so I can continue to bring open source software and tutorial videos to you!

Over the past weekend, I implemented a feature in jdupes that was pretty strongly desired by the user community. Version 1.16.0 has the ability to filter by file extensions, either by excluding files with certain extensions or only scanning files with certain extensions. Now you can do things like scan only the JPEG files you dumped from your phone while ignoring all of the videos, scan a software project folder’s .c and .h files for duplicates while ignoring all of the others, or find all duplicates except for XML files.

In addition, I’ve cleaned up the extended filter (-X/--extfilter) framework and created an entirely separate help text section (jdupes -X help) that explains the extfilter options and behavior in great detail.

The extended filters are also cumulative, so specifying multiple filter options works as expected; for example, “jdupes -X noext=mp3 -X noext=aac -X size+=:1M” will exclude from consideration all files ending in .mp3 or .aac as well as all files 1MiB or larger.

Unfortunately, there is not currently a way to combine filters, i.e. exclude all files with a particular extension over a particular size. That may be a feature in the future, but right now, I’m trying to add some basic filter functionality that satisfies as many basic filtering use cases as possible with as little work as possible. In the case of the extension filter, it took me about 3-4 hours to code, test, and fix issues with the feature, then issue a new release. It was relatively easy to implement, and even includes the ability to scan a comma-separated list of extensions rather than requiring a separate option for every single extension you want to filter.
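To put the use cases above into concrete commands, here are a few sketches. The paths are hypothetical, and the exact option spellings (particularly `onlyext`) should be verified against `jdupes -X help` for your installed version:

```shell
# Scan only the .c and .h files in a software project folder for duplicates:
jdupes -r -X onlyext=c,h ~/src/myproject

# Find all duplicates except XML files:
jdupes -r -X noext=xml ~/documents

# Filters are cumulative: skip .mp3/.aac files and anything 1MiB or larger:
jdupes -r -X noext=mp3,aac -X size+=:1M ~/music
```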

Other features that filter on file paths will be more difficult to implement, but are highly desired by the community, so I have plans to implement those soon. The next fix, however, will be for the problematic BTRFS/XFS dedupe support that can’t dedupe (ioctl_fideduperange) large file sets properly.

Thanks as always for your continued support. More support means more time to work on my projects! I appreciate all of your help.

Manny the Martyr – Be That Way MP3 (public domain, CC0, royalty free music)

When the wonderful public domain music website finished its redesign, this song went missing on “Page 2” and it’s one of my favorite public domain songs. Links to it are getting hard to find, so I’ve uploaded it here. Right-click the link to download the song, or click the link to listen in your browser.

Manny the Martyr – Be That Way.mp3