Speeding up your code when multiple cores aren’t an option
The common advice when Python is too slow is to switch to a low-level compiled language like Cython or Rust.
But what do you do if that code is too slow?
At that point you might start thinking about parallelism: using multi-threading or multi-processing so you can take advantage of multiple CPU cores.
But parallelism comes with its own set of complexities; at the very least, some algorithms can only really work in a single-threaded way.
So what can you do?
As it turns out, there’s often still plenty of performance improvements you can get just by tweaking your low-level code.
As a real-world example, in this article we’ll go about optimizing Floyd-Steinberg error diffusion dithering.
The specific variant of the algorithm that we will implement converts a