Thursday, August 4, 2016

Few notes about the previous post

This rant is mostly directed at the commenters who claimed I hobbled the open source codecs (including my own!) by not selecting the "proper" settings:

Please look closely at the red dots. Those represent Kraken. Now, this is a log10/log2 graph (log10 on the throughput axis). Kraken's decompressor is almost one order of magnitude faster than Brotli's: specifically, it's around 5-8x faster, just from eyeing the graph. No amount of tweaking Brotli's settings is going to speed it up that much. Sorry everyone. Just to be sure, I've benchmarked Brotli at settings 0-10 (11 is just too slow) overnight, and I'll post the results tomorrow.
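For the scale-minded: on a log10 axis, a 5-8x gap spans roughly 0.7-0.9 of a decade, which is why it reads as nearly an order of magnitude. A quick check:

```python
import math

# How a 5x-8x throughput gap reads on a log10 axis, in decades
# (1.00 decades = exactly one order of magnitude).
for ratio in (5, 8, 10):
    print(f"{ratio}x = {math.log10(ratio):.2f} decades")  # 0.70, 0.90, 1.00
```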

There is only a single executable file. The codecs are statically linked into this executable. All open source codecs were compiled with Visual Studio 2015 with optimizations enabled. They all use the same exact compiler settings. I'll update the previous post tomorrow with the specific settings.

I'm not releasing my data corpus; Squeeze Chart doesn't release theirs either. This is to prevent codec authors from tweaking their algorithms to perform well on a specific corpus while neglecting general purpose performance. It's just a large mix of data I found over time that was useful for developing and testing LZHAM. I didn't develop this corpus with any specific goals in mind, and it just happens to be useful as a compressor benchmark. (The reasoning goes: if it was good enough to tune LZHAM, it should be good enough for newer codecs.)


  1. Hi Rich, I liked your benchmark. Kraken and the other Oodle codecs seem like a reset of expectations, a new baseline.

    Count me as another polite nudge to move Brotli's levels around and see what happens. Yes, Kraken is 5-8x faster than Brotli 10, and it's unlikely that Brotli can make up that ground, but compression codecs and benchmarks are full of surprises, and we won't know what we're looking at until we thoroughly exercise Brotli and the others.

    I understand that you can't release your corpus (and I agree with you, on your other post, that the typical enwik8/9 approach is bogus). But we really need the details for everything else in your benchmark. For example, I assume you used the latest versions of brotli, zstd, etc. but I can't be sure because you don't list the versions. As people come to this post in future months and years, those version numbers will become increasingly important, since otherwise readers won't know what exactly you tested (and they probably won't be able to look at the post date and infer what releases all these codecs were on at that date).

    The compiler options you used would be very helpful. You mention "optimizations", but I don't know what exactly you're referring to. Did you use /arch:AVX2? Did you run this on an AVX2 capable system? Visual Studio is getting better at auto SIMD, and some of these codecs will benefit from SIMD more than others.

    I wouldn't bother with zlib -9 since no one uses that and it slows zlib dramatically. zlib -6 is the default on web servers, and is much faster to encode. (Granted it won't make much difference for decode.) And zlib -1 is much faster, and gives a more useful old-school tradeoff reference than zlib -9.
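    To illustrate the level tradeoff (a toy sketch using Python's zlib binding, which wraps the same zlib library; the input is made-up filler, not the benchmark corpus):

    ```python
    import time
    import zlib

    # Made-up compressible payload; stands in for real corpus data.
    data = b"A benchmark needs varied but compressible input data. " * 2000

    for level in (1, 6, 9):
        t0 = time.perf_counter()
        packed = zlib.compress(data, level)
        dt = time.perf_counter() - t0
        assert zlib.decompress(packed) == data  # round-trip sanity check
        print(f"level {level}: {len(packed)} bytes in {dt * 1000:.2f} ms")
    ```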

    I don't think I know anything about a compression codec until I know its memory and CPU usage during decompression. It could decode at 3 petabytes per second, but that's useless to me if it needs half a petabyte of RAM. Some of these codecs use gigabytes of RAM during encode, and it's unclear how much they use to decode. (Mahoney only reports encode RAM in his benchmark.) In the real world, this matters a lot. Obviously mobile is the modal context now; smartphone CPUs like to throttle down for thermal reasons, and nobody likes big RAM and CPU spikes anyway. In your next round, it would be awesome to see RAM and CPU use for decode (and for encode).
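    As a rough sketch of the kind of decode-side measurement I mean (Python's tracemalloc over zlib here, purely illustrative; tracemalloc only sees buffers allocated through Python's allocator, not a codec's internal C allocations, so it's a floor, not the true working set):

    ```python
    import tracemalloc
    import zlib

    # Made-up payload standing in for real corpus data.
    data = b"Decode-side memory usage matters on mobile. " * 5000
    packed = zlib.compress(data, 6)

    tracemalloc.start()
    out = zlib.decompress(packed)
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    assert out == data
    # Only Python-side allocations (e.g. the output bytes object) are
    # traced here, so this understates a real codec's working set.
    print(f"peak traced memory during decode: {peak} bytes")
    ```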

    For the codecs aimed at the web, like brotli and maybe zstd and potentially Kraken, it would be good to see how they perform in their natural habitats – as modules or plugins in nginx, IIS, Apache, Caddy, etc. I don't trust standalone benchmarks when the job is not standalone. Granted, setting up web servers will be a lot more work. Maybe I'll do it.

    1. Hi Joe - I'm working on generating new Brotli charts, at a variety of levels. So far, the data I'm seeing shows that level 2 decodes a little faster than level 10, but it's clearly still nowhere near Kraken's performance. Once I have all the data I'll put up another post.

    2. I've added compiler and codec version info to the previous post.

      Next Up: Brotli at settings 0-10 vs. Kraken.

    3. Cool! I see the AVX flag – that's good because so many benchmarks use vanilla make or compiler settings, which means the compiler is defaulting to 1999-era CPUs.

  2. Why the /fp:precise? Do any of the codecs need it? It could differentially slow things down for some. Brotli has some floats in the encoder but not the decoder. I have no idea how much of a difference it would make, but compilers are full of surprises, and relaxing the FP pedantry with /fp:fast might improve the performance of some codecs. (This might be possible even for codecs that don't use floats, depending on how the compiler is implemented; for example, if it makes broader optimization/compile-time tradeoffs based on which FP flag is set.)

    1. /fp:precise because that's the VS IDE's default setting. I'm hesitant to use /fp:fast - will the majority of users use this flag?

      FWIW, LZHAM purposely avoids floating point math in the encoder to sidestep any worries about FP precision.