Sunday, January 25, 2015

LZHAM v1.0 released on github


I haven't merged over the XCode project yet, but it's fully compatible with OSX now. Also, LZHAM v1.0 is not backwards compatible with the previous alpha version.

Saturday, January 24, 2015

Windows 10: An Arrow Aimed Straight at Steam

I find this very interesting news, and if you're not paying attention you should:

Phoronix: Windows 10 To Be A Free Upgrade: What Linux Users Need To Know

PC World: Windows 10's new features: Cortana, a 'Spartan' browser, Xbox streaming, and more

Rock, Paper, Shotgun: Is Windows 10 Good For PC Gamers Or XBone Owners?

I think Microsoft's strategy here is surprisingly well thought out. Their execs finally figured out "to know your enemy you must become your enemy". Windows 10 is free, has a new state of the art graphics API (DirectX v12) created by the best graphics specialists, real software engineers, and real testers in the business, awesome developer tools (Visual Studio) that actually work with real CPU/GPU debugging and profiling support all built in, all of your existing apps and games still work, and they're pulling out all the stops with the Halo/Xbox branding right down into the OS and browser.

They just need to make the Windows 10 App Store not suck: Continue to use their Xbox brand as a lever, carefully feed and nourish the ecosystem, listen to their customers, and undercut the living hell out of Steam. Steam itself started out as a total pile of crap, but they listened to their customers, fixed the problems over time, gave their customers good deals and shipped apps you couldn't get anywhere else at the right prices, and built and nourished the community. Microsoft can do all the same things, and perhaps they've finally figured this out.

It's now all down to execution, recovering from some obviously bone headed moves (sometimes fueled by excessive Redmond Kool-Aid drinking, like the botched Windows 8 UI and no Start Menu disasters), recognizing and quickly recovering from the inevitable new bone headed moves, and sustaining the effort over the long term (something Microsoft has definitely not been very good at except for their core brands). Competition is great.

Friday, January 23, 2015

LZHAM v1.0 is being tested on iOS/OSX/Linux/Win

Currently testing it on a few machines using random codec settings with ~3.5 million files. We also just switched over our title's bundle decompression step from LZMA to LZHAM, so the decompressor will be tested on many iOS devices too.

I've also tested the compression portion of the code on iOS, but I won't be able to get much coverage there before releasing it. Honestly the decompressor is much more interesting for mobile devices (that's really the whole point of LZHAM).

I'll be porting and testing LZHAM on Android within a week or so - should be easy by this point.

LZHAM v1.0 vs. LZMA decompression perf. on iPhone 6+

I borrowed a coworker's iPhone 6+ and reran my bundle compression benchmarking app. According to wikipedia, it's a 1.4 GHz dual-core ARMv8-A.

LZHAM is 2.3x-9x faster on this device, unless the bundle's compressed size is < 1000 bytes. The comp size threshold where LZHAM is faster is lower than what I'm used to seeing, not sure exactly why yet.

1. Bundles sorted by LZHAM vs. LZMA decompression speedup (slowest on left):

2. Bundles sorted by LZMA compressed size (smallest on left), with relative decompression speedup in blue:

Wednesday, January 21, 2015

First LZHAM iOS stats with Unity asset bundle data

Got everything (both the compressor and decompressor) working. Was surprisingly easy. Had 1 misaligned load to deal with in the compressor's match finder because of a #ifdef problem.

I combined together 3 of our larger Unity asset bundles together into a single .TAR file and here are the current results on my iPhone 4 (800 MHz A4 CPU - 512MB RAM):

LZHAM Compressed from 15209984 to 4999552 bytes
LZHAM Comp time: 112.771710, BPS: 134874.110168
LZHAM Decomp time: 0.895846, BPS: 16978346.767638

For compression, I used a 16MB dictionary, highest compression (level 4) with normal parsing. Compression is slow, but LZHAM is designed for offline use so as long as it works at all I'm not sweating it for now.

Decompression is around 47 cycles per byte on these bundles files, which contain a variety of Unity asset data.

Now LZMA stats (level 9 16MB dictionary, default tuning options):

LZMA compressed from 15209984 to 4726211 bytes
LZMA Comp time: 41.805776, BPS: 363824.942238
LZMA Decomp time: 1.993880, BPS: 7628334.723455

LZMA decompression was ~105 cycles/byte.

So LZHAM decompresses this data 2.2x faster. Its ratio is slightly lower, but this can be somewhat compensated for by enabling LZHAM's better parser and compressing offline (with a multicore desktop CPU). This helps a little: 4960935 bytes. By using more frequent Huff table updates (level 3 vs. the default 8) and extreme parsing, I get 4942383 compressed bytes, but decompression is ~18% slower. I'm going to graph all of this data next.

For reference, my iPhone 4's CPU is ~13.6x slower for compression and ~8.5x slower for decompression vs. my Core i7 3.3 GHz desktop CPU (comparing absolute wall time, no multithreading, same settings and file data, etc.).

Update: Here are the testing results after compressing & decompressing all of our uncompressed asset bundles on my iPhone 4. I limited LZHAM's compressor to a dictionary size of 8MB, less frequent table updating (table update speed of 12 vs the default 8), and normal parsing, which limited its ratio a bit vs. running it on desktop.

LZHAM is slower on a few files totaling ~.2% of the data (~320k out of 172MB), from there it rises to between 1.8x-4.8x faster. (Note I'm currently regenerating this graph so LZHAM's dictionary size matches LZMA's.)

1. Red=Speedup, Blue=LZMA compressed size, sorted by compressed size.

2. Red: Speedup, Blue: LZMA_comp_size/LZHAM_comp_size, sorted by speedup.

Tuesday, January 20, 2015

LZHAM v1.0 vs. LZMA decomp. perf on a large corpus of files

LZHAM isn't always faster than LZMA. LZHAM has a more expensive startup cost (which I've reduced a bunch since the alpha but it's still there), and it must update several large Huffman tables at periodic intervals. The previous alpha versions had way too many Huff tables, which really dragged the codec down on some files. The new version now only has a handful of tables, I've reduced the default table update interval, and you can now fine tune the update interval. This graph was generated at update speed 20 (least # of updates/fastest).

To visualize where LZHAM is faster or slower, I ran a test app on a corpus of 21,702 files, all >= 1024 bytes (to reduce the sheer number of them) and timed how long LZHAM vs. LZMA took to decompress each file. This is a mix of game assets from various titles, the usual standard corpus files (calgary etc.), XML/JSON/binary JSON/source files, random WAV/BMP/TGA/JPG/MP3's, executables+DLL's from popular installs, etc. Just random stuff.

I tossed any results where LZMA expanded the data because in these cases LZMA is up to ~60x slower than LZHAM. LZHAM has special handling for uncompressed data, and LZMA does not, so it just bogs down really badly in these cases. There are still many cases in this graph where LZMA just bogs down terribly on nearly uncompressible files, LZHAM can win massively in this case because it chooses which 512k blocks to store uncompressed.

Here's the resulting graph showing LZMA's vs. LZHAM's decompression time on a 3.3 GHz Core i7, sorted by the LZMA's compressed file size. The blue line is the speedup, where less than 1.0 means LZMA was faster, and greater than 1.0 means LZHAM was faster. 

The red line is the compressed file size on a log scale. This corpus has a ton of small (<4kb) files.

This graph shows than LZHAM v1.0 is pretty much always slower than LZMA if the compressed file size is <= ~2400 bytes. LZHAM can be only ~20% as fast in these cases. At around ~13k LZHAM is usually faster, and the greater the amount of compressed data the higher the likelihood that LZHAM is faster. You can estimate the threshold amount of original/source data if you know your data's average compression ratio.

Basically, LZHAM sucks on small blocks. I can make this somewhat better by reducing the startup cost, and optionally allowing the user to disable the huff table updating (or just slow it down even more). Another alternative is to have the compressor intelligently break up the input stream into a handful (or just 2) carefully chosen LZHAM blocks and issue a "update all huff tables now" command to the decompressor in between the block boundaries.

Note all of my timings include the time LZHAM takes to allocate its work memory and initialize its internal data structures. The dictionary size for LZHAM was always 64MB in this test, while the dict. size for LZMA was tuned to be the first pow2 >= the source file size, so it's possible LZHAM is at a bit of a disadvantage here due to extra memory allocation costs relative to LZMA. I'm running another test (in LZHAM's unbuffered mode) to find out if this makes any difference.

(Thanks to John Brooks for giving me some feedback on this graph.)

Update: Here's a new, less noisy graph with the following differences:
- Filtered out all files with a LZMA comp ratio less than 2% (because we know LZMA totally sucks at these low ratios)
- Switched LZHAM into unbuffered mode, for a minor decompression speed boost.

Monday, January 19, 2015

Parallelized download+decomp performance of various codecs

I finished porting and testing LZHAM v1.0 on OSX today. Everything works, including multithreaded compression. I've also tuned the Huffman table updating options more. Next stop is iOS.

The graphs in the previous post show the summation of load_time+decomp_time at various download rates. For large amounts of data (not small blocks), it makes sense to decompress each buffer of compressed data as it becomes available (from the disk or network) using streaming decompression, instead of waiting for all the compressed data to be fully downloaded first.

The following graphs show uncompressed_size / MAX(load_time, decomp_time) at various download rates on ~166MB of uncompressed Unity iOS asset bundle data compressed as a single .TAR file with various codecs. (Of course, in case it's not obvious, I'm assuming everywhere here that the data has already been pre-compressed. I am not including compression times anywhere here.)

My brain is tired, but this should roughly approximate the total amount of time it would take to deliver the uncompressed data (ignoring the effects of buffering and other overheads). I'm also assuming it's a single large stream of compressed data and you're not doing something fancy like parallelizing decompression.

The version of LZHAM referenced in this post is v1.0, which I'll be releasing on github hopefully this week (not the old alpha on Google Code).

Note: I used the regular LZ4 compressor for these graphs, *not* LZ4-HC which compresses slightly better.

1. 3.3GHz CPU, X axis scale=281.63 megabytes/sec.

Graph 1: On fast CPU's, up until 13 MB/sec. download rates LZMA is the winner (because its ratio is highest), then up to 95MB/sec. LZHAM v1.0 switches to the winner (because its ratio is almost as high as LZMA but it decompresses more quickly - at these rates LZMA is just too slow to keep up with the download rate). After than, good old Deflate (miniz's decompressor) is best up until ~130MB/sec., then LZ4 takes over from there.

2. 3.3GHz CPU, X axis scale=2.35 megabytes/sec. (zoomed version of above)

Graph 2: A zoomed in version of the left of graph 1 showing more detail at the slower downloads speeds. Graph 2 shows that on fast CPU's and slow networks, LZMA is the clear winner because all that matters to the overall delivery time is compression ratio. LZHAM is close but loses because it has a slightly lower ratio (usually 1-4% lower) on this data. The other codecs just have too low a ratio to compete at these low network speeds.

3. VERY SLOW CPU, X axis scale=4.7 megabytes/sec.

Graph 3: A much slower CPU (simulated 1/12th the performance of my desktop CPU - just multiplied the decomp time by 12). This graph is zoomed in to show detail at low network speeds. LZMA is the winner up to ~1MB/sec., then it plateaus because it's too slow to keep up on this slow ass CPU. LZHAM can sustain up to 3MB/sec, then Deflate takes over. From there LZ4 will eventually take over as the winner at the higher network rates because Deflate eventually won't be able to keep up. At the very highest network rates it makes more sense to not even bother using compression at all, because the CPU won't be able to keep up, even with LZ4.

My 2 cents: The "best" codec to use (where "best" minimizes the amount of time the user needs to wait for the full download) depends on the client's CPU speed, the codec's compression ratio, and the download (or disk) rate. A smart content server (that doesn't give a crap about how much data it actually sends over the network) could choose the best codec to use depending on these factors.

To minimize content delivery (download) time on fast desktop CPU's, use something like LZMA or LZHAM. Single threaded LZMA doesn't scale beyond ~13MB/sec. on my CPU, where LZHAM will scale up to 95MB/sec. but has a slight download time penalty (around 1-4%) at the slower download rates.

On slow CPU's, LZMA plateaus very early. Beyond this download rate, LZHAM v1.0 is the winner, then Deflate followed by LZ4.

Of course, we do care about how much data must be downloaded, due to ISP caps, monthly rate plans, etc. In this scenario, we need to stick to using a high ratio codec like LZMA or LZHAM. LZMA doesn't scale well on slow CPU's, so we need something like LZHAM that has a similar ratio but scales more effectively.

To ultimately improve the current state of the art, we need a new codec with higher ratios than LZMA that doesn't run like a complete dog, use massive amounts of RAM, or require large #'s of cores to be practical. I don't know of any open source codecs that fit these requirements yet. Somebody needs to write one.