Hacker News

I'd argue that bzip2 is a better example of a compression algorithm which no one needs anymore.

Considering these features:

  * Compression ratio
  * Compression speed
  * Decompression speed
  * Ubiquity
And considering these methods:

  * lzop
  * gzip
  * bzip2
  * xz
You get spectrums like this:

  * Ratio:    (worse) lzop  gzip bzip2  xz  (better)
  * C.Speed:  (worse) bzip2  xz  gzip  lzop (better)
  * D.Speed:  (worse) bzip2  xz  gzip  lzop (better)
  * Ubiquity: (worse) lzop   xz  bzip2 gzip (better)
So, xz, lzop, and gzip are all the "best" at something. Bzip2 isn't the best at anything anymore.


You can easily apply the same argument to xz here, by introducing something rarer with an even better compression ratio (e.g. zpaq6+). Now xz isn't the best at anything either.

But despite zpaq being public domain, few people have heard of it, and the Debian package is ancient, so the ubiquity argument really does count for something after all.


"This package has been orphaned, but someone intends to maintain it. Please see bug number #777123 for more information"

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=777123 - get in touch with the new owner of the package if you're interested. It's probably on their Never Ending Open Source To Do List.


No, xz (on a particular level setting) is both faster than bzip2 and provides better compression ratio, but zpaq is just slower.


>No, xz (on a particular level setting) is both faster than bzip2 and provides better compression ratio, but zpaq is just slower.

Are you implying that xz out-compresses zpaq? Can you supply a benchmark?

Here's one from me - http://mattmahoney.net/dc/text.html - showing a very significant compression ratio advantage to zpaq.


No, where did you find this implication in my comment? What I meant is that:

* xz is faster than bzip2 and provides better compression ratio [than bzip2]

* zpaq is slower [than bzip2 and provides better compression ratio than bzip2]

But it looks like I'm mistaken? It seems zpaq can actually be both faster than bzip2 and give a better compression ratio, can't it?


> So, xz, lzop, and gzip are all the "best" at something. Bzip2 isn't the best at anything anymore.

Two points:

(1) It's very, very easy for the best solution to a problem not to simultaneously be the best along any single dimension. If you see a spectrum where each dimension has a unique #1 and all the #2s are the same thing, that #2 solution is pretty likely to be the best of all the solutions. Your hypothetical example does actually make a compelling argument that bzip2 is useless, but that's not because it doesn't come in #1 anywhere; it's because it comes in behind xz everywhere. (Except ubiquity, but that's likely to change pretty quickly in the face of total obsolescence.)

(2) lzop, in your example, is technically "the best at something". But that something is compression and decompression speed, and if your only goal is to optimize those you can do much better by not using lzop (0 milliseconds to compress and decompress!). So that's actually a terrible hypothetical result for lzop.

Heck, zero compression easily wins three of your four categories.
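
To make that concrete, the null compressor is just a byte-for-byte copy; `cat` stands in for it in this sketch (file names are made up):

```shell
# "Zero compression": cat copies bytes untouched, so "compression" and
# "decompression" run at raw I/O speed, and the tool exists everywhere.
printf 'some data worth keeping' > input
cat input > output       # "compress"
cat output > restored    # "decompress"
cmp input restored && echo "round-trip OK"
```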


Zero compression is very often the correct choice these days.


No, even when speed matters, lz4 is sometimes the best answer. I wrote a data sync that worked over a 100 Mbps WAN, and running lz4 on the serialised data transferred far faster than the raw data. It's not just the network, either: you can often process data faster too (especially on spinning disks), since the reduction in disk I/O can in some cases speed up the processing itself.
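
A local sketch of that pattern, assuming the lz4 CLI is installed (directory and file names are made up): serialise, compress the stream in flight, then decompress and unpack, exactly as the two ends of a sync would.

```shell
# Source side: tar the data and compress the stream with lz4.
mkdir -p src && printf 'payload' > src/file.txt
tar -cf - src | lz4 -c > stream.lz4

# Destination side: decompress the stream and unpack it.
mkdir -p dest && lz4 -dc stream.lz4 | tar -xf - -C dest

cmp src/file.txt dest/src/file.txt && echo "sync OK"
```

Over a real WAN the middle file would be replaced by the network pipe (e.g. through ssh), which is where the bandwidth savings pay off.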


Being second-best on ratio and ubiquity is still pretty handy for serving files. It's compress once, decompress on somebody else's machine, so neither compression speed nor decompression speed matters much. Ratio saves you money, and ubiquity means people can actually use the file.


> It's compress-once, decompress on somebody else's machine, so neither of those matter.

Last week, there was a drive mount that was filling up at roughly 30Gb/hr. The contents of that mount were used by the web application. Deletion was not an option. Something that compressed quickly was needed. And on the retrieval end, when the web app needs to do decompression, seconds matter.


I found lz4 to be the best for general purpose analysis, it increased the throughput of my processing 10x compared to bz2. Then if you're working with very large files you can use the splittable version of lz4, 4mc, which also works as a Hadoop InputFormat. I just wish they would switch the Common Crawl archives to lz4.

I should probably mention the compression ratio was slightly worse than bz2 (maybe 15% larger archive) but for the 10x increase in throughput I didn't really mind that much. I could actually analyze my data from my laptop!


If I'm actually doing something with my data, gzip -1 beats out lz4 for streaming: gzip -1 can usually keep up with the slower of the in/out sides, gets a higher compression ratio than lz4, and compresses (though doesn't decompress) faster than lz4hc.


I just tested this on my laptop, using the first 5 million JSONLines of /u/stuck_in_the_matrix's Reddit dataset (~4.6GB).

For compression lz4 took ~22 seconds (~210 MB/s) and I got ~30% compression, gzip -1 took ~56 seconds (~80 MB/s) and I got ~22% compression.

For decompression lz4 gave me 500MB/s while gunzip gave me 300MB/s.

Commands used:

    lz4 -cd RS_full_corpus.lz4 | pv | head -5000000 | gzip -1 > test.gz

    gunzip -c test.gz | pv > /dev/null


    lz4 -cd RS_full_corpus.lz4 | pv | head -5000000 | lz4

    lz4 -cd stdin.lz4 | pv >/dev/null


Interesting; on a mix of source and binaries (archived fully-built checkouts) gzip -1 outperformed lz4 in compression ratio.


No, you're correct: gzip -1 outperformed lz4 in my test in compression ratio too. I don't know why I typed "30% compression" instead of "a compression ratio of 30%." Sorry about that.


FYI, this cool little project — https://code.google.com/p/miniz/ — implements faster gzip compression at level 1.


Last time I checked, the lz4 Python library did not have streaming decompression support. That will be a problem for larger files like Common Crawl if you aren't planning to pre-decompress before processing.


It's not a problem for me since I mostly use Java. However, you can probably just pipe in your data from the lz4 CLI, then feed that input stream to whatever Python parser you're using, and you should be fine.

The biggest problem is finding a parser that can do 600MB/s streaming parsing. If you use a command-line parser, don't try jq, even with GNU parallel.
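
The pipe approach suggested above might look like this, assuming the lz4 CLI is installed (the file name and the one-liner "parser" are made up for the demo):

```shell
# Create a small JSONLines sample and compress it.
printf '{"a":1}\n{"a":2}\n{"a":3}\n' > corpus.jsonl
lz4 -f corpus.jsonl corpus.jsonl.lz4

# Let the lz4 CLI do the decompression and stream plain text into any
# line-oriented parser over stdin; the Python side needs no lz4 binding.
lz4 -dc corpus.jsonl.lz4 | python3 -c 'import sys; print(sum(1 for _ in sys.stdin))'
# prints 3
```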


Being the best at something does not necessarily make it the best choice for most situations, as this example trivially shows. Assume that for each of the four measured aspects there is a program that is the best at it, but that in the other three aspects it is orders of magnitude worse than the best in each. Now consider another program which is best at nothing, but is 95% of the way to the best in every aspect. It's never best in any aspect, but it's clearly a good choice for many, if not most, situations.


Doesn't bzip2 have a concurrent mode that those others don't?


bzip2 can take advantage of any number of CPU cores when compressing.


Bzip2 doesn't handle multiple cores as far as I'm aware, but tools such as pbzip2 can. I wrote about this some time ago: https://hackercodex.com/guide/parallel-bzip-compression/

That said, parallel XZ is even better: https://github.com/vasi/pixz



I don't believe that's true, though there are multiple projects that offer that feature: lbzip2.org and compression.ca/pbzip2, to name a couple.


Really? How? My bzip2 has no option for it, and when I tested it, it stuck to one CPU. xz, on the other hand, has:

  -T, --threads=NUM   use at most NUM threads; the default is 1; set to 0
                      to use the number of processor cores
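
So with a recent enough xz (>= 5.2.0), parallel compression is just a flag; a quick sketch (the file is a stand-in):

```shell
head -c 1000000 /dev/zero > big.tar   # stand-in file for the demo
xz -T0 -k -f big.tar                  # -T0: one worker per core; -k keeps the original
xz -dc big.tar.xz | wc -c             # back to 1000000 bytes
```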


Side note: `xz` only got the -T option in a stable release less than 12 months ago (5.2.0, released 2014-12-21), so it hasn't made it into every distro yet.


The xz installed on my system carries a rather promising -T option, but then this text below it:

> Multithreaded compression and decompression are not implemented yet, so this option has no effect for now.


I believe you're looking for pbzip2, the parallel bzip2 file compressor. It has replaced plain bzip2 as my go-to compression tool.


lbzip2 is pretty good


pbzip2 output isn't universally readable by third-party bz2 decompressors (Hadoop, for example).


If you'd included the "zip" format in your analysis, gzip would not be the best at something anymore.


I use bzip2 purely for sentimental reasons.



